Abstract:
This paper describes a mechanism for automatically classifying static analysis warnings using machine learning. Static analysis is a technique for detecting potential vulnerabilities and bugs in source code. However, static analyzers often produce a large number of warnings, which include both true and false positives, and manually reviewing all of them is labor-intensive and time-consuming. The proposed classification mechanism achieved a precision above 93% with a recall of about 96% on a set of warnings generated by the industrial static analysis tool Svace on real-world projects. The dataset for the machine learning model is built from the warnings and the source code metrics obtained during static analysis of a project. The paper explores several approaches to feature selection and preprocessing for the classifier, taking into account the characteristics of different machine learning algorithms. Because the mechanism is efficient and independent of the programming language, it was integrated into Svace. Several integration approaches were considered in light of the analyzer's specifics, and the most convenient one was selected.
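
For readers less familiar with the two metrics quoted above, the following minimal Python sketch shows how precision and recall are computed for a binary warning classifier. The label encoding (1 for a true positive warning, 0 for a false positive) and the sample labels are illustrative assumptions; they are not taken from the Svace dataset, and the paper may define the positive class differently.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) with respect to the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    # Warnings correctly predicted as the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # Warnings predicted positive that actually belong to the other class.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Positive warnings that the classifier missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = true positive warning (real defect),
# 0 = false positive warning. These values are made up for illustration.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy sample the classifier makes one false alarm and misses one real defect out of five, so both metrics come out at 0.80. High precision means few warnings are wrongly marked as the positive class, and high recall means few warnings of the positive class are missed.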