VULNERABILITY ANALYSIS AND RISK ASSESSMENT
Data generation and annotation method for source code defect detection
GUAN Zhibin, WANG Xiaomeng, XIN Wei, WANG Jiajie
China Information Technology Security Evaluation Center, Beijing 100085, China
Abstract Existing deep-learning-based source code vulnerability detection methods train and test mostly on data sets derived from test source code written for academic research, which do not provide sufficient support for training deep learning models. This paper presents a data generation and annotation method for source code defect detection. The method extracts control flow relationships from the source code and uses trained deep learning models together with commercial tools to annotate the resulting source code slices. The public data sets SARD and NVD and the open-source project FFmpeg are used to verify the system's performance. The results show that the method can generate a source code defect data set suitable for deep learning, supporting deep-learning-based source code vulnerability detection methods.
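The abstract does not say how the verdicts of the trained model and the commercial tool are combined when a slice is annotated. As a rough illustration only, the sketch below assumes a simple agreement rule: a slice receives a label when both sources agree and is left unlabeled otherwise. The names `CodeSlice`, `model_score`, `tool_flagged`, `annotate` and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeSlice:
    """A source code slice awaiting annotation (hypothetical structure)."""
    code: str
    model_score: float  # assumed: defect probability from a trained model
    tool_flagged: bool  # assumed: whether a static-analysis tool reported a defect


def annotate(s: CodeSlice, threshold: float = 0.5) -> Optional[int]:
    """Return 1 (defective), 0 (clean), or None when the two sources disagree.

    The agreement rule is an assumption made for this sketch, not the
    paper's documented procedure.
    """
    model_says_defect = s.model_score >= threshold
    if model_says_defect == s.tool_flagged:
        return 1 if model_says_defect else 0
    return None  # disagreement: leave unlabeled, e.g. for manual review


slices = [
    CodeSlice("strcpy(buf, src);", model_score=0.9, tool_flagged=True),
    CodeSlice("strncpy(buf, src, n);", model_score=0.1, tool_flagged=False),
    CodeSlice("memcpy(dst, src, len);", model_score=0.8, tool_flagged=False),
]
labels = [annotate(s) for s in slices]
```

Under this assumed rule only the slices on which both sources agree enter the labeled data set; a different combination strategy (for example, trusting the tool alone) would change which slices are kept.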
Keywords
source code defect detection
control flow
data generation
data annotation
deep learning
Issue Date: 19 October 2021