Journal of Tsinghua University(Science and Technology)    2021, Vol. 61 Issue (11) : 1240-1245     DOI: 10.16511/j.cnki.qhdxxb.2021.21.005
VULNERABILITY ANALYSIS AND RISK ASSESSMENT
Data generation and annotation method for source code defect detection
GUAN Zhibin, WANG Xiaomeng, XIN Wei, WANG Jiajie
China Information Technology Security Evaluation Center, Beijing 100085, China
Abstract  Existing deep learning-based methods for source code vulnerability detection are trained and tested mostly on data sets derived from test source code written for academic research, which does not adequately support the training of deep learning models. This paper presents a data generation and annotation method for source code defect detection. The method extracts the control flow relationships of the source code and uses trained deep learning models together with commercial tools to annotate the source code slice data. The public data sets SARD and NVD and the open-source project FFmpeg are used to verify the system performance. The results show that the method can generate a source code defect data set suitable for deep learning, supporting deep learning-based source code vulnerability detection methods.
Keywords source code defect detection      control flow      data generation      data annotation      deep learning     
Issue Date: 19 October 2021
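The annotation step described in the abstract, in which a trained model and a commercial tool jointly label source code slices, might be sketched as follows. This is only an illustrative sketch: the abstract does not state how the two sources are combined, so the agreement rule, the `CodeSlice` type, the `annotate` function, and the `threshold` parameter are all assumptions introduced here, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeSlice:
    """Hypothetical container for one slice of source code."""
    file: str
    lines: tuple  # source line numbers covered by the slice
    text: str


def annotate(code_slice, model_score, tool_flagged_lines, threshold=0.5):
    """Label a slice from two sources (assumed agreement rule).

    Returns 1 (defective) when both the model and the tool flag the slice,
    0 (clean) when neither does, and None when they disagree, so that the
    slice can be set aside for manual review.
    """
    model_says = model_score >= threshold
    tool_says = any(line in tool_flagged_lines for line in code_slice.lines)
    if model_says and tool_says:
        return 1
    if not model_says and not tool_says:
        return 0
    return None


# Example: a slice covering lines 10-12 of a hypothetical file.
example = CodeSlice("decoder.c", (10, 11, 12), "memcpy(dst, src, n);")
label = annotate(example, model_score=0.9, tool_flagged_lines={11})
```

Requiring agreement trades coverage for label quality; other combination rules (for example, trusting the tool alone) are equally plausible readings of the abstract.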
Cite this article:   
GUAN Zhibin,WANG Xiaomeng,XIN Wei, et al. Data generation and annotation method for source code defect detection[J]. Journal of Tsinghua University(Science and Technology), 2021, 61(11): 1240-1245.
URL:  
http://jst.tsinghuajournals.com/EN/10.16511/j.cnki.qhdxxb.2021.21.005     OR     http://jst.tsinghuajournals.com/EN/Y2021/V61/I11/1240
[1] National Institute of Standards and Technology. Software assurance reference dataset[DB/OL].[2020-08-15]. https://samate.nist.gov/SRD/index.php.
[2] National Institute of Standards and Technology. National vulnerability database[DB/OL].[2020-08-15]. https://nvd.nist.gov/.
[3] KAMIYA T, KUSUMOTO S, INOUE K. CCFinder:A multilinguistic token-based code clone detection system for large scale source code[J]. IEEE Transactions on Software Engineering, 2002, 28(7):654-670.
[4] WHITE M, TUFANO M, VENDOME C, et al. Deep learning code fragments for code clone detection[C]//2016 31st IEEE/ACM International Conference on Automated Software Engineering. Singapore:IEEE, 2016:87-98.
[5] SAJNANI H, SAINI V, SVAJLENKO J, et al. SourcererCC:Scaling code clone detection to big-code[C]//Proceedings of the 38th International Conference on Software Engineering. New York, NY, USA:IEEE, 2016:1157-1168.
[6] WEI H H, LI M. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code[C]//Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. Melbourne, Australia:Elsevier, 2017:3034-3040.
[7] TANTITHAMTHAVORN C, MCINTOSH S, HASSAN A E, et al. An empirical comparison of model validation techniques for defect prediction models[J]. IEEE Transactions on Software Engineering, 2017, 43(1):1-18.
[8] D'AMBROS M, LANZA M, ROBBES R. Evaluating defect prediction approaches:A benchmark and an extensive comparison[J]. Empirical Software Engineering, 2012, 17(4):531-577.
[9] WANG X M, ZHANG T, XIN W, et al. Source code defect detection based on deep learning[J]. Transactions of Beijing Institute of Technology, 2019, 39:1155-1159. (in Chinese)
[10] ZHOU J, ZHANG H, LO D. Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports[C]//2012 34th International Conference on Software Engineering. Zurich, Switzerland:IEEE, 2012:14-24.
[11] QU L Y, JIA Y Z, HAO Y L. Automatic classification of vulnerabilities based on CNN and text semantics[J]. Transactions of Beijing Institute of Technology, 2019, 39:738-742. (in Chinese)
[12] BUCH L, ANDRZEJAK A. Learning-based recursive aggregation of abstract syntax trees for code clone detection[C]//2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering. Hangzhou, China:IEEE, 2019:95-104.
[13] DAM H K, PHAM T, NG S W, et al. A deep tree-based model for software defect prediction[Z/OL]. (2018-02-03). https://arxiv.org/abs/1802.00921.
[14] ALLAMANIS M, BROCKSCHMIDT M, KHADEMI M. Learning to represent programs with graphs[C/OL]. ICLR 2018. (2018-05-04). https://arxiv.org/abs/1711.00740.
[15] HARER J A, KIM L Y, RUSSELL R, et al. Automated software vulnerability detection with machine learning[Z/OL]. (2018-08-02). https://arxiv.org/abs/1803.04497.
[16] ZHANG J, WANG X, ZHANG H, et al. A novel neural source code representation based on abstract syntax tree[C]//Proceedings of the 41st International Conference on Software Engineering. Montreal, QC, Canada:IEEE, 2019:783-794.
[17] LI Z, ZOU D, XU S, et al. SySeVR:A framework for using deep learning to detect software vulnerabilities[Z/OL]. (2018-09-21). https://arxiv.org/abs/1807.06756.
[18] LI Z, ZOU D, TANG J, et al. A comparative study of deep learning-based vulnerability detection system[J]. IEEE Access, 2019, 7:103184-103197.
[19] LI Z, ZOU D, XU S, et al. VulDeePecker:A deep learning-based system for vulnerability detection[Z/OL]. (2018-01-05). https://arxiv.org/abs/1801.01681.
[20] YAMAGUCHI F, GOLDE N, ARP D, et al. Modeling and discovering vulnerabilities with code property graphs[C]//2014 IEEE Symposium on Security and Privacy. Berkeley, CA, USA:IEEE Computer Society, 2014:590-604.
Copyright © Journal of Tsinghua University(Science and Technology), All Rights Reserved.