Data generation and annotation method for source code defect detection

doi:10.16511/j.cnki.qhdxxb.2021.21.005


Abstract: In existing deep-learning-based source code defect detection methods, the training and test data mostly come from test source code intended only for academic research, which cannot provide sufficient data support for training deep learning models. This paper therefore proposes a data generation and annotation method for source code defect detection. Building on the extraction of control flow relationships from source code, the method applies trained deep learning models and commercial tools to annotate source code slice data. Validation on the public datasets SARD and NVD and on the open-source software FFmpeg shows that the method can generate source code defect detection datasets that are directly usable for deep learning, providing data support for deep-learning-based source code defect detection methods.

Key words: source code defect detection; control flow; data generation; data annotation; deep learning
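The annotation step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the outputs of the trained model and the commercial tool are combined, so the rule used here (label a slice as defective if either source flags it) is an assumption, and the names `CodeSlice`, `label_slice`, the score values, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeSlice:
    """A code slice: the file it came from and the source lines it covers."""
    file: str
    lines: frozenset  # line numbers included in the slice


def label_slice(code_slice, model_score, tool_findings, threshold=0.5):
    """Return 1 (defective) or 0 (clean) for one slice.

    Assumed rule (not taken from the paper): a slice is labeled defective
    when the model score reaches the threshold OR the tool reports a finding
    on any line that the slice covers.
    """
    flagged_by_tool = any(
        f_file == code_slice.file and f_line in code_slice.lines
        for f_file, f_line in tool_findings
    )
    return 1 if (model_score >= threshold or flagged_by_tool) else 0


slices = [
    CodeSlice("decode.c", frozenset({10, 11, 15})),
    CodeSlice("decode.c", frozenset({40, 42})),
    CodeSlice("util.c", frozenset({7, 8})),
]
model_scores = [0.91, 0.12, 0.30]    # hypothetical model outputs
tool_findings = {("decode.c", 42)}   # hypothetical (file, line) tool reports

labels = [label_slice(s, p, tool_findings) for s, p in zip(slices, model_scores)]
print(labels)  # [1, 1, 0]
```

A union rule like this favors recall; requiring agreement between the two sources would instead favor label precision. Which trade-off the authors chose is not stated on this page.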

Received: 2020-11-23 Published: 2021-10-19

Funding: National Natural Science Foundation of China (U1736110, U1836209, U1936211, U1836113, U1936101)

Cite this article:

GUAN Zhibin, WANG Xiaomeng, XIN Wei, WANG Jiajie. Data generation and annotation method for source code defect detection. Journal of Tsinghua University (Science and Technology), 2021, 61(11): 1240-1245.

Link to this article:

http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2021.21.005 or http://jst.tsinghuajournals.com/CN/Y2021/V61/I11/1240

