A malware detection framework combining API sequence features and statistical features

doi:10.16511/j.cnki.qhdxxb.2018.25.020


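Abstract: For behavior analysis of malicious samples, this paper proposes a combined machine learning framework. First, the dependencies among calls in the application programming interface (API) sequence are analyzed at the functional level to extract features, and a random forest is used for detection. Second, the ability of recurrent neural networks in deep learning to handle time-series data is exploited to learn from and classify the sequences directly, after preprocessing to remove redundant information. Finally, the two methods are combined. Experiments on malware samples show that both methods detect malicious samples effectively, but the combined approach performs better, reaching an AUC (area under the ROC curve) of 99.3%, which exceeds the results of existing comparable studies.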


Keywords: computer viruses and their prevention, malware detection, machine learning, deep learning, call sequence

Abstract: This paper presents a combined machine learning framework for malware behavior analysis. One part of the framework analyzes the dependency relations in the API call sequence at the functional level to extract features, which are used to train a random forest classifier. The other part uses a recurrent neural network (RNN), which is suited to time-series data, to learn directly from the API sequence after redundant information has been removed by preprocessing. Tests on a malware dataset show that both methods detect malware effectively, but the combined framework performs better, with an AUC of 99.3%.

Key words: computer virus and prevention; malware classification; machine learning; deep learning; call sequence
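
The pipeline outlined in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that redundancy removal means collapsing consecutive repeats of a single API call (the paper may remove longer repeated patterns), that statistical features are 4-gram counts, and that the two detectors are combined by a weighted average of their scores. The function names, the `weight` parameter, and the sample trace are all hypothetical.

```python
from collections import Counter
from itertools import groupby


def collapse_repeats(calls):
    """Keep one copy of each run of identical consecutive API calls."""
    return [name for name, _ in groupby(calls)]


def ngrams(calls, n=4):
    """Return every contiguous window of n API calls as a tuple."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def ngram_counts(calls, n=4):
    """Count n-grams after collapsing repeats (candidate forest features)."""
    return Counter(ngrams(collapse_repeats(calls), n))


def combine_scores(p_forest, p_rnn, weight=0.5):
    """Weighted average of two malware probabilities (assumed rule)."""
    return weight * p_forest + (1.0 - weight) * p_rnn


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        (1.0 if p > q else 0.5 if p == q else 0.0)
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical API trace, for illustration only.
trace = [
    "CreateFileW", "WriteFile", "WriteFile", "WriteFile",
    "CloseHandle", "RegOpenKeyExW", "RegSetValueExW",
]
features = ngram_counts(trace)
```

In this sketch the three consecutive `WriteFile` calls collapse to one before the 4-gram windows are taken, which shortens the sequence that either detector has to process. The pairwise `auc` function is quadratic in the number of samples and is meant only to make the metric concrete.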

Received: 2017-08-15 Published: 2018-05-15

CLC number: TP309.5

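Funding: Supported by the National Natural Science Foundation of China (61472046, U1536122); the Open Project of the Key Laboratory of Information Network Security, Ministry of Public Security (C17607); the Beijing Association for Science and Technology "Golden Bridge Project Seed Fund"; and the China Computer Federation (CCF)-NSFOCUS Kunpeng Fund (CCF-NSFOUS2017006)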

About the author: LU Xiaofeng (b. 1976), male, associate professor. E-mail: luxf@bupt.edu.cn

Cite this article:

LU Xiaofeng, JIANG Fangshuo, ZHOU Xiao, CUI Baojiang, YI Shengwei, SHA Jing. API based sequence and statistical features in a combined malware detection architecture. Journal of Tsinghua University (Science and Technology), 2018, 58(5): 500-508.

Link to this article:

http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.25.020 or http://jst.tsinghuajournals.com/CN/Y2018/V58/I5/500

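Figure 1. System framework

Table 1. 4-Gram subsequence extraction

Table 2. 4-Gram subsequence concatenation

Table 3. Removal of consecutive identical API patterns

Figure 2. Deep learning model structure

Figure 4. Screenshot of WannaCry running

Figure 3. Schematic of the AMHARA association analysis algorithm

Table 4. List of APIs called by WannaCry (first 10 items)

Table 5. Confusion matrix

Figure 5. LSTM experiment

Table 7. Effect of shortening the sequence length on training time

Table 8. AUC results for different classifications

Table 9. Random forest parameter optimization results

Figure 6. (Color online) Random forest regression experiment

Table 10. Accuracy comparison of different algorithms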


