Fake user detection based on hierarchical clustering

doi:10.16511/j.cnki.qhdxxb.2017.26.029

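Abstract: The Internet is flooded with malicious users, and Internet service providers typically have huge numbers of registered users, which makes it hard for a system to pick out fake accounts. Fake accounts that a malicious user registers in bulk tend to resemble one another. This paper proposes a system model for locating fake accounts in massive data sets. The model first pre-classifies the data by the character composition pattern of each username, then computes string similarity, measured as the Levenshtein distance, between the elements within each class. With a suitable threshold, hierarchical clustering then locates groups of fake accounts hidden in the registration data. Experimental results show that the model is effective and that, compared with existing models, it depends less on data dimensionality and data characteristics.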
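The Levenshtein distance used as the similarity measure counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal Python sketch of the standard dynamic-programming computation is given here for illustration; the function name and the sample usernames are assumptions, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute; unit cost)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


# Hypothetical usernames that differ only in the last digit.
print(levenshtein("user001", "user002"))  # 1
```

Bulk-registered names that share a template tend to have small pairwise distances of this kind, which is what makes a distance threshold usable for grouping them.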

Keywords: data security, fake accounts, machine learning, hierarchical clustering

Abstract: The Internet has many malicious users, and popular websites often have millions of registered accounts, so a system cannot easily distinguish fake accounts from legitimate ones. Fake accounts registered in bulk by a single malicious user often have similar profiles. This paper presents a framework for finding fake accounts among large numbers of users. The framework uses username string patterns to pre-classify the original data and then calculates the similarity, measured by the Levenshtein distance, between every pair of elements within a category. Hierarchical clustering with a suitable threshold then finds groups of fake accounts hidden in the registration data. Tests demonstrate the effectiveness of the framework, which relies less on data dimensions and features than existing models.
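The thresholded hierarchical clustering step can be sketched as bottom-up (agglomerative) merging that stops once the two closest clusters are farther apart than the threshold. The abstract does not state which linkage criterion the authors use, so the sketch below assumes single linkage; the function name, the distance matrix, and the threshold value are all hypothetical.

```python
def single_linkage_clusters(dist, threshold):
    """Agglomerative clustering over a symmetric distance matrix.

    Repeatedly merges the two closest clusters and stops when the smallest
    inter-cluster distance exceeds `threshold`. Single linkage is an
    assumption; the source does not name the linkage criterion.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical pairwise edit distances for four usernames in one category:
# items 0-2 differ from each other by one character, item 3 is unrelated.
dist = [
    [0, 1, 1, 7],
    [1, 0, 1, 7],
    [1, 1, 0, 7],
    [7, 7, 7, 0],
]
print(single_linkage_clusters(dist, threshold=2))  # [[0, 1, 2], [3]]
```

Under this reading, clusters with more than one member would be the candidate groups of fake accounts, while singletons are left alone. The pairwise search makes the sketch cubic in the number of items, which is one reason pre-classifying the data into smaller categories first is attractive.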

Key words: data security, fake accounts, machine learning, hierarchical clustering

Received: 2016-12-14   Published: 2017-06-15

CLC number: TP309.2

Cite this article:

FANG Yong, LIU Daosheng, HUANG Cheng. Detecting of fake accounts with hierarchical clustering. Journal of Tsinghua University (Science and Technology), 2017, 57(6): 620-624.

Link to this article:

http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2017.26.029 or http://jst.tsinghuajournals.com/CN/Y2017/V57/I6/620

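Fig. 1  Fake user detection system based on hierarchical clustering

Fig. 2  String pattern recognition algorithm

Table 1  Top 5 username character patterns in the Tianya data

Table 2  Experimental results

Table 3  Partial results for the string pattern LLLLLNNNNNNNNN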
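The pattern name in Table 3 suggests that each username is reduced to a string of character-class symbols before grouping. The sketch below assumes that L stands for a letter and N for a digit, which this page does not define explicitly; the S symbol for any other character and both function names are hypothetical additions.

```python
from collections import defaultdict


def username_pattern(name: str) -> str:
    """Encode a username as one class symbol per character.

    Assumed mapping: N for a digit, L for a letter, S for anything else.
    """
    symbols = []
    for ch in name:
        if ch.isdigit():
            symbols.append("N")
        elif ch.isalpha():
            symbols.append("L")
        else:
            symbols.append("S")
    return "".join(symbols)


def group_by_pattern(names):
    """Pre-classify usernames so that only names sharing a pattern are compared."""
    groups = defaultdict(list)
    for name in names:
        groups[username_pattern(name)].append(name)
    return dict(groups)


# Hypothetical usernames, for illustration only.
print(group_by_pattern(["ab12", "cd34", "x9"]))
# {'LLNN': ['ab12', 'cd34'], 'LN': ['x9']}
```

Grouping by pattern first keeps the later pairwise distance computation inside each class, rather than across the whole registration data set.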
