A semi-supervised noise filtering method for microblog topics

doi:10.16511/j.cnki.qhdxxb.2019.26.060


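Abstract (translated from the Chinese original): Social networks contain large amounts of spam, such as marketing and recruitment posts, as well as short texts with no substantive content. This interferes with topic modeling and seriously affects academic research and commercial applications built on social networks. This paper therefore proposes a semi-supervised topic noise filtering method that combines a support vector machine with a k-nearest-neighbor model (pSVM-kNN). The method fuses the SVM and kNN algorithms: starting from the hyperplane computed by the SVM, it uses kNN to iteratively search a local region for the optimal classification hyperplane. To reduce misclassification, a penalty cost is introduced in the SVM stage and proportional weights in the kNN stage, which improves the noise filtering. Experiments on Sina Weibo datasets of different sizes, compared against other methods, show that the method trains on only a small number of labeled samples and outperforms the compared methods in precision, recall, and F-measure.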
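The abstract gives only a high-level description of pSVM-kNN, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes a linear SVM trained by subgradient descent with a per-class penalty cost, a kNN vote weighted inversely to each class's share of the labeled set (one possible reading of "proportional weights"), and a simple self-training loop in which points near the hyperplane are labeled by kNN. All function names, parameters, the `band` threshold, and the toy 2-D points are hypothetical.

```python
import math
import random


def decision(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_svm(X, y, cost, lam=0.01, eta=0.05, epochs=300, seed=0):
    """Linear SVM via subgradient descent on a class-weighted hinge loss.

    cost maps each label (+1 noise, -1 normal) to its penalty (an assumption).
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * decision(w, b, X[i]) < 1
            g = cost[y[i]] * y[i] if violated else 0.0
            w = [wj - eta * (lam * wj - g * xj) for wj, xj in zip(w, X[i])]
            b += eta * g
    return w, b


def knn_vote(x, X, y, k=3):
    """kNN vote where each neighbor counts inversely to its class share."""
    share = {c: y.count(c) / len(y) for c in set(y)}
    nearest = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:k]
    score = sum(y[i] / share[y[i]] for i in nearest)
    return 1 if score > 0 else -1


def svm_knn_filter(X_lab, y_lab, X_unl, cost, rounds=3, band=1.0):
    """Self-training: SVM labels confident points, kNN those near the hyperplane."""
    labels = []
    for _ in range(rounds):
        # Retrain on the labeled set plus the previous round's pseudo-labels.
        X = list(X_lab) + (list(X_unl) if labels else [])
        y = list(y_lab) + labels
        w, b = train_svm(X, y, cost)
        labels = []
        for x in X_unl:
            s = decision(w, b, x)
            if abs(s) >= band:
                labels.append(1 if s > 0 else -1)
            else:
                # Inside the margin band: defer to the trusted labeled neighbors.
                labels.append(knn_vote(x, X_lab, y_lab))
    return w, b, labels


# Toy 2-D feature vectors (hypothetical): +1 = noise, -1 = normal post.
X_lab = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0), (-2.0, -2.0), (-3.0, -2.5), (-2.5, -3.0)]
y_lab = [1, 1, 1, -1, -1, -1]
X_unl = [(2.2, 2.8), (-2.4, -2.1), (0.3, 0.4), (-0.2, -0.5)]
cost = {1: 2.0, -1: 1.0}  # penalize missed noise more heavily (assumption)
w, b, labels = svm_knn_filter(X_lab, y_lab, X_unl, cost)
```

Only the small labeled set is trusted for the kNN vote here; the pseudo-labels feed back into SVM retraining, which is what makes the loop semi-supervised. Real microblog text would first need a feature representation, which the abstract does not specify.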


Keywords: social networks, support vector machine, k-nearest neighbor, noise filtering, penalty cost

Abstract: Social networking feeds often contain spam such as marketing posts, recruitment ads, and short posts without real content, which negatively affects user interest. The spam also seriously affects academic research and business applications. This paper presents an algorithm based on the pSVM-kNN model for filtering noise from Chinese microblog text. The method combines the SVM and kNN algorithms: starting from the hyperplane computed by the SVM, the kNN algorithm iteratively searches a local region for the optimal classification hyperplane. Penalty costs and proportional weights are introduced into the SVM and kNN stages, respectively, to reduce misclassification and improve the noise filtering. Tests on real Sina Weibo datasets of various sizes demonstrate that the precision and recall of this algorithm are significantly better than those of the other methods, with a remarkable improvement in the F-measure.
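For readers unfamiliar with the evaluation measures named in the abstract, the sketch below shows how precision, recall, and F-measure (here the balanced F1) are conventionally computed for a binary noise filter. The function name and the label vectors are hypothetical and are not taken from the paper's experiments.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class treated as positive (noise)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: +1 = noise, -1 = normal post.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, 1, -1, -1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
# 3 true positives, 1 false positive, 1 false negative: all three values are 0.75.
```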

Received: 2018-08-22  Published: 2019-03-19

Funding: National Natural Science Foundation of China (61332007, 61303049)

Cite this article:

屠守中, 杨婧, 赵林, 朱小燕. 半监督的微博话题噪声过滤方法[J]. 清华大学学报（自然科学版）, 2019, 59(3): 178-185.
TU Shouzhong, YANG Jing, ZHAO Lin, ZHU Xiaoyan. Filtering Chinese microblog topics noise algorithm based on a semi-supervised model[J]. Journal of Tsinghua University (Science and Technology), 2019, 59(3): 178-185.

Link to this article:

http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2019.26.060 or http://jst.tsinghuajournals.com/CN/Y2019/V59/I3/178

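Table 1: Examples of microblog noise

Figure 1: Training results with labeled samples

Figure 2: Training results with mixed samples

Figure 3: Hyperplane and sample points

Table 2: Distribution of microblog topics

Table 3: Distribution of the labeled and unlabeled sets

Table 4: Distribution of the proportion of noisy microblogs

Figure 4: Hyperplane and sample points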

