Filtering Chinese microblog topics noise algorithm based on a semi-supervised model
TU Shouzhong1, YANG Jing2, ZHAO Lin3, ZHU Xiaoyan1
1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 2. CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 3. State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
Abstract: Social networking feeds often contain large amounts of spam, such as marketing posts, recruitment notices, and short posts without real content, which erodes user interest. Such spam also seriously interferes with academic research and business applications. This paper presents a noise filtering algorithm for Chinese microblog text based on the pSVM-kNN model, which combines the support vector machine (SVM) and k-nearest neighbor (kNN) algorithms. Starting from the classification hyperplane computed by the SVM, the kNN algorithm iteratively searches the local neighborhood of that hyperplane for the optimal classification boundary. Penalty costs and proportional weights are introduced into the SVM and kNN stages to improve noise filtering and reduce misclassification. Tests on real Sina Weibo datasets of various sizes show that the precision and recall of this algorithm are significantly better than those of other methods, with a marked improvement in the F-measure.
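The general idea of pairing an SVM hyperplane with a local kNN decision can be illustrated with a minimal sketch. This is not the paper's pSVM-kNN implementation: the abstract does not give the algorithm's details, so the sketch assumes a linear hyperplane (w, b) that has already been trained, a hypothetical `margin` threshold that defines "near the hyperplane", and a hypothetical `spam_weight` that stands in for the proportional weights. The semi-supervised iteration is omitted. A small helper computes precision, recall, and the F-measure used in the evaluation.

```python
import math


def svm_score(w, b, x):
    """Signed score w.x + b for an already-trained linear hyperplane (assumed given)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def knn_vote(x, labeled, k, spam_weight):
    """Weighted vote among the k nearest labeled points.

    labeled is a list of (features, label) pairs with label 1 = spam, 0 = normal.
    Each spam neighbor counts spam_weight votes (a hypothetical stand-in for
    proportional weighting); each normal neighbor counts one vote.
    """
    nearest = sorted(labeled, key=lambda item: math.dist(x, item[0]))[:k]
    spam_votes = sum(spam_weight for _, label in nearest if label == 1)
    normal_votes = sum(1.0 for _, label in nearest if label == 0)
    return 1 if spam_votes > normal_votes else 0


def hybrid_predict(x, w, b, labeled, k=3, margin=1.0, spam_weight=1.0):
    """Trust the hyperplane far from the boundary; fall back to kNN near it."""
    score = svm_score(w, b, x)
    if abs(score) >= margin:
        return 1 if score > 0 else 0
    return knn_vote(x, labeled, k, spam_weight)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F-measure for the spam class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy two-dimensional data; real microblog features would be text-derived.
LABELED = [
    ((2.0, 2.0), 1),
    ((1.5, 2.5), 1),
    ((0.2, 0.1), 1),
    ((-2.0, -2.0), 0),
    ((-1.0, -1.5), 0),
]
W, B = (1.0, 1.0), 0.0
```

Calling `hybrid_predict((3.0, 3.0), W, B, LABELED)` uses the hyperplane alone, because the point lies well outside the margin, while `hybrid_predict((-0.3, -0.2), W, B, LABELED)` falls inside the margin and is decided by its three nearest labeled neighbors. Raising `spam_weight` makes the local vote more willing to flag borderline posts as spam, which trades precision for recall.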
TU Shouzhong, YANG Jing, ZHAO Lin, ZHU Xiaoyan. Filtering Chinese microblog topics noise algorithm based on a semi-supervised model[J]. Journal of Tsinghua University (Science and Technology), 2019, 59(3): 178-185. (in Chinese)