Journal of Tsinghua University(Science and Technology)    2019, Vol. 59 Issue (3) : 178-185     DOI: 10.16511/j.cnki.qhdxxb.2019.26.060
COMPUTER SCIENCE AND TECHNOLOGY
Filtering Chinese microblog topics noise algorithm based on a semi-supervised model
TU Shouzhong1, YANG Jing2, ZHAO Lin3, ZHU Xiaoyan1
1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
2. CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
3. State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
Abstract  Social network feeds often contain large amounts of spam, such as marketing posts, recruitment advertisements, and short posts with no real content, which reduces user interest. This spam also seriously hampers academic research and business applications. This paper presents an algorithm based on the pSVM-kNN model for filtering noise from Chinese microblog text. The method combines the support vector machine (SVM) and k-nearest neighbor (kNN) algorithms: starting from the hyperplane computed by the SVM, the kNN algorithm iteratively searches the local region around that hyperplane for the optimal classification hyperplane. Penalty costs and proportional weights are introduced into the SVM and kNN stages to improve noise filtering and reduce misclassification. Tests on real Sina Weibo datasets of various sizes show that the precision and recall of this algorithm are significantly better than those of other methods, with a marked improvement in the F-measure.
Keywords social networks      support vector machine      k-nearest neighbor      noise filtering      penalty cost     
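The general idea in the abstract, an SVM whose decisions near the hyperplane are handed to a kNN vote, can be illustrated with a minimal sketch. This is not the authors' pSVM-kNN: it is supervised, it makes a single pass rather than iterating, and every name and value in it is an assumption made for illustration (the functions `train_weighted_svm`, `knn_vote`, and `classify`, the `band` threshold, the per-class `cost` values, and the toy 2-D features). The per-class cost stands in for the penalty costs, and inverse-distance voting stands in for the proportional weights.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_weighted_svm(X, y, cost, epochs=500, lr=0.01, lam=0.01, seed=0):
    """Linear SVM fitted by subgradient descent on a class-weighted hinge loss.

    y holds labels in {-1, +1}; cost maps each label to the penalty applied
    when a sample of that class violates the margin (an assumed stand-in for
    the paper's penalty costs).
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (dot(w, X[i]) + b)
            # L2 regularisation shrinks the weights a little on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                step = lr * cost[y[i]] * y[i]
                w = [wj + step * xj for wj, xj in zip(w, X[i])]
                b += step
    return w, b


def knn_vote(x, X, y, k=3):
    """Distance-weighted vote among the k nearest labelled samples."""
    nearest = sorted((math.dist(x, xi), yi) for xi, yi in zip(X, y))[:k]
    score = sum(yi / (d + 1e-9) for d, yi in nearest)
    return 1 if score >= 0 else -1


def classify(x, w, b, X, y, band=0.5, k=3):
    """Trust the SVM far from the hyperplane; defer to kNN inside the band."""
    score = dot(w, x) + b
    if abs(score) >= band:
        return 1 if score > 0 else -1
    return knn_vote(x, X, y, k)


# Hypothetical toy features: +1 = spam, -1 = normal post.
X_train = [(2.0, 2.0), (2.5, 1.8), (1.8, 2.6), (3.0, 2.2),
           (0.0, 0.0), (0.5, -0.3), (-0.4, 0.6), (0.2, 0.4)]
y_train = [1, 1, 1, 1, -1, -1, -1, -1]

# Missing a spam post is penalised twice as heavily as flagging a normal one.
w, b = train_weighted_svm(X_train, y_train, cost={1: 2.0, -1: 1.0})
labels = [classify(p, w, b, X_train, y_train) for p in [(3.0, 3.0), (-0.5, -0.5)]]
```

Widening `band` sends more samples to the kNN stage, and setting it to zero reduces the sketch to a plain class-weighted linear SVM.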
Corresponding Author: ZHU Xiaoyan, Professor, E-mail: zxy-dcs@tsinghua.edu.cn
Issue Date: 19 March 2019
Cite this article:   
TU Shouzhong,YANG Jing,ZHAO Lin, et al. Filtering Chinese microblog topics noise algorithm based on a semi-supervised model[J]. Journal of Tsinghua University(Science and Technology), 2019, 59(3): 178-185.
URL:  
http://jst.tsinghuajournals.com/EN/10.16511/j.cnki.qhdxxb.2019.26.060     OR     http://jst.tsinghuajournals.com/EN/Y2019/V59/I3/178
Copyright © Journal of Tsinghua University(Science and Technology), All Rights Reserved.