Abstract: Existing manually segmented microblog-oriented corpora are inadequate, so both conventional Chinese word segmentation (CWS) systems and deep-learning-based CWS systems still perform poorly on microblog text. This paper presents an active learning method for microblog-oriented CWS that selects the samples with the highest annotation value from unlabelled tweets. A parameter is introduced to control the number of repeatedly selected samples, which often occur in microblog data. Three strategies (Max, Avg and AvgMax) are used to evaluate the overall value of each sample. The initial segmentation character is a stop character, which is computed by taking character embeddings into account. Experiments show that the method outperforms the baseline system, with F-score gains of 0.84% to 1.49%, and also outperforms the state-of-the-art active learning method, word boundary annotation (WBA).
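The sample-selection step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the three strategies or the repetition parameter, so everything here is an assumption made for illustration: Max is taken as the largest per-character annotation value in a sample, Avg as the mean, and AvgMax as the midpoint of the two; the names `sample_value`, `select_samples` and `max_repeats` are hypothetical, and `max_repeats` is one possible reading of the parameter that limits repeatedly selected samples.

```python
from statistics import mean


def sample_value(char_scores, strategy="max"):
    """Combine per-character annotation values into one value per sample.

    All three definitions are assumptions; the abstract only names them.
    """
    if not char_scores:
        return 0.0
    if strategy == "max":
        return max(char_scores)
    if strategy == "avg":
        return mean(char_scores)
    if strategy == "avgmax":
        # Assumed definition: midpoint of the average and the maximum.
        return (mean(char_scores) + max(char_scores)) / 2
    raise ValueError(f"unknown strategy: {strategy}")


def select_samples(pool, k, strategy="max", max_repeats=1):
    """Pick up to k texts with the highest overall value.

    pool is a list of (text, char_scores) pairs. At most max_repeats
    copies of an identical text are kept (hypothetical parameter).
    """
    ranked = sorted(
        pool,
        key=lambda item: sample_value(item[1], strategy),
        reverse=True,
    )
    seen = {}
    chosen = []
    for text, _scores in ranked:
        if len(chosen) >= k:
            break
        if seen.get(text, 0) >= max_repeats:
            continue  # skip a tweet that was already selected often enough
        seen[text] = seen.get(text, 0) + 1
        chosen.append(text)
    return chosen


# Toy pool: the per-character values are made up for illustration.
pool = [
    ("a", [0.2, 0.9]),
    ("a", [0.2, 0.9]),  # repeated tweet
    ("b", [0.5, 0.5]),
    ("c", [0.1, 0.1]),
]
top_two = select_samples(pool, k=2, strategy="max", max_repeats=1)
```

With `max_repeats=1` the duplicated tweet is selected only once, so the annotation budget is spent on distinct samples. Raising the value would allow frequent tweets to be annotated more than once.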