Journal of Tsinghua University (Science and Technology)  2018, Vol. 58 Issue (3): 260-265    DOI: 10.16511/j.cnki.qhdxxb.2018.26.011
Computer Science and Technology
λ-active learning based microblog-oriented Chinese word segmentation
ZHANG Jing (张婧), HUANG Degen (黄德根), HUANG Kaiyu (黄锴宇), LIU Zhuang (刘壮), MENG Xiangzhu (孟祥主)
School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Abstract: Manually segmented microblog-oriented corpora are still scarce, so both conventional Chinese word segmentation (CWS) systems and deep learning based CWS systems perform poorly on microblog text. This paper presents an active learning method that selects samples with higher annotation value from a large pool of unlabelled microblog posts. Because the same text often recurs in microblog data, a parameter λ is introduced into the active learning iterations to control how many repeated samples are selected, which keeps the selected samples diverse. Three strategies (Max, Avg and AvgMax) measure the overall annotation value of each sample from the uncertainty of its character labelling results and the diversity of its character contexts. In addition to the context of the current character, the initial segmenter used for active learning takes as a feature the likelihood that the current character is a stop character, computed automatically from character embeddings. Experiments show that the method improves the F-score over the baseline system by 0.84% to 1.49%, a larger gain than that of the state-of-the-art active learning method based on word boundary annotation (WBA).
Key words: word information processing; Chinese word segmentation; active learning; diversity of samples; microblog-oriented data
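The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the per-character scores, the definition of AvgMax (taken here as the mean of the Max and Avg scores), and the reading of λ as a cap on how many copies of an identical sample may be chosen in one iteration are all assumptions made for illustration. The names `select_batch`, `score_max`, `score_avg` and `score_avgmax` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

# A sample is (text, char_scores). char_scores[i] is an ASSUMED annotation
# value for character i, e.g. 1 - max label probability from the current
# segmenter. Every sample is assumed to contain at least one character.
Sample = Tuple[str, Sequence[float]]


def score_max(char_scores: Sequence[float]) -> float:
    """Max strategy: the sample is as valuable as its most valuable character."""
    return max(char_scores)


def score_avg(char_scores: Sequence[float]) -> float:
    """Avg strategy: mean value over all characters of the sample."""
    return sum(char_scores) / len(char_scores)


def score_avgmax(char_scores: Sequence[float]) -> float:
    """AvgMax strategy. ASSUMPTION: mean of the Max and Avg scores;
    the paper's actual definition may differ."""
    return 0.5 * (score_max(char_scores) + score_avg(char_scores))


STRATEGIES: Dict[str, Callable[[Sequence[float]], float]] = {
    "Max": score_max,
    "Avg": score_avg,
    "AvgMax": score_avgmax,
}


def select_batch(pool: List[Sample], batch_size: int, lam: int,
                 strategy: str = "AvgMax") -> List[str]:
    """Pick up to batch_size texts with the highest sample score, allowing
    at most lam copies of any identical text (assumed role of λ)."""
    score = STRATEGIES[strategy]
    # sorted() is stable, so ties keep their original pool order.
    ranked = sorted(pool, key=lambda s: score(s[1]), reverse=True)
    times_chosen: Counter = Counter()
    chosen: List[str] = []
    for text, _ in ranked:
        if len(chosen) == batch_size:
            break
        if times_chosen[text] >= lam:
            continue  # this text has already been selected lam times
        times_chosen[text] += 1
        chosen.append(text)
    return chosen


if __name__ == "__main__":
    pool = [
        ("转发微博", [0.9, 0.8, 0.9, 0.7]),
        ("转发微博", [0.9, 0.8, 0.9, 0.7]),
        ("转发微博", [0.9, 0.8, 0.9, 0.7]),
        ("今天天气好", [0.2, 0.6, 0.1, 0.3, 0.2]),
        ("哈哈", [0.1, 0.1]),
    ]
    print(select_batch(pool, batch_size=3, lam=1, strategy="Max"))
    print(select_batch(pool, batch_size=3, lam=2, strategy="Max"))
```

With a small λ, a frequently reposted text cannot fill the whole batch, so lower-ranked but distinct samples are selected instead; a larger λ lets more duplicates through.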
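Received: 2017-08-25      Published: 2018-03-15
CLC number (ZTFLH):  TP391.1
Funding: National Natural Science Foundation of China (61672127, 61672126)
Corresponding author: HUANG Degen, Professor, E-mail: huangdg@dlut.edu.cn
About the author: ZHANG Jing (born 1987), female, Ph.D. candidate.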
Cite this article:
ZHANG Jing, HUANG Degen, HUANG Kaiyu, LIU Zhuang, MENG Xiangzhu. λ-active learning based microblog-oriented Chinese word segmentation[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(3): 260-265.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.26.011  or  http://jst.tsinghuajournals.com/CN/Y2018/V58/I3/260
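  Fig. 1 Algorithm 1
  Fig. 2 Effect of the parameter on the context diversity of characters
  Table 1 Statistics of the training and test corpora
  Table 2 Segmentation results of the initial segmenter
  Fig. 3 Effect of the parameter λ on the active learning segmentation results
  Fig. 4 Segmentation results of the different selection strategies
  Table 3 Best F-scores of the segmentation results of the different methods
  Fig. 5 Methods for measuring the uncertainty of character labelling results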
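For readers unfamiliar with how segmentation F-scores such as those in Table 3 are usually obtained, the sketch below computes word-level precision, recall and F-score by comparing word spans. It assumes the common convention that a predicted word is correct only if both of its boundaries match a gold word; the page does not state which scoring script the authors used, and the function names `word_spans` and `segmentation_prf` are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F-score for one sentence.
    gold and pred are assumed to segment the same character sequence."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(pred)
    correct = len(gold_spans & pred_spans)  # words with both boundaries right
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    gold = ["我", "喜欢", "微博"]
    pred = ["我", "喜", "欢", "微博"]
    p, r, f = segmentation_prf(gold, pred)
    print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

In the example, two of the four predicted words match the three gold words, giving a precision of 1/2, a recall of 2/3 and an F-score of 4/7.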