Journal of Tsinghua University (Science and Technology), 2015, Vol. 55, Issue 8: 906-910, 915
Computer Science and Technology
Chinese word segmentation method for short Chinese text based on conditional random fields
LIU Zewen1, DING Dong1, LI Chunwen2
1. Institute of Microelectronics, Tsinghua University, Beijing 100084, China;
2. Department of Automation, Tsinghua University, Beijing 100084, China
Abstract: Chinese word segmentation is a prerequisite task for information retrieval. With the arrival of the big data era, information retrieval places ever higher demands on the precision and recall of Chinese word segmentation. This paper proposes a word segmentation method for short Chinese texts. The method first uses a conditional random field (CRF) model to label the characters of the short text with position tags, which yields a preliminary segmentation. It then corrects the preliminary result with a traditional dictionary-based segmentation method to complete the segmentation. To suit the characteristics of short Chinese text, the method optimizes both the choice of CRF tags and the design of the feature templates. Test results show that the method mitigates the loss of precision and recall that traditional dictionary-based segmentation suffers because of out-of-vocabulary words and overlapping ambiguities, and that it achieves F-scores above 0.95 on all four test corpora of the Sighan bakeoff 2005. The experiments indicate that the method is well suited to short-text Chinese word segmentation for information retrieval.
Key words: Chinese word segmentation; conditional random field (CRF); machine learning
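The two-stage idea in the abstract (tag-based preliminary segmentation, then dictionary-based correction) can be illustrated with a small sketch. It rests on assumptions that do not come from this page: the tags use the common four-tag B/M/E/S scheme rather than the paper's 5-tag scheme, the CRF output is supplied as a ready-made tag list instead of coming from a trained model, and the correction rule (greedily merging adjacent words whose concatenation is a dictionary entry) is a hypothetical stand-in for the paper's correction step. The function names `tags_to_words` and `merge_with_dictionary` are illustrative only.

```python
from typing import List, Set


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Decode per-character B/M/E/S tags into a list of words.

    Assumed scheme: B = word begin, M = middle, E = end, S = single-character word.
    """
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            # A new word starts while one is still open: flush the open word.
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Tag sequence ended without E or S: keep the trailing characters.
        words.append(current)
    return words


def merge_with_dictionary(words: List[str], lexicon: Set[str],
                          max_span: int = 3) -> List[str]:
    """Hypothetical correction step: merge up to max_span adjacent words
    when their concatenation is a dictionary entry, longest span first."""
    merged: List[str] = []
    i = 0
    while i < len(words):
        for span in range(min(max_span, len(words) - i), 1, -1):
            candidate = "".join(words[i:i + span])
            if candidate in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-word merge matched: keep the word unchanged.
            merged.append(words[i])
            i += 1
    return merged


# Usage: the tag list stands in for the output of a trained CRF.
preliminary = tags_to_words("清华大学学报", ["B", "E", "B", "E", "B", "E"])
corrected = merge_with_dictionary(preliminary, {"清华大学"})
```

Here `preliminary` holds the tag-derived words and `corrected` holds the result after the dictionary pass. A real system would also need a rule for splitting over-merged words, which this sketch leaves out.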
Received: 2015-04-18      Published: 2015-09-30
CLC number (ZTFLH): TP391.1
Cite this article:
LIU Zewen, DING Dong, LI Chunwen. Chinese word segmentation method for short Chinese text based on conditional random fields [J]. Journal of Tsinghua University (Science and Technology), 2015, 55(8): 906-910,915.
Link to this article:
http://jst.tsinghuajournals.com/CN/ or http://jst.tsinghuajournals.com/CN/Y2015/V55/I8/906
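  Fig. 1 LCCRF model
  Table 1 Three mainstream tagging schemes
  Table 2 The 5-tag scheme
  Table 3 Performance comparison of the three tagging schemes
  Table 4 Feature templates
  Table 5 Effect of the feature templates on segmentation results
  Table 6 Running time of the training program with multithreading
  Table 7 Test results on the Sighan bakeoff 2005 corpora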
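For readers unfamiliar with how segmentation results such as those in Table 7 are usually scored, the sketch below computes word-level precision, recall, and F-score by matching character spans between a gold segmentation and a predicted one. This is the conventional definition, assumed here for illustration. The page does not show the scoring procedure the authors used, and the names `word_spans` and `prf` are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmentation to the set of (start, end) character offsets of its words."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score); a predicted word counts as correct
    only if both of its boundaries match a gold word."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Usage: one of three predicted words matches one of two gold words.
p, r, f = prf(["清华大学", "学报"], ["清华", "大学", "学报"])
```

In this example only 学报 matches on both boundaries, so precision is 1/3 and recall is 1/2.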