Journal of Tsinghua University(Science and Technology), 2015, Vol. 55, Issue (8): 906-910, 915    DOI:
COMPUTER SCIENCE AND TECHNOLOGY
Chinese word segmentation method for short Chinese text based on conditional random fields
LIU Zewen¹, DING Dong¹, LI Chunwen²
1. Institute of Microelectronics, Tsinghua University, Beijing 100084, China;
2. Department of Automation, Tsinghua University, Beijing 100084, China
Abstract  Chinese word segmentation is a prerequisite for information retrieval. With the arrival of big data, information retrieval demands higher segmentation precision and recall. This paper presents a word segmentation method for short Chinese texts. The method first uses a conditional random field (CRF) model to label the words with special tags, which yields a preliminary segmentation. It then applies a traditional dictionary-based method to refine this preliminary result and complete the segmentation. Compared with the traditional method, this approach improves the recognition of out-of-vocabulary words and the resolution of overlap ambiguities, with F-scores above 0.95 on the 4 corpora of the Sighan 2005 bakeoff. Tests show that the method performs better on short-text Chinese word segmentation for information retrieval.
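The two-stage idea described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not name the tag set or say how the dictionary is applied, so the B/M/E/S tags, the greedy merge rule, the `max_span` limit, and all function names below are assumptions. The CRF model itself is omitted; the code starts from tags that such a model might emit, and ends with a word-level F-score of the kind reported for the Sighan bakeoff.

```python
def tags_to_words(chars, tags):
    """Decode per-character tags into words (assumed B/M/E/S tag scheme)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            # A new word starts here, so flush any unfinished word first.
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            # This character closes a word.
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


def merge_with_dictionary(words, dictionary, max_span=4):
    """Hypothetical refinement: greedily merge adjacent segments whose
    concatenation is a dictionary entry, trying the longest span first."""
    merged, i = [], 0
    while i < len(words):
        for span in range(min(max_span, len(words) - i), 0, -1):
            candidate = "".join(words[i:i + span])
            if span == 1 or candidate in dictionary:
                merged.append(candidate)
                i += span
                break
    return merged


def word_spans(words):
    """Map a segmentation to the set of (start, end) character offsets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def f_score(gold, predicted):
    """Word-level F-score: a word counts as correct only if both of its
    boundaries match the gold segmentation."""
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```

For instance, the tags S, S, B, E over the characters of 我爱北京 decode to 我 / 爱 / 北京. The merge step only joins segments and never splits them, so in this simplified form it cannot repair a CRF output that has wrongly glued two words together.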
Keywords  Chinese word segmentation; conditional random field (CRF); machine learning
Chinese Library Classification (CLC) number: TP391.1
Issue Date: 15 August 2015
Cite this article:
LIU Zewen, DING Dong, LI Chunwen. Chinese word segmentation method for short Chinese text based on conditional random fields[J]. Journal of Tsinghua University(Science and Technology), 2015, 55(8): 906-910, 915.
URL:  
http://jst.tsinghuajournals.com/EN/     OR     http://jst.tsinghuajournals.com/EN/Y2015/V55/I8/906
Copyright © Journal of Tsinghua University(Science and Technology), All Rights Reserved.