Chinese word segmentation method for short Chinese text based on conditional random fields

LIU Zewen, DING Dong, LI Chunwen

Journal of Tsinghua University(Science and Technology) ›› 2015, Vol. 55 ›› Issue (8) : 906-910,915.

PDF(1443 KB)
PDF(1443 KB)
Journal of Tsinghua University(Science and Technology) ›› 2015, Vol. 55 ›› Issue (8) : 906-910,915.
COMPUTER SCIENCE AND TECHNOLOGY

Chinese word segmentation method for short Chinese text based on conditional random fields

  • {{article.zuoZhe_EN}}
Author information +
History +

Abstract

Chinese word segmentation is a prerequisite for information retrieval. With the arrival of big data, information retrieval needs more precise word segmentation and recall. This paper presents a Chinese word segmentation method for short Chinese texts. The method first uses a conditional random field model to label the words with special tags to obtain preliminary results. Then, it uses the traditional dictionary-based method to improve the initial result to complete the word segmentation. This method improves recognition of “out of vocabulary” words and overlap ambiguities over the traditional method, with F-Scores over 0.95 with the 4 corpora of the Sighan 2005 bakeoff. Tests show that this method is better for short text Chinese word segmentation for information retrieval.

Key words

Chinese word segmentation / conditional random field (CRF) / machine learning

Cite this article

Download Citations
LIU Zewen, DING Dong, LI Chunwen. Chinese word segmentation method for short Chinese text based on conditional random fields[J]. Journal of Tsinghua University(Science and Technology). 2015, 55(8): 906-910,915

References

[1] SUN Xu, ZHANG Yaozhong, Matsuzaki T, et al. Probabilistic Chinese word segmentation with non-local information and stochastic training [J]. Information Processing & Management, 2013, 49(3): 626-636.
[2] Lafferty J, Mccallum A, Pereira F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data [C]// ACM, Proceedings of the 18th International Conference on Machine Learning. Williamstown, MA, USA: Scholarly Commons, 2001: 282-289.
[3] ZHANG Meishan, DENG Zhilong, CHE Wanxiang, et al. Combining statistical model and dictionary for domain adaption of Chinese word segmentation [J]. Journal of Chinese Information Processing, 2012, 26(2): 8-12.
[4] Fosler-Lussier E, HE Yanzhang, Jyothi P, et al. Conditional random fields in speech, audio, and language processing [J]. Proceedings of the IEEE, 2013, 101(5): 1054-1075.
[5] YANG Yanfeng, YANG Yanqin, GUAN Hu, et al. Out-of-vocabulary words recognition based on conditional random field in electronic commerce [J]. Lecture Notes in Computer Science, 2014, 8835: 532-539.
[6] Chellappa R, Fain A, Chellappa R, et al. Markov Random Fields: Theory and Applications [M]. San Diego, CA, USA: Academic Press Inc., 1993.
[7] CHEN Lei, LI Miao, ZHANG Jian, et al. A double-layer word segmentation combined with local ambiguity word grid and CRF [J]. Transactions on Computer Science & Technology, 2013 (1): 1-8.
[8] Ray A, Chandawala A, Chaudhury S. Character Recognition Using Conditional Random Field Based Recognition Engine [C]// IEEE, Proceedings of 12th International Conference on Document Analysis and Recognition. Washington DC, USA: IEEE Computer Society, 2013: 18-22.
[9] Sha F, Pereira F, Shallow parsing with conditional random fields [C]// ACL, Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Boston, MA, USA: Association for Computational Linguistics, 2003, 1: 134-141.
[10] Nocedal J, Updating quasi-Newton matrices with limited storage [J]. Mathematics of computation, 1980, 35(151): 773-782.
[11]ZHAO Hai, HUANG Changning, LI Mu. An improved Chinese word segmentation system with conditional random field [C]// ACL, 2006a. Proceedings of the Fifth Sighan Workshop on Chinese Language Processing. Sydney, Australia: Association for Computational Linguistics, 2006: 162-165.
[12]Tseng H, Chang P, Andrew G, et al. A conditional random field word segmenter for Sighan bakeoff 2005 [C]// ACL, Proceedings of the Fourth Sighan Workshop on Chinese Language Processing. Jeju Island, Korea: Association for Computational Linguistics, 2005: 168-171.
[13]PENG Fuchuan, FENG Fangfang, Mccallum A. Chinese segmentation and new word detection using conditional random fields [C]// Proceedings of Coling 2004. Genera, Switzerland, 2004: 562-568.
PDF(1443 KB)

Accesses

Citation

Detail

Sections
Recommended

/