COMPUTER SCIENCE AND TECHNOLOGY
Chinese word segmentation method for short Chinese text based on conditional random fields
LIU Zewen¹, DING Dong¹, LI Chunwen²
1. Institute of Microelectronics, Tsinghua University, Beijing 100084, China;
2. Department of Automation, Tsinghua University, Beijing 100084, China
Abstract: Chinese word segmentation is a prerequisite for information retrieval. With the arrival of big data, information retrieval demands higher segmentation precision and recall. This paper presents a Chinese word segmentation method for short Chinese texts. The method first uses a conditional random field (CRF) model to label the text with special tags, which yields a preliminary segmentation. It then applies the traditional dictionary-based method to refine this preliminary result and complete the segmentation. Compared with the traditional method, this approach improves the recognition of out-of-vocabulary words and the handling of overlap ambiguities, with F-scores above 0.95 on the four corpora of the SIGHAN 2005 bakeoff. Tests show that the method is better suited to segmenting short Chinese texts for information retrieval.
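As a rough illustration of the tagging and scoring steps mentioned in the abstract, the sketch below turns a per-character tag sequence into words and then computes precision, recall, and F-score over matching word spans. The abstract only says "special tags", so the B/M/E/S scheme (begin, middle, end, single) used here is an assumption, as are the function names `tags_to_words`, `word_spans`, and `prf`. The CRF model and the dictionary-based refinement are not reproduced.

```python
def tags_to_words(chars, tags):
    """Group characters into words from per-character B/M/E/S tags (assumed scheme)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


def word_spans(words):
    """Map a word list to the set of (start, end) character offsets it covers."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Precision, recall and F-score over exactly matching word spans."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


if __name__ == "__main__":
    gold = tags_to_words("我爱北京天安门", list("SSBEBME"))
    pred = ["我", "爱", "北", "京", "天安门"]  # hypothetical system output
    print(gold, prf(gold, pred))
```

In this toy case the prediction splits one gold word in two, so three of its five words match the four gold words, which lowers both precision and recall.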
Keywords
Chinese word segmentation
conditional random field (CRF)
machine learning
Issue Date: 15 August 2015