1. School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China;
2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract: Uyghur is a highly agglutinative language of the Altaic family with a very complex morphology. Names in Uyghur text come from many origins, which makes them difficult to analyze and recognize, and as a result no well-developed toolkit for Uyghur name recognition exists. An investigation of a large body of Uyghur text shows that 83% of all names are either Uyghur names or Chinese names. This work therefore focuses on these two kinds of names and gives a specific solution for recognizing each in Uyghur text: a letter-based fuzzy matching method for Uyghur names, and a syllable-character conversion method based on machine translation for Chinese names. Tests show that the approach achieves F1 scores of 91.84% for Uyghur names and 95.86% for Chinese names.
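To make the idea of letter-based fuzzy matching concrete, the following is a minimal sketch and not the authors' implementation. It assumes that similarity is measured by normalized Levenshtein (edit) distance over letters, which the abstract does not specify. The Latin-transliterated lexicon, the function names, and the 0.75 threshold are all hypothetical choices made for illustration.

```python
from typing import Iterable, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two letter sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def fuzzy_match(token: str, lexicon: Iterable[str],
                threshold: float = 0.75) -> Optional[Tuple[str, float]]:
    """Return the closest lexicon name if it clears the (assumed) threshold."""
    best_name, best_score = None, 0.0
    for name in lexicon:
        score = similarity(token, name)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return best_name, best_score
    return None


# Hypothetical, Latin-transliterated name lexicon for illustration only.
LEXICON = ["alim", "askar", "gulnar"]

if __name__ == "__main__":
    print(fuzzy_match("eskar", LEXICON))   # spelling variant of a listed name
    print(fuzzy_match("zhang", LEXICON))   # not close to any listed name
```

In this sketch a token that differs from a lexicon entry by one letter is still accepted, while an unrelated token is rejected. A real system would also have to handle the suffixes that Uyghur attaches to names, which this example does not attempt.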
Abdurahim Mahmoud, Hussein Yusuf, ZHANG Jiajun, ZONG Chengqing, Askar Hamdulla. Name recognition in the Uyghur language based on fuzzy matching and syllable-character conversion[J]. Journal of Tsinghua University (Science and Technology), 2017, 57(2): 188-196.