Judicial named entity recognition enhanced with semantic and boundary information
ZHANG Tianyu1, SUN Yuanyuan1, DU Wenyu2, XING Tiejun3, LIN Hongfei1, YANG Liang1
1. School of Computer Science, Dalian University of Technology, Dalian 116024, China; 2. Procuratorial Technology and Information Research Center, Supreme People's Procuratorate, Beijing 100726, China; 3. Neusoft Corporation, Dalian 116024, China
Abstract: [Objective] Named entity recognition (NER), a central task in information extraction, aims to precisely identify named entities of various types in text, including personal names, locations, and organization names. In the Chinese NER domain, deep learning techniques are crucial for character and word representation and for feature extraction, and they have yielded remarkable research achievements. Common deep learning approaches to NER include sequence labeling, span-based, generative, and table-based models. Nevertheless, the task suffers from a scarcity of lexical information, which is widely regarded as a primary obstacle to building high-performance Chinese NER systems. Although extensive lexical dictionaries encompassing rich word-boundary and semantic information have been compiled, effectively incorporating this lexical knowledge into the Chinese NER task remains a considerable challenge. In particular, seamlessly integrating the semantic information of matched words and their contextual cues into the Chinese character sequence is difficult, and accurately delimiting named entity boundaries remains a notable concern. In the realm of intelligent judicial systems, the NER task on legal documents has garnered significant attention. However, prevailing sequence labeling models rely predominantly on character information, which limits their capacity to capture semantic and lexical context and leaves entity boundary constraints inadequately addressed. To resolve these challenges, this paper introduces a semantic and boundary enhanced named entity recognition (SBENER) model. To enrich the semantic features of legal documents, SBENER incorporates external information consisting of vocabulary pertinent to theft crimes. First, word vectors for theft-crime terms are obtained through pretraining. Next, a vocabulary dictionary tree is constructed so that the potential word candidates for each character can be identified. These candidates are then aggregated into a final external information vector via a bilinear attention mechanism, and a linear gating structure is introduced to mitigate interference of the external information with the original text. To overcome the limitations of sequence labeling models in handling entity boundary constraints, this study designs a boundary pointer network within the model to constrain entity boundaries: the character sequence is encoded into hidden representations by a bidirectional long short-term memory network and then decoded to impose probability constraints on each entity span. Finally, the contextual and boundary information is fed into a conditional random field to obtain the final entity classification results. This design addresses both the loss of lexical information and the lack of boundary constraints in sequence labeling models. Experimental results on the CAILIE 1.0 and LegalCorpus datasets corroborated the effectiveness of the proposed method, which achieved F1 scores of 88.70% and 87.67%, respectively, surpassing the baseline models. In addition, ablation experiments were conducted to validate the effectiveness of each component.
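As an illustration only (not the authors' released code), the following minimal PyTorch sketch shows how the lexicon-enhancement step described above might look: theft-crime words matched through a dictionary trie are fused into a character representation with bilinear attention and a linear gate. All class names, variable names, dimensions, and the exact fusion formula are assumptions.

```python
# Minimal sketch, not the paper's implementation: trie-based lexicon matching
# plus bilinear-attention fusion with a linear gate. Names and dimensions are
# assumptions for illustration only.
import torch
import torch.nn as nn


def build_trie(words):
    """Build a nested-dict trie from a (hypothetical) theft-crime word list."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["#end"] = w  # mark a complete word
    return root


def match_candidates(text, i, trie):
    """Return all lexicon words that start at position i of the character sequence."""
    node, cands = trie, []
    for ch in text[i:]:
        if ch not in node:
            break
        node = node[ch]
        if "#end" in node:
            cands.append(node["#end"])
    return cands


class LexiconFusion(nn.Module):
    """Fuse candidate word vectors into a character vector (illustrative)."""

    def __init__(self, char_dim: int, word_dim: int):
        super().__init__()
        self.bilinear = nn.Bilinear(char_dim, word_dim, 1)  # attention scores
        self.proj = nn.Linear(word_dim, char_dim)
        self.gate = nn.Linear(char_dim * 2, char_dim)       # linear gating structure

    def forward(self, char_vec, cand_vecs):
        # char_vec: (char_dim,); cand_vecs: (num_candidates, word_dim)
        n = cand_vecs.size(0)
        scores = self.bilinear(char_vec.expand(n, -1), cand_vecs)      # (n, 1)
        weights = torch.softmax(scores, dim=0)                         # attention over candidates
        external = self.proj((weights * cand_vecs).sum(dim=0))         # external information vector
        g = torch.sigmoid(self.gate(torch.cat([char_vec, external])))  # gate in (0, 1)
        return char_vec + g * external                                 # gated fusion
```

In such a setup, match_candidates would be run at every character position and the matched words looked up in the pretrained theft-crime embedding table before fusion.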
The ablation results showed that integrating external semantic information related to theft crimes, strengthening entity boundary constraints through the pointer network, and using the gating mechanism to restrict the fusion of irrelevant information all contributed to higher F1 scores. Furthermore, this paper applied dimensionality reduction to the external semantic word vectors and analyzed fusion at different layers: single-layer fusion outperformed multilayer fusion, and fusion at intermediate layers yielded better results. These findings underscore the marked improvement in judicial NER achieved by the proposed approach. By fusing external information and reinforcing boundary constraints, the SBENER model effectively improves the recognition of named entities in legal documents and contributes to the advancement of intelligent judicial systems.
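For the boundary pointer network referred to above, a comparably hedged sketch (an assumed reading of the design, not the paper's implementation) is given below: a BiLSTM encodes the fused character representations, per-position start and end probabilities constrain entity spans, and the combined contextual and boundary features would then be decoded by a CRF.

```python
# Hedged sketch of a boundary pointer layer over BiLSTM states; the CRF step
# is indicated in a comment only. Names and dimensions are assumptions.
import torch
import torch.nn as nn


class BoundaryPointer(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.start_fc = nn.Linear(hidden_dim * 2, 1)  # start-of-entity score
        self.end_fc = nn.Linear(hidden_dim * 2, 1)    # end-of-entity score

    def forward(self, char_repr):
        # char_repr: (batch, seq_len, input_dim) fused character representations
        h, _ = self.bilstm(char_repr)              # (batch, seq_len, 2 * hidden_dim)
        p_start = torch.sigmoid(self.start_fc(h))  # boundary-start probabilities
        p_end = torch.sigmoid(self.end_fc(h))      # boundary-end probabilities
        # The contextual states and boundary probabilities would then be projected
        # to label scores and decoded with a CRF layer to obtain the final entities.
        return torch.cat([h, p_start, p_end], dim=-1)
```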
张天宇, 孙媛媛, 杜文玉, 邢铁军, 林鸿飞, 杨亮. 基于语义边界增强的司法命名实体识别[J]. 清华大学学报(自然科学版), 2024, 64(5): 749-759.
ZHANG Tianyu, SUN Yuanyuan, DU Wenyu, XING Tiejun, LIN Hongfei, YANG Liang. Judicial named entity recognition enhanced with semantic and boundary information[J]. Journal of Tsinghua University (Science and Technology), 2024, 64(5): 749-759.