Journal of Tsinghua University (Science and Technology)  2024, Vol. 64 Issue (5): 749-759    DOI: 10.16511/j.cnki.qhdxxb.2024.26.010
Special Section: Social Media Processing
Judicial named entity recognition enhanced with semantic and boundary
ZHANG Tianyu1, SUN Yuanyuan1, DU Wenyu2, XING Tiejun3, LIN Hongfei1, YANG Liang1
1. School of Computer Science, Dalian University of Technology, Dalian 116024, China;
2. Procuratorial Technology and Information Research Center, Supreme People's Procuratorate, Beijing 100726, China;
3. Neusoft Corporation, Dalian 116024, China
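Chinese abstract: Named entity recognition (NER) in legal documents is a key task in smart justice. Existing sequence labeling models attend only to character information, so in legal-document NER they cannot capture semantic information or the context of words, and they cannot constrain entity boundaries. This paper therefore proposes a judicial NER model that fuses external information and constrains boundaries (semantic and boundary enhanced named entity recognition, SBENER). The work collects 400,000 legal documents on theft crimes. First, a model is pretrained on them, and the resulting theft-crime word vectors serve as the external information fed to the model; second, an Adapter is designed to fuse the theft-crime information into the character sequence and enhance semantic features; finally, a boundary pointer network constrains entity boundaries, which addresses the loss of word information and the lack of boundary constraints in sequence labeling models. Experiments on the CAILIE 1.0 and LegalCorpus datasets show that SBENER reaches F1 scores of 88.70% and 87.67%, respectively, outperforming the other baseline models. SBENER improves named entity recognition in the judicial domain.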
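To make the boundary-constraint step more concrete, the following minimal sketch shows one way per-character start and end probabilities could be turned into candidate entity spans. It is not the authors' pointer network (which, per the abstract, decodes from BiLSTM hidden states); the function name `decode_spans` and the parameters `threshold` and `max_len` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def decode_spans(
    start_prob: List[float],
    end_prob: List[float],
    threshold: float = 0.5,  # hypothetical cut-off, not from the paper
    max_len: int = 10,       # hypothetical maximum span length in characters
) -> List[Tuple[int, int, float]]:
    """Pair each likely start with the nearest likely end at or after it.

    Returns (start_index, end_index, score) triples, where score is the
    product of the two boundary probabilities.
    """
    spans = []
    for i, p_start in enumerate(start_prob):
        if p_start < threshold:
            continue  # character i is unlikely to open an entity
        # Search to the right, within max_len characters, for an end boundary.
        for j in range(i, min(i + max_len, len(end_prob))):
            if end_prob[j] >= threshold:
                spans.append((i, j, p_start * end_prob[j]))
                break
    return spans
```

In a full model, span scores of this kind would be one input to the final tagging layer rather than the final prediction.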
Abstract: [Objective] Named entity recognition (NER), a central task in information extraction, aims to precisely identify named entities of various types in text, including personal names, locations, and organization names. In Chinese NER, deep learning techniques are crucial for character and vocabulary representation and for feature extraction, and they have yielded remarkable research results. Common deep learning approaches to NER include sequence labeling, span-based approaches, generative methods, and table-based strategies. Nevertheless, the task suffers from a scarcity of lexical information, which is regarded as a primary obstacle to high-performance Chinese NER systems. Although extensive lexical dictionaries with rich vocabulary boundaries and semantic information have been developed, incorporating this lexical knowledge effectively into the Chinese NER task remains a considerable challenge. In particular, it is difficult to integrate the semantic information of matched vocabulary and its contextual cues into the Chinese character sequence, and accurately delimiting named entity boundaries is still a notable concern. In intelligent judicial systems, NER in legal documents has attracted significant attention. However, prevailing sequence labeling models rely predominantly on character information, which limits their capacity to capture semantic and lexical context and leaves entity boundaries inadequately constrained. To resolve these challenges, this paper introduces a model called semantic and boundary enhanced named entity recognition (SBENER). To enhance the semantic features of legal documents, SBENER integrates external information containing vocabulary pertinent to theft crimes. First, word vectors for theft crime terms are acquired through pretraining.
Next, a vocabulary dictionary tree (trie) is constructed to identify the candidate words for each character. These candidates are then combined into a final external information vector via a bilinear attention mechanism. In addition, a linear gating structure is introduced to mitigate interference from external information with the original text. To overcome the limitations of sequence labeling models in handling entity boundary constraints, a boundary pointer network is designed within the model to constrain entity boundaries. The character sequence is embedded into hidden-layer representations by a bidirectional long short-term memory network and then decoded to introduce probability constraints for each entity span. Finally, contextual and boundary information is fed into a conditional random field to obtain the final entity classification results. This design addresses the loss of vocabulary information and the lack of boundary constraints in sequence labeling models. Experimental results on the CAILIE 1.0 and LegalCorpus datasets confirmed the effectiveness of the proposed method, with F1 scores of 88.70% and 87.67%, respectively, surpassing other baseline models. Ablation experiments were also conducted to validate the effectiveness of each component. The results showed that integrating external semantic information related to theft, strengthening entity boundary constraints through pointer networks, and using gating mechanisms to restrict the fusion of irrelevant information all contributed to higher F1 scores. Furthermore, dimensionality reduction was applied to the external semantic word vectors, and different fusion layers were analyzed experimentally. Single-layer fusion outperformed multilayer fusion, and fusion at intermediate layers yielded better results.
These results underscore the marked improvement in judicial NER achieved by the proposed approach. The SBENER model improves the recognition of named entities in legal documents by fusing external information and reinforcing boundary constraints, contributing to the advancement of intelligent judicial systems.
Key words: legal document; external law information; entity boundary; named entity recognition
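The lexicon-fusion steps named in the abstract (dictionary tree matching, bilinear attention over candidate words, and a gate on the fused vector) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`build_trie`, `match_words`, `fuse`), the convention that a word is a candidate for every character it covers, the assumption that character and word vectors share one dimension, and the exact gate form (a sigmoid over a linear function of the concatenated vectors) are all hypothetical.

```python
import math
from typing import Dict, List

END = ""  # end-of-word key; can never collide with a single character


def build_trie(words: List[str]) -> Dict:
    """Build a nested-dict trie from a list of lexicon words."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def match_words(sentence: str, trie: Dict) -> List[List[str]]:
    """For every character, collect the lexicon words that cover it."""
    covering: List[List[str]] = [[] for _ in sentence]
    for start in range(len(sentence)):
        node = trie
        for end in range(start, len(sentence)):
            node = node.get(sentence[end])
            if node is None:
                break  # no lexicon word continues with this character
            if END in node:
                for k in range(start, end + 1):
                    covering[k].append(node[END])
    return covering


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(h: List[float], word_vecs: List[List[float]],
         W: List[List[float]], gate_w: List[float], gate_b: float) -> List[float]:
    """Fuse a character vector h with its candidate word vectors.

    Attention scores are bilinear (h^T W v); the gate form is an assumption.
    """
    if not word_vecs:
        return list(h)  # no lexicon match: keep the character vector as is
    hW = [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]
    scores = [sum(a * b for a, b in zip(hW, v)) for v in word_vecs]
    attn = softmax(scores)
    # Attention-weighted sum of the candidate word vectors.
    z = [sum(a * v[j] for a, v in zip(attn, word_vecs)) for j in range(len(h))]
    # Scalar gate in (0, 1) that limits how much external information enters.
    pre = sum(w * x for w, x in zip(gate_w, h + z)) + gate_b
    g = 1.0 / (1.0 + math.exp(-pre))
    return [hi + g * zi for hi, zi in zip(h, z)]
```

For example, with the hypothetical lexicon `["被告", "被告人", "盗窃", "手机"]`, `match_words("被告人盗窃手机一部", build_trie(...))` assigns both 被告 and 被告人 to the first character and no words to 一.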
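Received: 2023-08-29      Published: 2024-04-22
Funding: National Key Research and Development Program of China (2022YFC3301801); Fundamental Research Funds for the Central Universities (DUT22ZD205)
Corresponding author: SUN Yuanyuan, professor, E-mail: syuan@dlut.edu.cn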
Cite this article:
ZHANG Tianyu, SUN Yuanyuan, DU Wenyu, XING Tiejun, LIN Hongfei, YANG Liang. Judicial named entity recognition enhanced with semantic and boundary. Journal of Tsinghua University (Science and Technology), 2024, 64(5): 749-759.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2024.26.010  or  http://jst.tsinghuajournals.com/CN/Y2024/V64/I5/749