Journal of Tsinghua University(Science and Technology)    2023, Vol. 63 Issue (9) : 1309-1316     DOI: 10.16511/j.cnki.qhdxxb.2023.21.010
BIG DATA
Two-stage open information extraction method for the defence technology field
HU Minghao, WANG Fang, XU Xiantao, LUO Wei, LIU Xiaopeng, LUO Zhunchen, TAN Yushan
Information Research Center of Military Science, PLA Academy of Military Science, Beijing 100142, China
Abstract  [Objective] The abundant information about defense technology available on the internet is a vital data source for obtaining high-value military intelligence. Open information extraction in the defense technology field aims to extract structured triplets containing a subject, a predicate, an object, and other arguments from this massive volume of online information, and it has important implications for ontology induction and knowledge graph construction in the domain. However, while open information extraction achieves good results in the general domain, it faces several challenges in the defense technology domain: a lack of domain-annotated data, an inability to handle overlapping arguments, and difficulty recognizing long entities. [Methods] This paper proposes an annotation strategy based on entity boundaries and, drawing on the experience of domain experts, constructs an annotated dataset for the defense technology field. Furthermore, a two-stage open information extraction method for this field is proposed that uses a sequence labeling algorithm based on a pretrained language model to extract predicates and a multihead attention mechanism to learn the prediction of argument boundaries. In the first stage, the input sentence is converted into the sequence <[CLS], input sentence, [SEP]> and encoded with a pretrained language model to obtain its hidden state representation. Based on this representation, a conditional random field (CRF) layer predicts the positions of the predicates, i.e., the BIO label of each word.
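The first stage described above can be illustrated with a minimal, dependency-free sketch. The encoder and CRF themselves are model-specific and omitted; the functions below (names are illustrative, not from the paper) show only the surrounding glue: building the stage-1 input sequence and turning a predicted BIO tag sequence into predicate spans.

```python
def build_stage1_input(sentence_tokens):
    """Wrap the tokenized sentence as <[CLS], input sentence, [SEP]>,
    the input format described for the predicate-extraction stage."""
    return ["[CLS]"] + sentence_tokens + ["[SEP]"]


def decode_bio(tokens, tags):
    """Collect tokens tagged B/I into predicate spans.

    `tags` would come from the CRF layer's prediction over the
    encoder's hidden states; here it is taken as given.
    """
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":               # a new predicate span starts
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:  # continue the open span
            current.append(token)
        else:                         # "O" (or stray "I") closes any open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["the", "system", "was", "developed", "by", "the", "institute"]
tags = ["O", "O", "B", "I", "O", "O", "O"]
print(decode_bio(tokens, tags))  # ['was developed']
```

Each decoded span then becomes the conditioning predicate for the second stage.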
In the second stage, each predicate predicted in the first stage is concatenated with the original sentence to form the sequence <[CLS], predicate, [SEP], input sentence, [SEP]>, which is encoded with a pretrained language model to obtain its hidden state representation. This representation is fed to a multihead pointer network that predicts the positions of the arguments, and the cross-entropy loss is computed between the predicted and actual positions. Finally, the predicates and arguments produced by the two extraction models are combined into complete triplets. [Results] Extensive experiments on a self-built annotated dataset in the defense technology field show the following. (1) In predicate extraction, the proposed method improves the F1 score by 3.92% over LSTM-based methods and by more than 10% over syntactic analysis methods. (2) In argument extraction, it improves the F1 score by more than 16% over LSTM-based methods and by about 11% over the BERT+CRF method. [Conclusions] The proposed two-stage open information extraction method overcomes the challenges of overlapping arguments and long-span entity extraction, addressing shortcomings of existing open information extraction methods. Extensive experimental analysis on the self-built annotated dataset demonstrates the effectiveness of the proposed method.
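The second stage can likewise be sketched without the neural components. Assuming each pointer head has already produced a start/end index pair for its argument role (e.g. by taking the argmax of its start and end logits), the sketch below (role names and function names are illustrative assumptions, not from the paper) builds the stage-2 input sequence, slices the argument spans out of the sentence, and assembles the final triplet.

```python
def build_stage2_input(predicate_tokens, sentence_tokens):
    """Wrap predicate and sentence as <[CLS], predicate, [SEP], input sentence, [SEP]>,
    the input format described for the argument-extraction stage."""
    return ["[CLS]"] + predicate_tokens + ["[SEP]"] + sentence_tokens + ["[SEP]"]


def decode_arguments(sentence_tokens, role_pointers):
    """role_pointers maps an argument role to an inclusive (start, end)
    index pair into the sentence, as predicted by that role's pointer head."""
    return {role: " ".join(sentence_tokens[s:e + 1])
            for role, (s, e) in role_pointers.items()}


def assemble_triplet(sentence_tokens, predicate, role_pointers):
    """Combine the stage-1 predicate with the stage-2 argument spans."""
    args = decode_arguments(sentence_tokens, role_pointers)
    return (args.get("subject"), predicate, args.get("object"))


tokens = ["the", "system", "uses", "an", "active", "radar", "seeker"]
pointers = {"subject": (0, 1), "object": (3, 6)}
print(assemble_triplet(tokens, "uses", pointers))
# ('the system', 'uses', 'an active radar seeker')
```

Because every predicate from stage 1 gets its own stage-2 pass, arguments of different predicates may overlap freely in the sentence, which is the mechanism by which the two-stage design handles overlapping arguments.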
Keywords: defense technology; open information extraction; subject-verb-object complement; knowledge graph; pretrained language model
Issue Date: 19 August 2023
Cite this article:   
HU Minghao, WANG Fang, XU Xiantao, et al. Two-stage open information extraction method for the defence technology field[J]. Journal of Tsinghua University(Science and Technology), 2023, 63(9): 1309-1316.
URL:  
http://jst.tsinghuajournals.com/EN/10.16511/j.cnki.qhdxxb.2023.21.010     OR     http://jst.tsinghuajournals.com/EN/Y2023/V63/I9/1309