Journal of Tsinghua University (Science and Technology), 2023, Vol. 63, Issue 9: 1390-1398. DOI: 10.16511/j.cnki.qhdxxb.2023.21.013
Computer Science and Technology
Authorship identification method based on the embedding of the syntax tree node
ZHANG Yang, JIANG Minghu
Computational Linguistics Laboratory, Department of Chinese, School of Humanities, Tsinghua University, Beijing 100084, China
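Abstract: Authorship identification is an interdisciplinary field that infers the authorship of an unknown text by analyzing its writing style. Existing research relies mostly on character and lexical features, while syntactic dependency information is rarely considered. This paper proposes an authorship identification method based on syntax tree node embedding: each node of the syntax tree is represented as the sum of the embeddings of all its dependency arcs, which brings dependency information into the deep learning model. A syntactic attention network is then constructed, and a syntax-aware vector is obtained through it. This vector fuses dependency, part-of-speech, and word information. A sentence attention network then produces the sentence representation, and a classifier makes the final prediction. In experiments on three English datasets, the proposed method ranks second or third. More importantly, the introduction of dependency syntactic combinations opens up more directions for interpreting the model.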
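The node representation described in the abstract, a syntax tree node expressed as the sum of the embeddings of its dependency arcs, can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes one embedding per dependency label, assumes that an arc counts toward both its head and its dependent, and uses a small random table in place of learned embeddings. The names `build_arc_embeddings` and `node_embeddings` are hypothetical.

```python
import random

def build_arc_embeddings(labels, dim, seed=0):
    """Give every dependency label a small random vector (stand-in for a learned table)."""
    rng = random.Random(seed)
    return {label: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for label in sorted(labels)}

def node_embeddings(num_tokens, arcs, arc_table, dim):
    """Represent each token (tree node) as the sum of the embeddings of its arcs.

    arcs: list of (head, dependent, label) with 0-based token indices.
    Assumption: an arc contributes to both its head and its dependent.
    """
    nodes = [[0.0] * dim for _ in range(num_tokens)]
    for head, dep, label in arcs:
        vec = arc_table[label]
        for idx in (head, dep):
            nodes[idx] = [a + b for a, b in zip(nodes[idx], vec)]
    return nodes

# "She reads books": "reads" is the root, with an nsubj arc to "She"
# and an obj arc to "books".
tokens = ["She", "reads", "books"]
arcs = [(1, 0, "nsubj"), (1, 2, "obj")]
DIM = 4
table = build_arc_embeddings({label for _, _, label in arcs}, DIM)
vectors = node_embeddings(len(tokens), arcs, table, DIM)
```

Here the root "reads" receives the sum of the `nsubj` and `obj` vectors, while each leaf receives only the vector of its single arc, so a node's vector reflects the syntactic role it plays in the sentence.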
Keywords: authorship identification; syntax tree node; dependency relation; attention mechanism
Abstract:
[Objective] Authorship identification infers the authorship of an unknown text by analyzing its stylometry, or writing style. Traditional research on authorship identification generally draws on empirical knowledge from literature or linguistics, whereas modern research mostly relies on mathematical methods to quantify an author's writing style. Researchers have proposed various feature combinations and neural network models. Some feature combinations achieve better results with traditional machine learning classifiers, while some neural network models learn the relationship between the input text and the corresponding author autonomously and thus extract text features implicitly. However, current research focuses mostly on character and lexical features, and the exploration of syntactic features is limited. How to use the dependency relationships between words in a sentence, and how to combine syntactic features with neural networks, remains unclear. This paper proposes an authorship identification method based on syntax tree node embedding, which introduces syntactic features into a deep learning model.

[Methods] We believe that an author's writing style is reflected mainly in how the author chooses words and constructs sentences. This paper therefore develops the authorship identification model from the perspectives of words and sentences, and uses the attention mechanism to construct sentence-level features. First, an embedding representation of the syntax tree node is proposed: each node is expressed as the sum of the embeddings corresponding to all its dependency arcs. This introduces information on sentence structure and the associations between words into the neural network model. Then, a syntactic attention network is constructed that uses different embedding methods to vectorize text features such as dependencies, part-of-speech tags, and words, and a syntax-aware vector is obtained through this network. Next, a sentence attention network extracts from the syntax-aware vector the features that distinguish different authors, generating the sentence representation. Finally, a classifier produces the result, which is evaluated by accuracy.

[Results] Experiments on the CCAT10, CCAT50, IMDb62, and Chinese novel datasets show that the accuracy of the proposed model trends downward as the number of authors increases. At some data points, however, an increase in the number of authors raised the accuracy instead of lowering it. This shows that the model's ability to capture writing style varies considerably across authors. Furthermore, when the number of authors on the IMDb dataset is varied, the accuracy of the proposed model is slightly lower than that of the BertAA model with 5 authors, but higher than that of BertAA with 10, 25, and 50 authors. When compared with other models on the CCAT10, CCAT50, and IMDb62 datasets, the proposed model ranks second or third.

[Conclusions] The attention mechanism proves efficient in text feature mining and can capture an author's style as reflected in different parts of a document. Integrating lexical and syntactic features through the attention mechanism enhances the overall performance of the model. The model performs well on different Chinese and English datasets. Notably, the introduction of dependency syntactic combinations provides more room for interpreting the model, which can explain the text styles of different authors at the levels of word selection and sentence construction.
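The attention and classification steps named in the abstract can be illustrated with a generic sketch. The abstract does not give the exact network, so this example makes its own assumptions: plain dot-product attention against a single context vector, followed by a linear softmax classifier over candidate authors. All vectors, weights, and function names below are hypothetical, hand-picked values, not the paper's model or trained parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, context):
    """Weight each vector by softmax(dot(vector, context)) and sum them."""
    scores = [sum(v * c for v, c in zip(vec, context)) for vec in vectors]
    weights = softmax(scores)
    dim = len(context)
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors))
              for d in range(dim)]
    return pooled, weights

def classify(sentence_vec, author_weights):
    """Linear layer followed by softmax: one weight row per candidate author."""
    logits = [sum(w * x for w, x in zip(row, sentence_vec))
              for row in author_weights]
    return softmax(logits)

# Hypothetical syntax-aware vectors for a three-token sentence.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [2.0, 0.0]  # hypothetical stand-in for a learned context vector
sentence_vec, weights = attention_pool(token_vectors, context)
author_probs = classify(sentence_vec, [[1.0, 0.0], [0.0, 1.0]])
```

The attention weights are what make such a model inspectable: tokens with larger weights contribute more to the sentence representation, which is one way a model of this kind can be examined at the word-selection level.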
Key words: authorship identification; node of the syntax tree; dependency; attention mechanism
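Received: 2022-04-25      Published: 2023-08-19
Funding: Key Program of the National Natural Science Foundation of China (62036001)
Corresponding author: JIANG Minghu, Professor, E-mail: jiang.mh@tsinghua.edu.cn
About the author: ZHANG Yang (1990-), male, Ph.D. candidate.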
Cite this article:
ZHANG Yang, JIANG Minghu. Authorship identification method based on the embedding of the syntax tree node[J]. Journal of Tsinghua University (Science and Technology), 2023, 63(9): 1390-1398.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2023.21.013  or  http://jst.tsinghuajournals.com/CN/Y2023/V63/I9/1390