Journal of Tsinghua University (Science and Technology)  2024, Vol. 64, Issue 4: 679-687    DOI: 10.16511/j.cnki.qhdxxb.2024.21.006
Computer Science and Technology
Self-training with partial labeling for multi-label text classification
REN Junfei, ZHU Tong, CHEN Wenliang
School of Computer Science and Technology, Soochow University, Suzhou 215006, China
Abstract: Multi-label text classification (MLTC), which selects one or more text-relevant categories from a predefined set of candidate labels, is a fundamental task in natural language processing (NLP). Most previous work relies on standardized, comprehensively annotated datasets, which require strict quality control and are generally difficult to obtain. In real annotation processes, some relevant labels are inevitably missed, leading to the problem of incomplete annotation. This paper proposes a partial labeling self-training framework for multi-label text classification (PST), in which a teacher model automatically assigns labels to large-scale unlabeled data and supplements the missing labels of incompletely annotated data, and these data are then used in turn to update the teacher model. Experiments on synthetic and real datasets show that the PST framework is compatible with various existing multi-label text classification models and can mitigate the impact of incompletely annotated data on the model.
Keywords: multi-label text classification; incomplete labeling; self-training
Abstract: [Objective] Multi-label text classification (MLTC) is a fundamental task in natural language processing that selects the most relevant labels from a predefined label set to annotate texts. Most previous studies have been conducted on standardized, comprehensively annotated datasets, which require strict quality control and are difficult to obtain. In real annotation processes, some relevant labels are inevitably missed, resulting in incomplete annotation. Missing labels affect a model in two main ways: 1) degradation effect: numerous missing labels reduce the number of positive labels associated with each text, so the model cannot learn comprehensive and complete information from the few relevant labels that remain; 2) misleading effect: numerous missing labels are treated during training as negative labels unrelated to the text, misleading the model into learning the opposite information. MLTC under incomplete annotation aims to learn classifiers for relevant labels from incompletely annotated datasets while minimizing the impact of missing labels on the model and improving classification performance. Existing MLTC methods all rely on supervised training over manually annotated data and therefore cannot handle missing labels. [Methods] This paper proposes the partial labeling self-training for multi-label text classification (PST) framework, which alleviates the negative impact of missing labels by recovering and exploiting them. Specifically, the PST framework first trains a base multi-label text classification model on the incompletely labeled dataset to obtain a teacher model. The teacher model then automatically scores large-scale unlabeled data as well as the incompletely labeled data. A dual-threshold mechanism next divides the candidate labels into three states according to their scores: positive, negative, and other. Finally, the teacher model is updated through joint training on the label information from these three states. To evaluate the PST framework comprehensively, we randomly deleted labels from the training set of the English dataset AAPD at different missing ratios, constructing synthetic incompletely annotated datasets with varying degrees of missing labels. We also manually corrected the incompletely annotated CCKS2022 Task 8 dataset and used it as the real dataset in our experiments. [Results] Experiments on the synthetic datasets showed that as the incomplete-annotation problem intensifies, the performance of multi-label text classification models drops sharply; the PST framework slows this decline to some extent, and the more labels are missing, the more pronounced the relief. Experiments with different multi-label teacher models on the real dataset showed that the PST framework improves each teacher model on incompletely annotated data to varying degrees, which demonstrates its generality. [Conclusions] The PST framework is a model-agnostic plug-in framework compatible with various teacher models. It fully exploits external unlabeled data to optimize the teacher model while recovering and using the missing labels in incompletely labeled data, thereby weakening the impact of missing labels on the model. The experimental results indicate that the proposed framework is general and can alleviate the impact of incomplete annotation to some extent.
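To make the dual-threshold mechanism in the [Methods] section concrete, the following Python sketch shows one plausible way to assign the three label states, supplement an incompletely labeled example with teacher predictions, and construct the synthetic missing-label training sets. Every name, threshold value, and encoding here (tau_pos, tau_neg, the POSITIVE/NEGATIVE/UNKNOWN states) is an illustrative assumption of ours, not the paper's released implementation.

```python
# Minimal sketch of dual-threshold pseudo-labeling for self-training with
# partial labels. Thresholds, names, and the three-state encoding are
# illustrative assumptions, not the paper's code.
import random
from typing import List

POSITIVE, NEGATIVE, UNKNOWN = 1, 0, -1  # three label states

def assign_label_states(scores: List[float],
                        tau_pos: float = 0.7,
                        tau_neg: float = 0.3) -> List[int]:
    """Split candidate labels into positive/negative/unknown by two thresholds.

    Labels the teacher scores above tau_pos become pseudo-positives, labels
    below tau_neg become pseudo-negatives, and the uncertain middle band is
    marked UNKNOWN so it can be masked out of the training loss instead of
    being treated as a hard negative (the 'misleading effect').
    """
    return [POSITIVE if s >= tau_pos else NEGATIVE if s <= tau_neg else UNKNOWN
            for s in scores]

def complete_partial_labels(observed: List[int], states: List[int]) -> List[int]:
    """Supplement an incompletely labeled example with teacher label states.

    Observed positives are always kept; every other label takes the
    teacher-assigned state, which can recover missing positives.
    """
    return [POSITIVE if o == POSITIVE else s for o, s in zip(observed, states)]

def drop_labels(labels: List[int], missing_ratio: float, seed: int = 0) -> List[int]:
    """Randomly delete a fraction of positive labels, mimicking the synthetic
    incompletely annotated AAPD training sets described in the abstract."""
    rng = random.Random(seed)
    out = list(labels)
    positives = [i for i, v in enumerate(out) if v == POSITIVE]
    for i in rng.sample(positives, int(len(positives) * missing_ratio)):
        out[i] = NEGATIVE  # a missing label looks like a negative to the model
    return out

if __name__ == "__main__":
    teacher_scores = [0.92, 0.55, 0.12, 0.81]  # hypothetical teacher outputs
    observed = [1, 0, 0, 0]                    # only one positive was annotated
    states = assign_label_states(teacher_scores)
    print(complete_partial_labels(observed, states))  # [1, -1, 0, 1]
```

In the full self-training loop, the positive and negative states would presumably supervise a binary cross-entropy loss while UNKNOWN entries are masked out, and the jointly trained model would then replace the teacher for the next round.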
Key words: multi-label text classification; incomplete labeling; self-training
Received: 2023-11-09      Published: 2024-03-27
Supported by the Key Joint Project of the National Natural Science Foundation of China (No. 61936010)
Corresponding author: CHEN Wenliang, professor, E-mail: wlchen@suda.edu.cn
About the first author: REN Junfei (born 2000), male, master's degree candidate.
Cite this article:
REN Junfei, ZHU Tong, CHEN Wenliang. Self-training with partial labeling for multi-label text classification. Journal of Tsinghua University(Science and Technology), 2024, 64(4): 679-687.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2024.21.006  or  http://jst.tsinghuajournals.com/CN/Y2024/V64/I4/679