Abstract: [Objective] Multi-label text classification (MLTC) is a fundamental task in natural language processing that selects the most relevant labels from a predefined label set to annotate a text. Most previous studies have been conducted on standardized, comprehensively annotated datasets, which require strict quality control and are difficult to procure. In real annotation processes, some relevant labels are inevitably omitted, resulting in incomplete annotation. Missing labels affect the model in two main ways: 1) degradation effect: with many labels missing, the number of positive labels associated with a text decreases, so the model cannot learn comprehensive and complete information from the few remaining relevant labels; 2) misleading effect: missing labels are treated during training as negative labels unrelated to the text, misleading the model into learning the opposite information. MLTC under incomplete annotation aims to learn classifiers for relevant labels from incompletely annotated datasets while minimizing the impact of missing labels on the model and improving classification performance. Existing MLTC methods all rely on supervised training over fully manually annotated data and therefore cannot handle missing labels. [Methods] This article proposes a Partial-labeling Self-Training framework for MLTC (PST), which alleviates the negative impact of missing labels by recovering and exploiting them. Specifically, the PST framework first trains a base multi-label text classification model on the incompletely labeled dataset to obtain a teacher model. The teacher model then automatically scores large-scale unlabeled data as well as the incompletely labeled data. A dual-threshold mechanism next divides the labels into states according to their scores, yielding positive, negative, and other labels. Finally, the teacher model is updated through joint training using the label information from the three states. To comprehensively evaluate the PST framework, we randomly deleted labels from the training set of the English dataset AAPD at different missing ratios to construct synthetic datasets with varying degrees of incomplete annotation. We also manually corrected the incompletely annotated CCKS 2022 Task 8 dataset and used it as the real-world dataset for the experiments. [Results] Experiments on the synthetic datasets show that as the annotation problem intensifies, the performance of multi-label text classification models drops sharply; the PST framework slows this decline to some extent, and the more labels are missing, the more pronounced the relief. Results of different multi-label teacher models on the real-world dataset show that the PST framework improves each teacher model to varying degrees on incompletely annotated data, which fully demonstrates its generality. [Conclusions] The PST framework is a model-independent, plug-in framework compatible with various teacher models. It fully exploits external unlabeled data to optimize the teacher model while recovering and using the missing labels in incompletely labeled data, thereby weakening the impact of missing labels on the model. The experimental results indicate that the proposed framework is general and can alleviate the impact of incomplete annotation to some extent.