Semi-supervised speaker verification system based on pre-trained models

LI Yishuang1,2, CHEN Zhicong2, MIAO Shiyu2, SU Qi2, LI Lin1,2, HONG Qingyang3

Journal of Tsinghua University (Science and Technology), 2024, Vol. 64, Issue 11: 1936-1943. DOI: 10.16511/j.cnki.qhdxxb.2024.26.048

Special topic: Human-machine speech communication

Abstract

In recent years, pre-trained models (PTMs) have been widely applied to speaker verification (SV) systems: attaching a speaker classification network downstream of the PTM and fine-tuning it can substantially improve system performance. However, most existing PTM-based SV studies fine-tune on labeled datasets and therefore require a large amount of annotated data in the target domain. This paper proposes a semi-supervised speaker verification system based on pre-trained models. First, a seed model is trained with a small amount of labeled data; second, the seed model is combined with an unsupervised clustering algorithm to generate pseudo-labels for the unlabeled data; third, the model is retrained jointly on the truly labeled and pseudo-labeled data; finally, performance is improved through multiple rounds of iteration. With only 100 h of labeled speaker data, the proposed semi-supervised system achieves an equal error rate (EER) of 1.02% on the VoxCeleb1-O test set, an 86.8% reduction relative to the baseline system, demonstrating the effectiveness of the proposed semi-supervised speaker verification system.
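
As a quick worked check of the figures above, the sketch below (a hypothetical illustration, not the authors' code) shows how such a relative EER reduction is computed; the implied baseline EER is inferred from the two reported numbers rather than stated in the abstract.

```python
# Hypothetical illustration (not the authors' code): relative EER reduction.
def relative_eer_reduction(eer_baseline: float, eer_new: float) -> float:
    """Return the relative EER reduction in percent."""
    return (eer_baseline - eer_new) / eer_baseline * 100.0

# With a final EER of 1.02% and a reported 86.8% relative reduction, the
# implied baseline EER is about 1.02 / (1 - 0.868) ≈ 7.7% (an inference).
implied_baseline_eer = 1.02 / (1.0 - 0.868)
print(f"implied baseline EER ≈ {implied_baseline_eer:.2f}%")
print(f"relative reduction = {relative_eer_reduction(implied_baseline_eer, 1.02):.1f}%")
```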

[Objective] Pre-trained models (PTMs) have become a cornerstone of speaker verification (SV) systems, significantly enhancing performance when a speaker classification network is attached downstream and the model is fine-tuned. Despite these advances, current research mainly focuses on fine-tuning with labeled datasets, which is challenging because it requires a large amount of annotated data in the target domain. Therefore, this paper proposes a semi-supervised SV system leveraging PTMs, designed to perform well under conditions of limited annotated data. [Methods] The proposed semi-supervised SV framework consists of several main steps. Initially, the entire model is fine-tuned on a small amount of labeled data, approximately 100 h, to create a high-performance seed model, referred to as model J. This seed model extracts speaker embeddings from a large unlabeled audio dataset. From these embeddings, a speaker embedding graph is constructed and processed by the Infomap clustering module, which assigns each audio sample a pseudo-label according to its cluster. Next, the original labeled data are combined with the newly pseudo-labeled data, and the model is retrained from scratch, yielding model B and completing the first iteration. Finally, the parameters of model B are fixed, and model B is re-established as the seed model; the above steps are repeated iteratively to obtain the final SV system, referred to as model F. [Results] Experiments conducted on the VoxCeleb dataset demonstrated the efficacy of the PTM-based system in low-resource scenarios, specifically with 100 h of labeled VoxCeleb2 data. The semi-supervised framework markedly improved speaker recognition performance, achieving a relative equal error rate (EER) reduction of 71.2% compared with the baseline SV system. It was also competitive with fully supervised systems across all three VoxCeleb1 test sets, with EERs of 1.25%, 1.29%, and 2.45% on VoxCeleb1-O/E/H, respectively. Performance improved substantially after the second iteration and converged over subsequent iterations, ultimately reaching an EER of 1.02% on the VoxCeleb1-O test set, an 86.8% reduction relative to the baseline system. [Conclusions] This paper proposes a semi-supervised SV system that utilizes PTMs and is tailored to low-resource scenarios. By incorporating unlabeled audio data, the system leverages PTMs, the Infomap clustering algorithm, and a pseudo-label correction technique. The experimental results underscore the effectiveness of the proposed semi-supervised training framework: even when restricted to merely 100 h of labeled data, the system achieves performance comparable to that of traditional fully supervised baseline systems, and multiple rounds of iterative training yield further notable improvements.
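
To make the iterative pipeline described above concrete, the following Python sketch outlines one possible form of the loop. It is a simplified illustration, not the authors' implementation: extract_embeddings and train_model are hypothetical placeholders, and connected components over a thresholded cosine-similarity graph stand in for the Infomap clustering used in the paper.

```python
# Simplified sketch of the iterative pseudo-labeling loop (not the authors' code).
# extract_embeddings() and train_model() are hypothetical placeholders; connected
# components over a thresholded cosine-similarity graph stand in for Infomap.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def extract_embeddings(model, utterances):
    # Placeholder: a real system would run the fine-tuned PTM plus speaker
    # back-end to produce one embedding per utterance.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(utterances), 256))


def cluster_pseudo_labels(embeddings, sim_threshold=0.6):
    # Build a similarity graph over L2-normalized embeddings and label each
    # utterance with the id of its connected component (stand-in for Infomap).
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adj = csr_matrix(x @ x.T > sim_threshold)
    _, labels = connected_components(adj, directed=False)
    return labels


def train_model(labeled, pseudo_labeled=None):
    # Placeholder: fine-tune the pre-trained model with a speaker
    # classification head on the (pseudo-)labeled data.
    return object()


labeled_data = [(f"utt_{i}", f"spk_{i % 10}") for i in range(100)]   # ~100 h labeled set
unlabeled_utts = [f"unlabeled_{i}" for i in range(1000)]             # large unlabeled set

model = train_model(labeled_data)                  # seed model
for iteration in range(3):                         # multiple training rounds
    emb = extract_embeddings(model, unlabeled_utts)
    pseudo = cluster_pseudo_labels(emb)
    pseudo_labeled = list(zip(unlabeled_utts, pseudo))
    model = train_model(labeled_data, pseudo_labeled)  # retrain from scratch
```

In the actual system, the pseudo-labeling step would also apply the pseudo-label correction mentioned in the conclusions before retraining.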

Keywords

speaker verification / pre-trained models / fine-tuning / semi-supervised learning / clustering

Cite this article

LI Yishuang, CHEN Zhicong, MIAO Shiyu, SU Qi, LI Lin, HONG Qingyang. Semi-supervised speaker verification system based on pre-trained models[J]. Journal of Tsinghua University (Science and Technology), 2024, 64(11): 1936-1943. https://doi.org/10.16511/j.cnki.qhdxxb.2024.26.048

Funding

National Natural Science Foundation of China (62371407, 62001405, 62276220)
