Multi-task learning method for accent and speech recognition based on spike features
Xiao Sujie, Li Ta, Hao Ruipeng, Renzeng Duojie, Zhao Qingwei
Journal of Tsinghua University (Science and Technology), 2025, Vol. 65, Issue 7: 1310-1319
Abstract: Accent is one of the main challenges facing speech recognition. To address multi-accent speech recognition and further explore the interaction between the accent recognition and speech recognition tasks, this paper proposes a multi-task learning method for accent and speech recognition based on spike features. First, the method shares the lower layers of the encoder between the two tasks, using accent information to implicitly enhance accent-specific acoustic features and thereby improve accented speech recognition. Second, analysis of the accent features reveals that they consist mostly of blank frames, while the effective label frames, which better reflect accent differences, do not play a major role. Therefore, within the multi-task learning framework, this paper takes the connectionist temporal classification (CTC) spike features corresponding to effective labels as the input features for accent recognition, improving accent recognition performance to a certain extent. Experimental results on the English Common Voice and AESRC2020 datasets show that, compared with a model trained directly on all the data mixed together, the proposed method improves speech recognition performance by 0.6% and 1.0% absolute, respectively; for accent recognition, the CTC spike-feature-based approach yields absolute improvements of 0.7% and 1.9%, respectively.
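The shared-encoder design at the heart of this method can be illustrated with a short PyTorch sketch. This is a minimal illustration, assuming a Transformer encoder whose lower layers feed both task branches; the class name SharedAccentASR, the layer counts, dimensions, and the mean-pooling accent head are all illustrative assumptions, not the paper's actual ESPnet configuration.

```python
import torch
import torch.nn as nn

class SharedAccentASR(nn.Module):
    """Toy multi-task model: shared lower encoder, ASR and accent heads."""
    def __init__(self, feat_dim=80, d_model=256, n_shared=3, n_asr=9,
                 vocab_size=500, n_accents=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.proj = nn.Linear(feat_dim, d_model)
        # Lower layers are shared, so gradients from the accent loss
        # implicitly reshape the acoustic features used for ASR.
        self.shared = nn.TransformerEncoder(layer, num_layers=n_shared)
        self.asr_upper = nn.TransformerEncoder(layer, num_layers=n_asr)
        self.ctc_head = nn.Linear(d_model, vocab_size)   # index 0 = CTC blank
        self.accent_head = nn.Linear(d_model, n_accents)

    def forward(self, feats):                  # feats: (B, T, feat_dim)
        h = self.shared(self.proj(feats))      # shared hidden features
        ctc_logits = self.ctc_head(self.asr_upper(h))
        accent_logits = self.accent_head(h.mean(dim=1))  # naive mean pooling
        return ctc_logits, accent_logits
```

In joint training, a CTC (or hybrid CTC/attention) loss on the ASR branch and a cross-entropy loss on the accent branch would be combined with a weighting factor and all parameters updated together; the naive mean pooling here is exactly what the spike-feature variant described below replaces.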
Objective: Accent is a significant challenge in speech recognition. To address multi-accent speech recognition and further explore the interaction between the accent and speech recognition tasks, this paper proposes a multi-task learning method for accent and speech recognition based on spike features. Multi-task learning can model accent recognition and speech recognition simultaneously, reducing system complexity and yielding a more compact model, and the information shared between the tasks is complementary, improving the performance of each. Existing methods, however, have two limitations. First, the accent information captured by different encoder layers, and its effect on speech recognition performance, has been insufficiently explored and analyzed. Second, the interaction between the two tasks, such as using speech recognition information to assist accent recognition, remains underexplored. This paper therefore investigates these issues.

Methods: A multi-task learning framework for accent and speech recognition is built within the ESPnet toolkit. By sharing the lower layers of the encoder, accent information is used to implicitly enhance the acoustic features of a specific accent, improving both accent and speech recognition performance. To further investigate the interaction between the two tasks, the encoder's hidden-layer features are analyzed. The analysis shows that the accent features are composed mainly of blank frames, whereas the effective label frames, which better reflect accent differences, play little role. To address this problem, connectionist temporal classification (CTC) spike features corresponding to effective labels are used as the input features for accent recognition within the multi-task learning framework. During forward propagation, CTC pseudo-label alignment information is computed, and the corresponding hidden-layer encoding features are gathered through the indices of the non-blank frames. These features are then passed through statistics pooling, and the resulting vectors are used to train the accent recognizer; all parameters are updated synchronously during joint training. Because speech sequences are typically much longer than text sequences, experiments are conducted with both Spike-Frame and Spike-Chunk features as accent features, where Spike-Chunk extends each Spike-Frame index by a fixed number of frames on both sides, as shown in the sketch below.
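The spike-selection and statistics-pooling step can be sketched as follows. This is a hedged illustration, assuming greedy argmax decoding stands in for the CTC pseudo-label alignment computed during forward propagation; the function name spike_features, the chunk handling, and the empty-spike fallback are assumptions, not the paper's implementation.

```python
import torch

def spike_features(hidden, ctc_logits, blank=0, chunk=0):
    """hidden: (T, D) encoder outputs; ctc_logits: (T, V) CTC outputs.
    chunk=0 gives Spike-Frame; chunk>0 widens each spike (Spike-Chunk)."""
    align = ctc_logits.argmax(dim=-1)                  # greedy pseudo alignment
    idx = (align != blank).nonzero(as_tuple=True)[0]   # non-blank frame indices
    if chunk > 0:                                      # extend on both sides
        spans = [torch.arange(max(i - chunk, 0),
                              min(i + chunk + 1, hidden.size(0)))
                 for i in idx.tolist()]
        if spans:
            idx = torch.unique(torch.cat(spans))
    if idx.numel() == 0:                               # no spikes: keep all frames
        idx = torch.arange(hidden.size(0))
    sel = hidden[idx]                                  # spike frames only
    # Statistics pooling: concatenate per-dimension mean and std.
    return torch.cat([sel.mean(dim=0), sel.std(dim=0, unbiased=False)])
```

The pooled vector feeds the accent classifier in place of whole-sequence pooling; because the alignment is recomputed in each forward pass, the selected spike indices evolve as the CTC branch improves.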
Results: Experiments on the English Common Voice and AESRC2020 datasets show that the proposed method improves speech recognition performance by 0.6% and 1.0% absolute, respectively, compared with a model trained on all the data mixed directly. For accent recognition, performance based on CTC spike features improves by 0.7% and 1.9% absolute, respectively.

Conclusions: This paper models accent recognition and speech recognition jointly within a multi-task learning framework. The experiments show that accent information can enhance the encoder's hidden-layer features to some extent, improving speech recognition performance, while the spike features of effective labels in speech recognition in turn improve accent recognition. These results confirm that exploiting the interaction between the two tasks benefits both.
Keywords: spike features / accent recognition / speech recognition / multi-task learning