Keyword spotting aims to detect target keywords in speech, and deep neural networks provide effective solutions for small-footprint keyword spotting tasks. Most existing keyword spotting methods minimize a Softmax cross-entropy loss, assume that test and training samples come from the same distribution, and focus on maximizing classification accuracy on the training set without considering unknown speech outside it. With limited training data, it remains difficult for a keyword spotting system to be both robust to unknown speech and highly accurate. This paper studies open-set learning methods for the open-set keyword spotting task, combining a deep feature encoder with classifiers based on convolutional prototype learning and reciprocal point learning. The proposed method not only improves keyword classification accuracy but also achieves good non-keyword detection performance. Experimental results on the Google Speech Commands datasets V0.01 and V0.02 and on the LibriWords dataset derived from LibriSpeech show that the proposed method outperforms baseline methods on most evaluation metrics.
Abstract
[Objective] Keyword spotting (KWS) aims to detect target keywords in speech. Deep neural networks have provided effective solutions for small-footprint KWS. However, most KWS methods employ a Softmax-based cross-entropy loss, which assumes that test and training samples come from the same distribution. These methods focus on maximizing classification accuracy on the training set and neglect unknown speech outside the training data, which leads to significant challenges in real-world scenarios where training data is limited and unfamiliar speech is frequently encountered.

[Methods] This paper approaches KWS through open-set learning methods that accommodate the open vocabulary of KWS tasks, combining deep feature encoders with classifiers based on convolutional prototype learning and reciprocal point learning (illustrative sketches of both classifiers follow the abstract). For convolutional prototype learning, the Softmax network is first replaced with a prototype network to remove the closed-world assumption. Prototypes representing class-level features in the feature space are then constructed for each keyword. A distance-based measure of the similarity between a sample and each keyword is used for classification, maximizing the likelihood of the sample. To reject non-keywords effectively, a regularization constraint is applied to the prototype boundaries, improving the robustness of the system. For reciprocal point learning, reciprocal points representing the features not associated with each keyword class are constructed. The probability that a sample belongs to a keyword is assumed to be proportional to the distance between the sample and the corresponding reciprocal point, and this distance serves as the classification criterion. To detect non-keywords, the boundary range of the reciprocal points is restricted. In addition, this paper explores variants of reciprocal point learning, such as adversarial reciprocal point learning, which uses a more effective distance function and a tighter boundary constraint to further improve system performance. The backbone network for the small-footprint KWS systems is ResNet-15. The resulting KWS systems not only achieve higher classification accuracy but also detect non-keyword speech more reliably. Classification accuracy (ACC), macro-averaged F1 score, and the area under the receiver operating characteristic curve (AUC) are used to measure performance.

[Results] Experiments were conducted on the Google Speech Commands (GSC) datasets V0.01 and V0.02, as well as on the LibriWords dataset derived from LibriSpeech. The results show that the proposed methods outperform the baseline approaches on most evaluation metrics. The method based on reciprocal point learning achieved the best classification accuracy, and the methods based on generalized convolutional prototype learning and adversarial reciprocal point learning matched or surpassed the baselines. For non-keyword detection, the method based on adversarial reciprocal point learning performed best on the GSC datasets; as the number of non-keywords in LibriWords increased, the method using the generalized convolutional prototype loss achieved the best detection performance.
[Conclusions] By introducing generalized convolutional prototype learning and reciprocal point learning, this paper significantly improves the performance of KWS systems in open scenarios. The experimental results show that the proposed methods clearly outperform existing approaches in small-footprint systems trained with limited data.
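To make the convolutional-prototype classification rule described in [Methods] concrete, below is a minimal PyTorch sketch of a prototype head: each keyword owns a learnable prototype, the negative squared distance between a sample's feature and a prototype serves as the logit, and a regularizer pulls samples toward their own prototype so the class regions stay compact. The names (PrototypeHead, pl_weight) and the hyperparameter value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeHead(nn.Module):
    """Distance-based classifier head in the spirit of convolutional
    prototype learning; a sketch, not the paper's exact implementation."""

    def __init__(self, feat_dim: int, num_keywords: int, pl_weight: float = 0.01):
        super().__init__()
        # One learnable prototype per keyword class in the feature space.
        self.prototypes = nn.Parameter(torch.randn(num_keywords, feat_dim))
        self.pl_weight = pl_weight  # weight of the boundary regularizer

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Squared Euclidean distance from each sample to each prototype;
        # a closer prototype means a higher logit, hence the negation.
        return -torch.cdist(feats, self.prototypes) ** 2

    def loss(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        logits = self.forward(feats)
        # Distance-based cross entropy maximizes the likelihood of the
        # true keyword for each sample.
        ce = F.cross_entropy(logits, labels)
        # Regularizer: pull each sample toward its own prototype so the
        # class boundary stays tight and distant inputs can be rejected.
        pull = (-logits.gather(1, labels.unsqueeze(1))).mean()
        return ce + self.pl_weight * pull
```

At test time, a natural rejection rule under this geometry is to flag an input as a non-keyword when its distance to the nearest prototype exceeds a threshold.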
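The reciprocal-point rule inverts this geometry: a reciprocal point represents what a keyword class is not, so a sample is more likely to belong to a class the farther it sits from that class's reciprocal point. The sketch below uses the squared distance itself as the logit and adds one common form of the boundary constraint, a learnable margin R bounding how far keyword samples may drift, which restricts the open space left for non-keywords. Again, the names and the constraint's exact form are assumptions for illustration.

```python
class ReciprocalPointHead(nn.Module):
    """Classifier head in the spirit of reciprocal point learning; a sketch
    under stated assumptions, not the paper's implementation.
    (Reuses the torch / nn / F imports from the previous sketch.)"""

    def __init__(self, feat_dim: int, num_keywords: int, open_weight: float = 0.1):
        super().__init__()
        # One learnable reciprocal point per keyword, representing the
        # extra-class ("not this keyword") region of the feature space.
        self.reciprocal_points = nn.Parameter(torch.randn(num_keywords, feat_dim))
        # Learnable margin R used to bound the open space.
        self.margin = nn.Parameter(torch.tensor(1.0))
        self.open_weight = open_weight

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # The probability of keyword k grows with the distance to k's
        # reciprocal point, so the squared distance is the logit directly.
        return torch.cdist(feats, self.reciprocal_points) ** 2

    def loss(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        logits = self.forward(feats)
        ce = F.cross_entropy(logits, labels)
        # Boundary constraint (one common form): keep each keyword sample's
        # distance to its own reciprocal point near the margin R.
        d_true = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
        open_risk = ((d_true - self.margin) ** 2).mean()
        return ce + self.open_weight * open_risk
```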
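Finally, a hedged sketch of the evaluation quantities named in [Results]: closed-set accuracy and macro-averaged F1 over keyword samples, and a detection AUC measuring how well the maximum class score separates keywords from non-keywords. This is one plausible open-set protocol; the paper's exact protocol may differ, and all variable names are assumptions.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_open_set(y_true, y_pred, scores, is_keyword):
    """y_true / y_pred: integer keyword labels; scores: (N, K) class scores;
    is_keyword: boolean array, True where the sample is a known keyword."""
    # Closed-set metrics computed on keyword samples only.
    acc = accuracy_score(y_true[is_keyword], y_pred[is_keyword])
    macro_f1 = f1_score(y_true[is_keyword], y_pred[is_keyword], average="macro")
    # Detection AUC: does the maximum class score rank keywords above
    # non-keywords? (One plausible protocol, not necessarily the paper's.)
    auc = roc_auc_score(is_keyword.astype(int), scores.max(axis=1))
    return acc, macro_f1, auc
```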
Key words
limited training data /
keyword spotting /
open set recognition /
prototype learning
Funding
National Natural Science Foundation of China General Program (62176211); Shenzhen Science and Technology Innovation Commission International Cooperation Research Project (GJHZ20240218114401004)