Open-set learning for a robust small-footprint keyword spotting system with limited training data

Cite this article

HUANG Zijun, ZHANG Xiaolei. Open-set learning for a robust small-footprint keyword spotting system with limited training data[J]. Journal of Tsinghua University (Science and Technology), 2024, 64(11): 1927-1935.


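HUANG Zijun^1,2, ZHANG Xiaolei^1,2

1. School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an 710072, China;
2. Shenzhen Research Institute, Northwestern Polytechnical University, Shenzhen 518057, China

Received: 2023-12-23

Funding: General Program of the National Natural Science Foundation of China (62176211); International Cooperation Research Project of the Shenzhen Science and Technology Innovation Commission (GJHZ20240218114401004)

About the author: HUANG Zijun (born 2000), male, master's student

Corresponding author: ZHANG Xiaolei, professor, E-mail: xiaolei.zhang@nwpu.edu.cn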

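Abstract: Keyword spotting (KWS) aims to detect predefined keywords in speech, and deep neural networks provide an effective solution for small-footprint KWS. Most existing KWS methods minimize a Softmax cross-entropy loss, which assumes that test and training samples come from the same distribution; they focus on maximizing classification accuracy on the training set and do not account for unknown speech outside it. When training data are limited, it remains difficult for a KWS system to be both robust and accurate when it encounters unknown speech. This paper studies open-set learning methods that combine a deep feature encoder with classifiers based on convolutional prototype learning and reciprocal point learning for the open-set KWS task. The proposed methods not only improve keyword classification accuracy but also detect non-keywords well. Experimental results on the Google Speech Commands datasets V0.01 and V0.02, as well as on the LibriWords dataset derived from LibriSpeech, show that the proposed methods outperform the baseline methods on most evaluation metrics.

Keywords: limited training data; keyword spotting; open-set recognition; prototype learning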


Abstract: [Objective] Keyword spotting (KWS) aims to detect predefined keywords in speech. Deep neural networks have provided effective solutions for small-footprint KWS. However, most KWS methods employ a Softmax-based cross-entropy loss, which assumes that the test and training samples have identical distributions. These methods focus on maximizing classification accuracy on the training set and often neglect unknown speech outside the training samples. This leads to significant challenges in real-world scenarios, where limited training data are available and the system frequently encounters unfamiliar speech. [Methods] This paper introduces an approach to KWS based on open-set learning methods that can accommodate the open vocabulary of KWS tasks. These methods combine deep feature encoders with classifiers based on convolutional prototype learning and reciprocal point learning. For convolutional prototype learning, this paper first replaces the Softmax network with a prototype network to eliminate the closed-world assumption. It then constructs prototypes for each keyword that represent class-level features in the feature space. A distance-based measure represents the similarity between a sample and a keyword for classification, and the likelihood of the sample is maximized. To reject non-keywords effectively, a regularization constraint is applied to the boundary of the prototypes, which improves the robustness of the system. For reciprocal point learning, this paper constructs reciprocal points that represent features not associated with the keyword class. The probability that a sample belongs to a keyword is assumed to be proportional to the distance between the sample and the reciprocal points of that keyword, and this is used as the classification criterion. To detect non-keywords, the boundary range of the reciprocal points is restricted. In addition, this paper explores variants of reciprocal point learning, such as adversarial reciprocal point learning, which uses a more effective distance function and a more adequate boundary constraint to further improve system performance. The backbone network used for training the small-footprint KWS systems is ResNet15. The KWS systems built with these methods not only enhance classification accuracy but also improve the detection of non-keyword categories. Classification accuracy (ACC), the macro-averaged F₁ score, and the area under the receiver operating characteristic curve (AUC) are used to measure the performance of the proposed methods. [Results] Experiments were conducted on the Google Speech Commands (GSC) datasets V0.01 and V0.02, as well as on the LibriWords dataset derived from LibriSpeech. The results showed that the proposed methods outperformed the baseline approaches on most evaluation metrics. The method based on reciprocal point learning achieved the best classification ACC. In addition, the methods based on generalized convolutional prototype learning and adversarial reciprocal point learning equaled or even surpassed the baseline methods. When detecting non-keywords, the method based on adversarial reciprocal point learning performed best on the GSC datasets. On the LibriWords datasets, which contain far more non-keywords, the method employing the generalized convolutional prototype loss achieved the best detection performance in most cases. [Conclusions] By introducing generalized convolutional prototype learning and reciprocal point learning, this paper significantly improves the performance of KWS systems in open scenarios. The experimental results show that the proposed methods clearly outperform the baseline approaches on most metrics for small-footprint systems with limited training data.

Key words: limited training data; keyword spotting; open-set recognition; prototype learning

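Keyword spotting (KWS), also known as spoken term detection (STD), is the task of detecting predefined keywords in speech. It is used in intelligent agent services such as voice assistants on mobile phones and smart devices. In recent years, KWS based on deep neural networks (DNNs)^[1] has significantly outperformed conventional methods. It treats KWS as an audio classification task: a DNN model is trained to predict the posterior probabilities of the predefined keywords. In the Softmax output layer of the model, each neuron corresponds to one keyword, and an additional filler neuron represents all non-keyword segments. This classification-based approach is a marked improvement over keyword/filler hidden Markov models (HMMs). Since then, researchers have explored classification-based methods^[2-6] that reduce the memory footprint of the system. However, because the Softmax cross-entropy loss is closed-set in nature, these methods need a large number of non-keyword segments as training samples to make the system robust in deployment. In addition, using a single filler neuron to represent all non-keyword segments ignores the diversity of non-keywords and degrades system performance.

References [7-12] introduce metric learning into KWS systems: an appropriate distance function is learned to measure the similarity or dissimilarity between samples, and a mapping projects input samples into a feature space in which the differences between feature vectors of different classes are enlarged and those of the same class are reduced. However, applying metric learning directly to a KWS system ignores the fact that the target keywords are predefined and carry fixed prior knowledge, which noticeably degrades KWS performance. To address this problem, Huh et al.^[8] proposed an angular prototypical network with fixed target classes to improve robustness to non-keyword segments, and used an additional support vector machine (SVM) for the final decision. Xu et al.^[12] proposed a multi-class AUC optimization method with confidence-based decisions, which maximizes the area under the receiver operating characteristic curve (AUC) to enlarge the distance between keywords and non-keywords. However, to guarantee KWS performance, it requires a large number of non-target keyword samples during training.

Inspired by the open-set recognition problem^[13-15], this paper builds two accurate and robust small-footprint KWS methods based on convolutional prototype learning (CPL)^[14] and reciprocal point learning (RPL)^[15]. CPL replaces the closed-set Softmax classifier with an open convolutional prototype network, maximizes the log-likelihood of known samples, and uses a prototype loss as a latent regularization constraint for discriminative learning. RPL models the dissimilarity of the known samples so as to learn more compact and more discriminative representations. RPL is free from the restriction of the closed-set Softmax cross-entropy loss (namely, the premise that the output probabilities of all classes must sum to 1). Moreover, the proposed methods do not need additional non-target keywords during training, which reduces the amount of training data. On the Google Speech Commands (GSC) datasets V0.01^[16] and V0.02^[17], as well as on the LibriWords dataset derived from LibriSpeech, the two proposed methods are compared with KWS methods based on the Softmax cross-entropy loss^[3], prototype loss^[8], triplet loss^[9], and multi-class AUC loss^[12]. The proposed methods separate keywords from non-keywords well and provide a reference for building speech recognition systems that recognize keywords and reject non-keywords effectively.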

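1 Related work

1.1 KWS

In recent years, with advances in deep learning, DNNs have been applied to KWS. Sainath et al.^[18] used convolutional neural networks (CNNs) for KWS. References [19-20] used depthwise separable convolution (DS-Conv) to reduce model size while maintaining accuracy, and residual connections have been shown to improve KWS performance significantly^[3]. Mittermaier et al.^[20] proposed a new KWS system that combines sinc-convolutions with DS-Conv. Choi et al.^[21] proposed a ResNet variant with temporal convolution that enlarges the receptive field over speech features and lowers the computational cost. All of these studies train the network with a classification objective under the closed-set assumption, and they perform poorly in open-set KWS, where unknown words are present.

Metric learning loss functions, such as the triplet loss^[22] and the contrastive loss^[23], have been widely used in face recognition^[24-25] and speaker recognition^[26-27]. These methods map the input signal into a feature space so as to increase the inter-class variance and reduce the intra-class variance. Many studies have applied metric learning to KWS. Parnami et al.^[28] used prototypical networks to address few-shot KWS. Seth et al.^[29] treated KWS in continuous speech as an imbalanced classification problem and handled target and non-target classes with a combination of prototypical loss and metric loss. Huh et al.^[8] proposed an angular-prototype-based loss together with an SVM-based inference method. Vygon et al.^[9] applied metric learning to user-defined KWS. Xu et al.^[12] proposed a KWS method based on multi-class AUC optimization and confidence-based decisions, which reduces the computational cost of back-end processing.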

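1.2 Open-set recognition

Open-set recognition aims to classify the learned target classes correctly while rejecting unknown non-target classes. Bendale et al.^[13] introduced deep learning into open-set recognition, showed that thresholding the probabilities from a Softmax layer does not yield a robust open-set model, and proposed OpenMax, which detects unknown classes by modeling the distances of activation vectors. Ge et al.^[30] extended OpenMax to generative OpenMax, which trains the DNN with unknown samples synthesized by a generative model. However, these methods are still limited by the closed-set nature of the Softmax classifier. Yang et al.^[14, 31] proposed CPL, which replaces the closed-set Softmax classifier with a convolutional prototype network designed for open scenarios and learns class features from the similarity between samples and prototypes. Chen et al.^[15, 32] approached the problem from the opposite direction of prototype clustering and proposed RPL, which discriminates samples by their dissimilarity to reciprocal points. These methods perform well in open-set image recognition. For the speech KWS task in open-set scenarios, this paper adopts the ideas of CPL and RPL so that the network learns more compact keyword class features, and builds two new robust, lightweight KWS systems.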

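2 Algorithms

2.1 CPL

This paper uses a CNN as the feature extractor f(x; θ), where x and θ are the raw input and the network parameters of the CNN, respectively. A conventional CNN classifies the learned acoustic feature vector linearly through a Softmax layer, whereas CPL^[14] creates several learnable, keyword-specific feature prototypes for each keyword class and classifies by prototype matching. Suppose there are C classes, each with K feature prototypes; f(x; θ) and the feature prototypes m_ij (the j-th prototype of class i) are trained jointly in the network. At classification time, an input is classified by prototype matching: the Euclidean distances are computed to find the prototype nearest to the sample, and the sample is assigned to the keyword class of that prototype.

Given x, the feature vector is first obtained through f(x; θ); this vector is then compared with all feature prototypes and assigned to the class of the nearest prototype: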

$ \begin{equation*} x \in \text { class } \mathop {\arg \max }\limits_{i = 1}^C g_{i}(f(x ; \theta)) . \end{equation*} $

(1)

where g_i(f(x; θ)) is the predicted probability that the feature vector belongs to class i.
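The prototype-matching rule can be illustrated with a minimal NumPy sketch. It assumes that the score g_i is taken as the negative squared Euclidean distance to the closest prototype of class i, so that the arg max in Eq. (1) coincides with picking the nearest prototype; the function name and the toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def nearest_prototype_class(feature, prototypes):
    """Return the class whose closest prototype is nearest to `feature`.

    feature:    array of shape (D,), standing in for f(x; theta).
    prototypes: array of shape (C, K, D), K prototypes m_ij for each of C classes.
    """
    # Squared Euclidean distance from the feature to every prototype, shape (C, K).
    sq_dist = np.sum((prototypes - feature) ** 2, axis=-1)
    # Distance to the closest prototype of each class, shape (C,).
    per_class = sq_dist.min(axis=1)
    # Smallest distance corresponds to the largest score g_i (assumed g_i = -distance).
    return int(np.argmin(per_class))

# Toy example (made-up numbers): 3 classes, 1 prototype each, 2-D features.
toy_prototypes = np.array([[[0.0, 0.0]], [[5.0, 5.0]], [[-5.0, 5.0]]])
toy_feature = np.array([4.0, 4.5])
predicted = nearest_prototype_class(toy_feature, toy_prototypes)
```

In this toy case the feature lies closest to the prototype of class 1, so that class is returned.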

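In the CPL framework, the trainable parameters consist of two parts: θ and the feature prototypes of each class, M_w = {m_ij | i = 1, 2, …, C; j = 1, 2, …, K}. θ and M_w are trained jointly in an end-to-end manner so that they cooperate better and improve classification performance.

Training the system requires a corresponding loss function, which should be differentiable with respect to θ and M_w and closely related to classification accuracy. This paper adopts the distance-based cross-entropy (DCE) loss.

In the convolutional prototype network, distance measures the similarity between a sample and a feature prototype. The probability p(x ∈ m_ij | x) that a sample (x, y) belongs to prototype m_ij can be measured by the distance d_c(f(x; θ), m_ij) between them: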

$ \begin{equation*} p\left(x \in m_{i j} \mid x\right)=\frac{\mathrm{e}^{-\gamma d_{\mathrm{c}}\left(f(x ; \theta), m_{i j}\right)}}{\sum\limits_{k=1}^{C} \sum\limits_{l=1}^{K} \mathrm{e}^{-\gamma d_{\mathrm{c}}\left(f(x ; \theta), m_{k l}\right)}} . \end{equation*} $

(2)

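where $d_{\mathrm{c}}\left(f(x ; \theta), m_{i j}\right)=\left\|f(x ; \theta)-m_{i j}\right\|_{2}^{2}$, and $\gamma$ is a hyperparameter that controls the softness of the probability distribution. The posterior probability p(y|x) of a single sample is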

$ \begin{equation*} p(y \mid x)=\sum\limits_{j=1}^{K} p\left(x \in m_{y j} \mid x\right) \end{equation*}. $

(3)

For a set of S labeled keyword samples D_L = {(x₁, y₁), (x₂, y₂), …, (x_S, y_S)}, where y_s ∈ {1, 2, …, C} is the true label of x_s, the classification cross-entropy loss l_DCE is

$ \begin{equation*} l_{\mathrm{DCE}}=-\frac{1}{S} \sum\limits_{s=1}^{S} \ln \frac{\sum\limits_{j=1}^{K} \mathrm{e}^{-\gamma d_{\mathrm{c}}\left(f\left(x_{s} ; \theta\right), m_{y_{s}, j}\right)}}{\sum\limits_{i=1}^{C} \sum\limits_{j=1}^{K} \mathrm{e}^{-\gamma d_{\mathrm{c}}\left(f\left(x_{s} ; \theta\right), m_{i j}\right)}} \end{equation*}. $

(4)

where m_{y_s, j} is the prototype of class y_s, the class to which sample x_s belongs, that is closest to the feature f(x_s; θ).
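As a rough illustration of Eqs. (2) to (4), the following NumPy sketch evaluates the DCE loss for a batch of feature vectors. It is a sketch under stated assumptions rather than the authors' implementation: the features are taken as given instead of being produced by a CNN, no gradients are computed, and the function name `dce_loss` and all toy values are hypothetical.

```python
import numpy as np

def dce_loss(features, labels, prototypes, gamma=1.0):
    """Distance-based cross-entropy loss.

    features:   (S, D) feature vectors f(x_s; theta).
    labels:     (S,) integer class labels y_s in [0, C).
    prototypes: (C, K, D) prototypes m_ij.
    Returns the scalar loss and the (S, C) class posteriors p(y | x).
    """
    S = features.shape[0]
    C, K, _ = prototypes.shape
    # Squared Euclidean distances d_c to every prototype, shape (S, C, K).
    diff = features[:, None, None, :] - prototypes[None, :, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Softmax over all C*K prototypes (Eq. 2), computed stably in the log domain.
    logits = (-gamma * sq_dist).reshape(S, C * K)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    proto_prob = np.exp(log_p).reshape(S, C, K)
    # Class posterior: sum over the K prototypes of each class (Eq. 3).
    class_prob = proto_prob.sum(axis=2)
    # Mean negative log-probability of the true class (Eq. 4).
    true_prob = class_prob[np.arange(S), labels]
    return float(-np.mean(np.log(true_prob))), class_prob

# Toy check (made-up numbers): two classes, one prototype each.
toy_protos = np.array([[[0.0, 0.0]], [[5.0, 5.0]]])
toy_feats = np.array([[0.0, 0.0], [5.0, 5.0]])
loss_correct, probs = dce_loss(toy_feats, np.array([0, 1]), toy_protos)
loss_swapped, _ = dce_loss(toy_feats, np.array([1, 0]), toy_protos)
```

When every feature sits on the prototype of its own class the loss is close to zero, and swapping the labels makes it large, which is the behavior the loss is meant to enforce.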

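2.2 Generalized CPL

Generalized CPL (GCPL)^[14] builds on CPL and DCE and adds a stronger constraint to improve classification performance. DCE improves the classification accuracy of the model, but directly minimizing the classification loss tends to overfit the network and cannot reject non-keyword classes. Therefore, this paper uses the prototype loss^[14] as a regularization constraint to improve the generalization of the convolutional prototype network. The prototype loss l_PL is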

$ \begin{equation*} l_{\mathrm{PL}}=\frac{1}{S} \sum\limits_{s=1}^{S}\left\|f\left(x_{s} ; \theta\right)-m_{y_{s}, j}\right\|_{2}^{2} \end{equation*}. $

(5)

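This paper trains the network with the prototype loss combined with the classification cross-entropy loss. The generalized convolutional prototype loss l_GCPL is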

$ \begin{equation*} l_{\mathrm{GCPL}}=l_{\mathrm{DCE}}+\lambda l_{\mathrm{PL}} . \end{equation*} $

(6)

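where λ is a hyperparameter that controls the weight of the prototype loss. In essence, the prototype loss is a regularization constraint on maximum likelihood.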
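A compact sketch of Eqs. (5) and (6) follows. It assumes one prototype per class (K = 1, the value used in the experiments of this paper), takes the features as given, and uses hypothetical names and toy values; it only evaluates the loss and is not the authors' training code.

```python
import numpy as np

def gcpl_loss(features, labels, prototypes, gamma=1.0, lam=0.1):
    """Generalized convolutional prototype loss with one prototype per class.

    features:   (S, D) feature vectors.
    labels:     (S,) integer labels.
    prototypes: (C, D) one prototype per class (K = 1 is assumed here).
    Returns (l_GCPL, l_DCE, l_PL).
    """
    # Squared distances between every sample and every prototype, shape (S, C).
    sq_dist = np.sum((features[:, None, :] - prototypes[None, :, :]) ** 2, axis=-1)
    # DCE term: log-softmax over negative scaled distances.
    logits = -gamma * sq_dist
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(features.shape[0])
    l_dce = -np.mean(log_prob[idx, labels])
    # Prototype loss: mean squared distance to the prototype of the true class (Eq. 5).
    l_pl = np.mean(sq_dist[idx, labels])
    # Weighted sum (Eq. 6).
    return float(l_dce + lam * l_pl), float(l_dce), float(l_pl)

# Toy values (made up): each sample is at squared distance 1 from its own prototype.
toy_protos = np.array([[0.0, 0.0], [5.0, 5.0]])
toy_feats = np.array([[1.0, 0.0], [4.0, 5.0]])
total, l_dce, l_pl = gcpl_loss(toy_feats, np.array([0, 1]), toy_protos)
```

The prototype term pulls each feature toward the prototype of its own class, which is what tightens the class clusters and leaves more room to reject non-keywords.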

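2.3 RPL

RPL is defined as follows^[32]. Let the acoustic feature space of keyword m be S_m, and let the corresponding open feature space be $O_{m}=\mathbb{R}^{\mathrm{D}}-S_{m}$, where $\mathbb{R}^{\mathrm{D}}$ is the high-dimensional feature space. To manage the open-space risk better, O_m is further divided into the space of the other keywords, O_m^pos, and the non-keyword space, O_m^neg. Let the reciprocal points of m be P^m = {p_i^m | i = 1, 2, …, M}, where M is the number of reciprocal points per class. A sample x₁ ∈ O_m is closer to P^m in the feature space than a sample x₂ ∈ S_m, that is,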

$ \begin{gather*} \forall d_{\mathrm{R}} \in \eta\left(P^{m}, D_{\mathrm{L}}^{m}\right) , \\ \max \left(\zeta\left(P^{m}, D_{\mathrm{L}}^{\neq m} \cup D_{\mathrm{U}}\right)\right) \leqslant d_{\mathrm{R}} . \end{gather*} $

(7)

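where d_R is a distance between a sample and the reciprocal points; η(P^m, D_L^m) is the set of distances between the samples of keyword m and the reciprocal points P^m, with D_L^m the sample set of keyword m; ζ(P^m, D_L^≠m ∪ D_U) is the set of distances between all other samples and P^m, with D_L^≠m the set of samples that do not contain keyword m and D_U the set of non-keyword samples. To better separate the known keyword feature space from the unknown space, this paper maximizes the distance between the input feature vector and the reciprocal points, and optimizes the reciprocal points of each class through f(x; θ). Given a sample x and P^m, the distance d₁(f(x; θ), P^m) between them is

$ \begin{equation*} d_{1}\left(f(x ; \theta), P^{m}\right)=\frac{1}{M} \sum\limits_{i=1}^{M}\left\|f(x ; \theta)-p_{i}^{m}\right\|_{2}^{2} . \end{equation*} $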

(8)

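This paper computes the distance between the input feature f(x; θ) and the reciprocal points to express the dissimilarity between the sample and the reciprocal points of each class, and thereby determines which keyword class the sample belongs to. By the nature of reciprocal points, the probability that x belongs to class m is proportional to the dissimilarity between x and P^m: the larger d₁(f(x; θ), P^m) is, the more likely x is to be assigned to keyword class m. Since the probabilities over the C keyword classes sum to 1, p(y|x) is expressed as

$ \begin{equation*} p(y \mid x)=\frac{\mathrm{e}^{\gamma d_{1}\left(f(x ; \theta), P^{y}\right)}}{\sum\limits_{i=1}^{C} \mathrm{e}^{\gamma d_{1}\left(f(x ; \theta), P^{i}\right)}} . \end{equation*} $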

(9)

where P^y is the set of reciprocal points of class y, P^y = {p_i^y | i = 1, 2, …, M}. The classification cross-entropy loss l_C of RPL is optimized as the negative log-probability:

$ \begin{equation*} l_{\mathrm{C}}=-\frac{1}{S} \sum\limits_{s=1}^{S} \ln \frac{\mathrm{e}^{\gamma d_{1}\left(f\left(x_{s} ; \theta\right), P^{y_{s}}\right)}}{\sum\limits_{i=1}^{C} \mathrm{e}^{\gamma d_{1}\left(f\left(x_{s} ; \theta\right), P^{i}\right)}} . \end{equation*} $

(10)

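Maximizing the dissimilarity between input features and reciprocal points helps widen the gap between the closed-set space and the open space. Because S_m and O_m are complementary in the feature space, constraining the distance between S_m and P^m indirectly bounds the open-set risk. The reciprocal point loss l_rl is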

$ \begin{equation*} l_{\mathrm{rl}}=\frac{1}{S} \sum\limits_{s=1}^{S}\left\|d_{1}\left(f\left(x_{s} ; \theta\right), P^{y_{s}}\right)-R^{y_{s}}\right\|_{2}^{2} \end{equation*}. $

(11)

where R^{y_s} is a learnable margin. The final loss function l_rpl is

$ \begin{equation*} l_{\mathrm{rpl}}=l_{\mathrm{C}}+\alpha l_{\mathrm{rl}} . \end{equation*} $

(12)

where α is a hyperparameter that controls the weight of the distance loss l_rl.
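The RPL objective can be sketched in NumPy as follows. This is an illustrative reading rather than the authors' code: the softmax is normalized over the C keyword classes, each using its own reciprocal points, the margins are passed in as a plain array instead of being learned, and the function name and toy values are hypothetical.

```python
import numpy as np

def rpl_loss(features, labels, reciprocal_points, margins, gamma=1.0, alpha=0.1):
    """Reciprocal point learning loss (Eqs. 8 to 12), evaluated for given features.

    features:          (S, D) feature vectors.
    labels:            (S,) integer labels.
    reciprocal_points: (C, M, D) reciprocal points of each class.
    margins:           (C,) margin R of each class (learnable in the paper, fixed here).
    Returns the scalar loss and the (S, C) log-probabilities.
    """
    # d_1: mean squared distance to the M reciprocal points of each class, shape (S, C).
    diff = features[:, None, None, :] - reciprocal_points[None, :, :, :]
    d1 = np.sum(diff ** 2, axis=-1).mean(axis=2)
    # A larger distance to the reciprocal points of a class means a higher class score.
    logits = gamma * d1
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(features.shape[0])
    l_c = -np.mean(log_prob[idx, labels])
    # Margin term: keep the distance to the own-class reciprocal points near R.
    l_rl = np.mean((d1[idx, labels] - margins[labels]) ** 2)
    return float(l_c + alpha * l_rl), log_prob

# Toy values (made up): the reciprocal point of class 0 lies far from class-0 samples.
toy_rp = np.array([[[5.0, 5.0]], [[0.0, 0.0]]])
toy_feats = np.array([[0.0, 0.0]])
loss, log_prob = rpl_loss(toy_feats, np.array([0]), toy_rp, np.array([50.0, 50.0]))
toy_pred = int(np.argmax(log_prob[0]))
```

Note the sign flip relative to prototype learning: a sample is assigned to the class whose reciprocal points it is farthest from.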

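2.4 Adversarial RPL

Building on RPL, adversarial RPL (ARPL)^[15] additionally uses the angular information relative to the corresponding reciprocal points to evaluate more fully the dissimilarity between known-class features and their reciprocal points, and adopts a new adversarial margin constraint l_arl to control the risk of the unknown open space. The new distance function d₂(f(x; θ), P^m), the constraint l_arl, and the adversarial reciprocal point loss l_arpl are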

$ \begin{equation*} d_{2}\left(f(x ; \theta), P^{m}\right)=\frac{1}{M} \sum\limits_{i=1}^{M}\left[\left\|f(x ; \theta)-p_{i}^{m}\right\|_{2}^{2}-f(x ; \theta) \cdot p_{i}^{m}\right], \end{equation*} $

(13)

$ \begin{equation*} l_{\mathrm{arl}}=\frac{1}{S} \sum\limits_{s=1}^{S} \max \left(d_{1}\left(f\left(x_{s} ; \theta\right), P^{y_{s}}\right)-R^{y_{s}}, 0\right), \end{equation*} $

(14)

$ \begin{equation*} l_{\mathrm{arpl}}=l_{\mathrm{C}}+\alpha l_{\mathrm{arl}} . \end{equation*} $

(15)
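The two ARPL-specific terms can be sketched as below: the distance d₂ of Eq. (13), which subtracts the dot product from the squared Euclidean distance, and the hinge-style margin constraint of Eq. (14). The function name, the fixed (rather than learned) margins, and the toy values are assumptions made for illustration.

```python
import numpy as np

def arpl_terms(features, labels, reciprocal_points, margins):
    """Compute the ARPL distance d_2 and the margin constraint l_arl.

    features:          (S, D) feature vectors.
    labels:            (S,) integer labels.
    reciprocal_points: (C, M, D) reciprocal points.
    margins:           (C,) margin R of each class (fixed here for illustration).
    Returns d2 with shape (S, C) and the scalar l_arl.
    """
    diff = features[:, None, None, :] - reciprocal_points[None, :, :, :]
    sq = np.sum(diff ** 2, axis=-1)  # squared Euclidean part, (S, C, M)
    dot = np.einsum('sd,cmd->scm', features, reciprocal_points)  # dot-product part
    d1 = sq.mean(axis=2)
    d2 = (sq - dot).mean(axis=2)
    idx = np.arange(features.shape[0])
    # Hinge: penalize only samples farther than R from their own reciprocal points.
    l_arl = np.mean(np.maximum(d1[idx, labels] - margins[labels], 0.0))
    return d2, float(l_arl)

# Toy values (made up): one 2-D sample, two classes, one reciprocal point each.
toy_feats = np.array([[1.0, 0.0]])
toy_rp = np.array([[[0.0, 1.0]], [[1.0, 0.0]]])
d2, l_arl = arpl_terms(toy_feats, np.array([0]), toy_rp, np.array([1.5, 1.5]))
```

For the toy sample, d₂ is 2 with respect to class 0 (orthogonal reciprocal point) and -1 with respect to class 1 (identical reciprocal point), so both Euclidean and angular dissimilarity push it toward class 0.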

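3 Experimental setup

3.1 Datasets

To simulate realistic scenarios, the proposed KWS models are trained and tested on the GSC V0.01^[16] and V0.02^[17] datasets. GSC V0.01 contains 64,727 one-second utterances of 30 spoken keywords from 1,881 speakers. GSC V0.02 is an extended version of V0.01 and contains 105,829 one-second utterances of 35 spoken keywords from 2,618 speakers. Both datasets also include several minutes of background noise files. GSC V0.01 and V0.02 each come with a validation file and a test file; the audio listed in them is used as validation and test data, respectively, and the remaining audio files are used as training data. During training, the data are randomly time-shifted and mixed with noise. As in other studies that use these datasets, 10 target spoken keywords are selected, "yes", "no", "up", "down", "left", "right", "on", "off", "stop", and "go", and two further classes are added: a silence class with no speech signal and a non-keyword class that contains all other words. However, in the experimental setups of most studies, the non-keywords in the test set have already been used during training, which is inconsistent with real KWS applications. Following Huh et al.^[8], this paper first tests with 10 non-keywords that are not used during training; this setup, however, still needs non-keyword samples as auxiliary information during training and therefore does not meet practical requirements. To verify that the proposed methods separate keywords from non-keywords better, a second setup introduces no non-target keyword samples during training, that is, all classes outside the training classes are used for testing. The training and test data are listed in Table 1: the GSC keyword classes are used for both training and testing, and the non-keyword classes are allocated in two ways.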

Table 1 Configuration of the non-keyword datasets

Dataset	Case 1	Case 2
Training set	"zero" "one" "two" "three" "four" "five" "six" "seven" "eight" "nine"	none
Test set	"bed" "bird" "cat" "dog" "happy" "house" "sheila" "tree" "marvin" "wow"	"zero" "one" "two" "three" "four" "five" "six" "seven" "eight" "nine" "bed" "bird" "cat" "dog" "happy" "house" "sheila" "tree" "marvin" "wow"
Note: "none" indicates no content.


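To verify the generalization of the new robust, lightweight KWS models, a dataset made up of a large number of different words is needed for testing. LibriSpeech^[33], which contains about 1,000 h of real English speech, is used for this purpose. Because LibriSpeech provides only utterance-level transcripts without word-level alignment, the Montreal Forced Aligner^[34] is used to extract words and create LibriWords, a dataset of single spoken words with word-level labels. Following the setup of Vygon et al.^[9], four versions are used, LibriWords10, LibriWords100, LibriWords1000, and LibriWords10000, which correspond to the 10, 100, 1,000, and 10,000 most frequent words in LibriSpeech that do not overlap with the GSC vocabulary. For example, the words in LibriWords10 are "all", "and", "before", "himself", "man", "not", "said", "so", "time", and "upon".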

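3.2 Feature extraction

The input features are 40-dimensional Mel-frequency cepstral coefficients (MFCCs) computed from the raw 16 kHz waveform with a frame length of 40 ms and a hop of 10 ms; all wav files are cut to 1 s. Following the setup of Tang et al.^[3], the input is randomly time-shifted by -100 to 100 ms and mixed with noise drawn from the GSC background noise.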

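3.3 Evaluation metrics

Accuracy (ACC) and the F₁-score are used to measure the performance of the proposed KWS systems. ACC is the proportion of test samples that are classified correctly. The F₁-score combines the precision and recall of a classifier into a single metric through their harmonic mean; it is usually used for binary classification, so its multi-class extension, macro-F₁, is used here. In addition, AUC is used to measure how well the KWS system detects non-keywords across different thresholds.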
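For reference, macro-F₁ and AUC can be computed with plain NumPy as sketched below. The AUC uses the rank-statistic form (the probability that a keyword sample receives a higher confidence score than a non-keyword sample); the choice of score and the toy values are assumptions, not details taken from the paper.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1_scores))

def auc_score(scores, is_keyword):
    """AUC as the fraction of (keyword, non-keyword) pairs ranked correctly."""
    pos = scores[is_keyword]
    neg = scores[~is_keyword]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Toy values (made up).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])
toy_f1 = macro_f1(y_true, y_pred, num_classes=3)

confidence = np.array([0.9, 0.3, 0.8, 0.4])
keyword_mask = np.array([True, True, False, False])
toy_auc = auc_score(confidence, keyword_mask)
```

Because AUC integrates over all rejection thresholds, it measures how separable keywords and non-keywords are without committing to one operating point.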

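3.4 Model architecture

ResNet15^[3] is used as the backbone network. It starts with a bias-free convolutional layer with weights $W \in \mathbb{R}^{h \times w \times n}$, where h and w are the height and width of the convolution kernel and n is the number of output channels. The output of this first convolutional layer is fed to the subsequent residual blocks, which are followed by a separate non-residual convolutional layer; finally, an average pooling layer produces the output. Dilated convolutions are used to enlarge the receptive field of the network, and a batch normalization layer is added after every convolutional layer to help train the deep network.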

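3.5 Experimental parameters

Each model is trained for 60 epochs with the Adam optimizer. The initial learning rate is 0.0010 and is reduced to 0.0001 after 30 epochs. The batch size is 128 and the weight decay is 10^-5. For the prototype loss and triplet loss baselines, the sampling strategy and hyperparameters follow Refs. [10-11], and the triplet loss model uses the k-nearest-neighbor algorithm for classification. For the multi-class AUC loss, the margin hyperparameter δ is set to 0.3. K and M are both set to 1, and α and λ are both set to 0.1. The training pipeline is shown in Fig. 1, where σ is the activation function.

Fig. 1 Flowchart of the KWS system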


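4 Results and analysis

Table 2 compares the proposed KWS methods with the four baseline methods. The generalized convolutional prototype loss, reciprocal point loss, and adversarial reciprocal point loss used in this paper clearly improve ACC and macro-F₁ and perform well on the vast majority of metrics. First, following the training strategy of Ref. [10] (case 1 in Table 1), non-target keywords are added during training as a non-keyword class to assist network training. Taking GSC V0.02-12 as an example, the reciprocal point loss improves ACC by 23.1% (96.11% to 97.01%) and macro-F₁ by 36.4% (0.9478 to 0.9668) relative to the triplet loss. Second, under the new data allocation strategy proposed in this paper, training contains no non-target keywords. Taking GSC V0.02-11 as an example, the reciprocal point loss improves ACC by 25.8% (97.44% to 98.10%) and macro-F₁ by 25.4% (0.9721 to 0.9792) relative to the Softmax cross-entropy loss, and in this case the generalized convolutional prototype loss and the adversarial reciprocal point loss match or exceed the baseline methods. Experiments with the same settings on GSC V0.01 show that the proposed methods improve the classification performance of the KWS system effectively.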

Table 2 Comparison of the proposed methods with the four baseline methods

Loss function	GSC V0.01-11 ACC/%	GSC V0.01-11 macro-F₁	GSC V0.02-11 ACC/%	GSC V0.02-11 macro-F₁	GSC V0.01-12 ACC/%	GSC V0.01-12 macro-F₁	GSC V0.02-12 ACC/%	GSC V0.02-12 macro-F₁
Softmax cross-entropy loss^[3]	96.59	0.9635	97.44	0.9721	95.65	0.9611	95.97	0.9647
Prototype loss^[8]	96.69	0.9651	97.21	0.9703	95.28	0.9370	96.58	0.9541
Triplet loss^[9]	93.60	0.9312	95.45	0.9524	96.29	0.9500	96.11	0.9478
Multi-class AUC loss^[12]	96.66	0.9633	96.73	0.9668	95.91	0.9600	95.62	0.9612
Generalized convolutional prototype loss	97.05	0.9682	97.50	0.9730	95.97	0.9678	95.83	0.9629
Reciprocal point loss	97.08	0.9684	98.10	0.9792	96.59	0.9703	97.01	0.9668
Adversarial reciprocal point loss	96.82	0.9659	97.44	0.9723	95.88	0.9520	95.97	0.9660
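The relative improvements quoted for Table 2 appear to be relative error reductions rather than ratios of the raw scores. This reading is an inference from the numbers, not something the paper states explicitly. Under that assumption, the quoted figures can be reproduced with a short check (the function name is hypothetical):

```python
def relative_error_reduction(baseline, proposed, perfect=100.0):
    """Percentage of the baseline's remaining error removed by the proposed method."""
    return (proposed - baseline) / (perfect - baseline) * 100.0

# GSC V0.02-12, reciprocal point loss vs. triplet loss (values from Table 2).
acc_gain_12 = relative_error_reduction(96.11, 97.01)              # ACC in %
f1_gain_12 = relative_error_reduction(0.9478, 0.9668, perfect=1.0)

# GSC V0.02-11, reciprocal point loss vs. Softmax cross-entropy loss.
acc_gain_11 = relative_error_reduction(97.44, 98.10)
f1_gain_11 = relative_error_reduction(0.9721, 0.9792, perfect=1.0)
```

Rounded to one decimal place, these give 23.1, 36.4, 25.8, and 25.4, which match the percentages reported in the text.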


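To further verify the effectiveness of the proposed methods, the robustness of all methods is compared in Table 3. For GSC V0.01-11 and V0.02-11, the training and test data both come from the GSC dataset. The adversarial reciprocal point loss performs best, improving AUC by 24.8% and 18.4% relative to the Softmax cross-entropy loss on GSC V0.01-11 and V0.02-11, respectively. To simulate real environments in which the training and test distributions differ, the models trained on GSC are also tested with the LibriWords datasets, where the generalized convolutional prototype loss performs best on three of the datasets. Taking LibriWords10 as an example, the generalized convolutional prototype loss improves AUC by 12.5% relative to the prototype loss. The results show that metric-learning-based methods reject non-keyword classes better than direct use of a classifier, and that the generalized convolutional prototype loss and the reciprocal point loss, which combine the advantages of cross-entropy classification and distance metrics, can handle both classification and detection. In addition, the triplet loss is weaker than the other methods in AUC, because it does not adequately learn the differences between sample distributions and tends to judge non-keywords from unknown distributions as keywords.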

Table 3 AUC scores of the proposed methods and the four baseline methods

Loss function	GSC V0.01-11	GSC V0.02-11	LibriWords10	LibriWords100	LibriWords1000	LibriWords10000
Softmax cross-entropy loss^[3]	0.8999	0.9052	0.9264	0.9229	0.9268	0.9212
Prototype loss^[8]	0.8735	0.8941	0.9239	0.9329	0.9324	0.9380
Triplet loss^[9]	0.6898	0.7099	0.6073	0.5216	0.5222	0.5176
Multi-class AUC loss^[12]	0.8894	0.9087	0.9161	0.8963	0.9117	0.8939
Generalized convolutional prototype loss	0.8949	0.9155	0.9334	0.9356	0.9332	0.9250
Reciprocal point loss	0.8744	0.9126	0.8860	0.8882	0.9014	0.8989
Adversarial reciprocal point loss	0.9247	0.9226	0.9279	0.9214	0.9204	0.9175


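t-distributed stochastic neighbor embedding (t-SNE)^[35] is used to visualize the feature vectors extracted by the KWS models. Taking V0.01-12 as an example, the samples of each class from systems trained on the 12-class task with four objective functions are projected into a two-dimensional space. Each sample is a point in the plane, samples of the same class share a color, and the closer two points are, the more similar their distributions and the more likely they belong to the same class. As shown in Fig. 2, the generalized convolutional prototype loss adds a prototype regularization constraint to the Softmax cross-entropy loss, which makes feature vectors of the same class more compact in the feature space; the reciprocal point loss exploits the dissimilarity between classes and enlarges the distance between different classes; and the adversarial reciprocal point loss, building on the reciprocal point loss, uses the angular information of different keyword classes to improve inter-class separability further. The proposed methods separate the keyword classes better and distinguish keywords from non-keywords.

Fig. 2 t-SNE visualization of the feature vectors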


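5 Conclusions

This paper introduces the generalized convolutional prototype loss and the reciprocal point loss into KWS in open scenarios and proposes two accurate and robust KWS methods, which maintain high KWS accuracy while significantly improving non-keyword detection. The results show that the proposed methods still detect spoken non-keywords well when no non-target keywords take part in training. Comparison with four baseline methods on the GSC and LibriSpeech datasets shows that, for small-footprint KWS models, the proposed methods clearly outperform the baselines on most metrics.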

References

[1]	CHEN G G, PARADA C, HEIGOLD G. Small-footprint keyword spotting using deep neural networks[C]// 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy: IEEE, 2014: 4087-4091.
[2]	ARIK S O, KLIEGL M, CHILD R, et al. Convolutional recurrent neural networks for small-footprint keyword spotting[C]// 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017: 1606-1610.
[3]	TANG R, LIN J. Deep residual learning for small-footprint keyword spotting[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, Canada: IEEE, 2018: 5484-5488.
[4]	SHAN C H, ZHANG J B, WANG Y J, et al. Attention-based end-to-end models for small-footprint keyword spotting[C]// 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018: 2037-2041.
[5]	XU M L, ZHANG X L. Depthwise separable convolutional ResNet with squeeze-and-excitation blocks for small-footprint keyword spotting[C]// 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA, 2020: 2547-2551.
[6]	YANG C, WEN X, SONG L M. Multi-scale convolution for robust keyword spotting[C]// 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA, 2020: 2577-2581.
[7]	ZHANG P, ZHANG X L. Deep template matching for small-footprint and configurable keyword spotting[C]// 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA, 2020: 2572-2576.
[8]	HUH J, LEE M, HEO H, et al. Metric learning for keyword spotting[C]// 2021 IEEE Spoken Language Technology Workshop. Shenzhen, China: IEEE, 2021: 133-140.
[9]	VYGON R, MIKHAYLOVSKIY N. Learning efficient representations for keyword spotting with triplet loss[C]// 23rd International Conference on Speech and Computer. St. Petersburg, Russia: Springer, 2021: 773-785.
[10]	JUNG J, KIM Y, PARK J, et al. Metric learning for user-defined keyword spotting[C]// ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island, Greece: IEEE, 2023: 1-5.
[11]	RUSCI M, TUYTELAARS T. Few-shot open-set learning for on-device customization of keyword spotting systems[J/OL]. arXiv. (2023-06-03)[2023-10-01]. https://arxiv.org/abs/2306.02161.
[12]	XU M L, LI S Q, LIANG C D, et al. Multi-class AUC optimization for robust small-footprint keyword spotting with limited training data[C]// 23rd Annual Conference of the International Speech Communication Association. Incheon, South Korea: ISCA, 2022: 3278-3282.
[13]	BENDALE A, BOULT T E. Towards open set deep networks[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 1563-1572.
[14]	YANG H M, ZHANG X Y, YIN F, et al. Convolutional prototype network for open set recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(5): 2358-2370.
[15]	CHEN G Y, PENG P X, WANG X Q, et al. Adversarial reciprocal points learning for open set recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(11): 8065-8081.
[16]	WARDEN P. Speech commands: A public dataset for single-word speech recognition[DB/OL]. [2023-10-01]. http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz.
[17]	WARDEN P. Speech commands: A dataset for limited-vocabulary speech recognition[DB/OL]. [2023-10-01]. http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz.
[18]	SAINATH T N, PARADA C. Convolutional neural networks for small-footprint keyword spotting[C]// 16th Annual Conference of the International Speech Communication Association. Dresden, Germany: ISCA, 2015: 1478-1482.
[19]	ZHANG Y D, SUDA N, LAI L Z, et al. Hello edge: Keyword spotting on microcontrollers[J/OL]. arXiv. (2017-11-20) [2023-10-01]. https://arxiv.org/abs/1711.07128.
[20]	MITTERMAIER S, KÜRZINGER L, WASCHNECK B, et al. Small-footprint keyword spotting on raw audio data with sinc-convolutions[C]// ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE, 2020: 7454-7458.
[21]	CHOI S, SEO S, SHIN B, et al. Temporal convolution for real-time keyword spotting on mobile devices[C]// 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019: 3372-3376.
[22]	HOFFER E, AILON N. Deep metric learning using triplet network[C]// Third International Workshop on Similarity-Based Pattern Recognition. Copenhagen, Denmark: Springer, 2015: 84-92.
[23]	CHOPRA S, HADSELL R, LECUN Y. Learning a similarity metric discriminatively, with application to face verification[C]// 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005: 539-546.
[24]	PARKHI O M, VEDALDI A, ZISSERMAN A. Deep face recognition[C]// Proceedings of the British Machine Vision Conference 2015. Swansea, UK: British Machine Vision Association Press, 2015: 1-12.
[25]	SCHROFF F, KALENICHENKO D, PHILBIN J. FaceNet: A unified embedding for face recognition and clustering[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 815-823.
[26]	NAGRANI A, CHUNG J S, ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset[C]// 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017: 2616-2620.
[27]	ZHANG C L, KOISHIDA K. End-to-end text-independent speaker verification with triplet loss on short utterances[C]// 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017: 1487-1491.
[28]	PARNAMI A, LEE M. Few-shot keyword spotting with prototypical networks[C]// 2022 7th International Conference on Machine Learning Technologies. Rome, Italy: ACM, 2022: 277-283.
[29]	SETH H, KUMAR P, SRIVASTAVA M M. Prototypical metric transfer learning for continuous speech keyword spotting with limited training data[C]// 14th International Conference on Soft Computing Models in Industrial and Environmental Applications. Seville, Spain: Springer, 2020: 273-280.
[30]	GE Z Y, DEMYANOV S, GARNAVI R. Generative openmax for multi-class open set classification[C]// British Machine Vision Conference. London, UK: British Machine Vision Association Press, 2017.
[31]	YANG H M, ZHANG X Y, YIN F, et al. Robust classification with convolutional prototype learning[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE, 2018: 3474-3482.
[32]	CHEN G Y, QIAO L M, SHI Y M, et al. Learning open set network with discriminative reciprocal points[C]// 16th European Conference on Computer Vision. Glasgow, UK: Springer, 2020: 507-522.
[33]	PANAYOTOV V, CHEN G G, POVEY D, et al. Librispeech: An ASR corpus based on public domain audio books[C]// 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. South Brisbane, Australia: IEEE, 2015: 5206-5210.
[34]	MCAULIFFE M, SOCOLOF M, MIHUC S, et al. Montreal forced aligner: Trainable text-speech alignment using kaldi[C]// 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017: 498-502.
[35]	VAN DER MAATEN L, HINTON G. Visualizing data using t-SNE[J]. Journal of Machine Learning Research, 2008, 9(86): 2579-2605.
