Man-machine Speech Communication

Multi-channel speech enhancement network based on speech harmonic structures

  • Alip NURALI 1,
  • Rui CAO 1,
  • Jin LI 1,
  • Meng GE 1,
  • Tianrui WANG 1,
  • Haoyu WANG 1,
  • Zhongjian CUI 1,
  • Longbiao WANG 1, *,
  • Jianwu DANG 2
  • 1. Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin 300072, China
  • 2. Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

Received date: 2024-09-11

Online published: 2025-06-26


Abstract

Objective: Multi-channel speech enhancement (MCSE) focuses on recovering clean speech signals by leveraging both the spatial relationships and the temporal-spectral information captured by microphone arrays. MCSE plays a crucial role in applications such as teleconferencing, human-computer interaction, and distant speech recognition, where enhancing speech quality and intelligibility is critical. However, existing approaches often struggle in challenging acoustic environments with low signal-to-noise ratios (SNRs) and high reverberation. While these methods excel at noise reduction and distortion minimization, they often fail to preserve the spatial and acoustic structure of speech in adverse environments. Specifically, methods that rely on networks to implicitly learn joint spectral and spatial information tend to struggle with accurately predicting beamforming weights, which are crucial for noise suppression. This challenge frequently results in speech distortion, such as loss of harmonic structures and decreased speech intelligibility.

Methods: To address these issues, this study proposes a two-stage MCSE framework that explicitly incorporates the harmonic structure of speech. The first stage integrates a speech harmonic information extraction module into a UNet-based beamforming network; this module is designed to capture and retain the harmonic structure, which is fundamental to human auditory perception and speech clarity. The second stage introduces a residual iterative correction module to refine the speech signal; this stage focuses on finer acoustic structures that were not fully processed during the initial stage. By progressively reducing residual distortions, this approach improves speech quality even in environments with low SNRs and significant reverberation, conditions where traditional methods typically fall short.

Results: This study evaluated the proposed framework on a dataset derived from LibriSpeech, simulating challenging acoustic environments with varying SNR levels and reverberation conditions. The results demonstrated significant improvements over existing MCSE techniques in noise suppression and speech distortion reduction. Specifically, in low-SNR and highly reverberant environments, the beamforming network with harmonic structure information preserved the spatial and spectral characteristics of the speech signal better than traditional methods. Compared with approaches relying solely on implicitly learned spectral-spatial information, the proposed model was more effective at retaining speech intelligibility and clarity. The inclusion of harmonic structure information allowed the proposed framework to differentiate better between speech and noise, producing more robust and reliable enhancement results under adverse conditions. The residual iterative correction stage further improved model performance by refining unprocessed acoustic structures, thus reducing residual distortions and enhancing spectral richness in the enhanced speech signal.

Conclusions: The proposed two-stage MCSE framework successfully addresses the limitations of traditional MCSE methods by explicitly modeling and preserving the harmonic structure of speech. By integrating this information, the framework enhances spatial-spectral processing, enabling superior noise suppression and restoration of speech clarity, particularly in complex acoustic environments.
The findings of this study indicate the significance of harmonic structures in human auditory perception and how their preservation contributes to improved intelligibility. The inclusion of the residual iterative correction stage in the proposed framework ensures that finer acoustic details are addressed, allowing the model to perform reliably in low-SNR and high-reverberation scenarios where conventional approaches often fail. This research underscores the importance of incorporating explicit acoustic structure information into MCSE systems, paving the way for more advanced speech enhancement models that can reliably address acoustic challenges in real-world conditions. The results of this paper can serve as a reference for multi-channel speech enhancement schemes in extreme acoustic scenarios.

Cite this article

Alip NURALI, Rui CAO, Jin LI, Meng GE, Tianrui WANG, Haoyu WANG, Zhongjian CUI, Longbiao WANG, Jianwu DANG. Multi-channel speech enhancement network based on speech harmonic structures[J]. Journal of Tsinghua University (Science and Technology), 2025, 65(7): 1328-1335. DOI: 10.16511/j.cnki.qhdxxb.2025.26.036

Multi-channel speech enhancement (MCSE) has become a research hotspot in speech signal processing and is widely applied in acoustic scenarios such as video conferencing, human-computer interaction, and far-field automatic speech recognition (ASR) [1-2]. Like single-channel speech enhancement (SCSE), MCSE aims to enhance target speech degraded by interference such as background noise and reverberation. The difference is that a multi-channel microphone array provides spatial information, which allows MCSE networks to better improve speech quality and intelligibility in complex acoustic environments.
The traditional approach to MCSE is acoustic beamforming [3-6], which uses a beamformer to enhance the target signal and suppress interference. With the development of deep neural networks (DNNs), neural beamformers (NBF) have significantly outperformed traditional spatial filters. Existing methods fall roughly into three categories. The first estimates speech/noise masks with a network and applies a traditional beamformer for spatial filtering [7-8], but the modules have mismatched optimization objectives, limiting the performance gain. The second extends SCSE, adopting end-to-end training and exploiting spatial features explicitly or implicitly; although its theoretical performance ceiling is higher, the nonlinear processing can distort speech in complex scenarios [9-10]. The third uses end-to-end frame-wise beamformers, replacing part of the signal processing with a DNN that estimates the beamforming weights. Ref. [11] estimates beamforming weights in the time domain. Ref. [12] proposed the EaBNet model, which performs frame-wise beamforming with an abstract embedding representation containing discriminative spectral-spatial information, validating the importance of spectral-spatial features for frame-wise beamforming. Although effective, these methods ignore the reverberation component and consider only directional sources.
In current frame-wise all-neural beamformers, spectral-spatial patterns typically become entangled in a nonlinear feature space, lacking interpretability and an internal mechanism for information interaction [13]. Tesch et al. [14] argued that spectral and spatial information processing interact in the enhancement task: handling them separately leads to a lower performance ceiling, and rich global spectral information contributes more than other information to the network's spatial selectivity. Although existing methods introduce spatial information into neural networks explicitly or implicitly and process it accordingly, their treatment of spectral information stops at the multi-channel spectrogram obtained after the short-time Fourier transform (STFT) [15-17]. This greatly limits the performance ceiling of MCSE networks in complex acoustic environments.
Specifically, in low signal-to-noise ratio (SNR) and highly reverberant environments the spectral structure of speech is destroyed; simply feeding multi-channel spectrograms into the network makes it difficult to fully learn the spectral information of the target speech, while relying solely on spatial information degrades model performance, manifested as missing harmonic structures and speech distortion. To address these problems, and inspired by the strong performance of prior modeling of speech acoustic structure in SCSE [18-19], this paper proposes Harmonic BM, a multi-stage MCSE network based on the harmonic structure of speech, which strengthens the interaction between the spectral structure of speech and spatial features to perform speech enhancement and dereverberation simultaneously. Specifically, in the first stage, inspired by the beamforming approach of EaBNet, a multi-scale acoustic structure extraction module [20] is embedded in a UNet beamforming network; this module applies multi-scale enhancement to intermediate features of the spectral processing before passing them to the decoder units. In the later stage, residual iteration corrects the speech spectral information, further enhancing acoustic structures that were not fully processed and mitigating the loss of spectral structure information caused by speech distortion. This multi-stage processing lets the network exploit spectral information explicitly, acquiring more spectral detail and improving its ability to learn joint spectral-spatial information in adverse acoustic environments.

1 Signal model

Let x^(p)(t) denote the noisy reverberant speech received by the p-th microphone, p = 0, 1, …, P-1. After the STFT, the MCSE signal model can be written as
$\boldsymbol{X}_{t, f}=\boldsymbol{c}_f S_{t, f}+\boldsymbol{V}_{t, f}+\boldsymbol{N}_{t, f}=\boldsymbol{S}_{t, f}+\boldsymbol{V}_{t, f}+\boldsymbol{N}_{t, f} .$
where $\left\{\boldsymbol{X}_{t, f}, \boldsymbol{S}_{t, f}, \boldsymbol{V}_{t, f}, \boldsymbol{N}_{t, f}\right\} \in \mathbb{C}^{P \times 1}$ are, respectively, the mixture, the anechoic clean speech, the reverberation component, and the noise component at time index t∈{1, 2, …, T} and frequency index f∈{1, 2, …, F}; $\boldsymbol{c}_f \in \mathbb{C}^{P \times 1}$ is the relative transfer function (RTF) of the speech; and S_{t,f} is the single-channel signal, so that $\boldsymbol{S}_{t, f}=\boldsymbol{c}_f S_{t, f}$ is the multi-channel direct-path speech. Without loss of generality, the first microphone is chosen as the reference channel, and the sources in all speech signals are assumed static. Unlike Refs. [21-22], which remove only directional noise, this paper treats both the noise and reverberation components as interference and removes both.
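The following numpy sketch illustrates this signal model with toy data; the array shapes follow the paper's notation (P microphones, T frames, F bins, cf. Sec. 3.2), while the random tensors and their scales are purely illustrative assumptions.

```python
# Toy illustration of X_{t,f} = c_f * S_{t,f} + V_{t,f} + N_{t,f}.
import numpy as np

P, T, F = 6, 100, 161   # mics, frames, frequency bins (cf. Sec. 3.2)
rng = np.random.default_rng(0)

def crandn(*shape):
    """Complex Gaussian placeholder for STFT-domain data."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

S = crandn(T, F)           # anechoic clean speech (single channel)
c = crandn(P, F)           # relative transfer function (RTF) per frequency
V = 0.1 * crandn(P, T, F)  # reverberation component (illustrative scale)
N = 0.1 * crandn(P, T, F)  # noise component (illustrative scale)

# Multi-channel mixture; microphone 0 is taken as the reference channel.
X = c[:, None, :] * S[None, :, :] + V + N
assert X.shape == (P, T, F)
```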

2 Harmonic BM

Harmonic BM is the overall method proposed in this paper; its main framework is shown in Fig. 1. It consists of a beamforming module, an acoustic structure extraction module, and residual iterative correction modules. The beamforming module mimics the behavior of a traditional frequency-domain beamformer and generates the filtering weights; the acoustic structure extraction module extracts acoustic structure features grounded in the articulation mechanism and expands the channel dimension during processing; and the residual iterative correction modules are chained together to form the residual iterative corrector network (RICN), which uses the outputs of the beamforming and acoustic structure extraction modules, i.e., it compensates the spectrum with acoustic structures derived from spectral priors.
Fig. 1 Overall framework of Harmonic BM

2.1 Harmonic distribution extractor in the beamforming network

The harmonic distribution extractor is embedded in the UNet [12] backbone to recover the harmonic structures that are inherent in the speech signal but masked by noise, improving the network's perception of the prior acoustic structure of the target speech.
The harmonic distribution extractor is adapted from the harmonic distribution extraction module of Ref. [18]. Given a noisy spectrum X_i, a convolution module first performs preliminary filtering on the input; this module consists of a causal 2D convolution, layer normalization, a parametric rectified linear unit, and other components. By applying a comb-to-fundamental-frequency conversion matrix Q containing L candidate fundamental frequencies, the harmonic distribution extractor actively learns to locate the harmonics in the spectrum:
$\boldsymbol{H}=\operatorname{softmax}\left(\operatorname{Conv}\left(X_i^2\right) \boldsymbol{Q}^{\mathrm{T}}\right) \boldsymbol{Q}, \quad \boldsymbol{H} \in \mathbb{R}^{T \times P \times F}, \ \boldsymbol{Q} \in \mathbb{R}^{L \times F} .$
where H is a recombined representation of Q that encodes the distribution of the harmonics. Placed as a feature enhancer between the encoder and decoder units at every level, the harmonic distribution extractor can effectively uncover, from spectra heavily contaminated by noise and reverberation, spectral details that the plain network struggles to learn; its output is then passed to the decoder units, effectively combining implicitly learned spatial information with explicitly injected spectral information and thereby improving the model's ability to discriminate target time-frequency units.
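As a rough illustration, the PyTorch sketch below implements the equation above. The comb matrix Q, the F0 candidate range, and the single-channel convolution block are simplified assumptions standing in for the design of Ref. [18]; causal padding is omitted for brevity.

```python
import torch
import torch.nn as nn

class HarmonicDistributionExtractor(nn.Module):
    """Sketch of H = softmax(Conv(X_i^2) Q^T) Q with a fixed comb matrix Q."""
    def __init__(self, n_freq=161, n_f0=64):
        super().__init__()
        # Stand-in for the paper's conv module (causal conv + LN + PReLU);
        # symmetric padding keeps shapes simple but is not causal.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=(5, 3), padding=(2, 1)),
            nn.PReLU(),
        )
        # Q: each row is a comb with ones at the harmonics of one candidate
        # F0 (roughly 75-400 Hz given 50 Hz/bin at 16 kHz with a 320-pt FFT).
        Q = torch.zeros(n_f0, n_freq)
        for l, f0 in enumerate(torch.linspace(1.5, 8.0, n_f0)):
            idx = (torch.arange(1, n_freq) * f0).round().long()
            Q[l, idx[idx < n_freq]] = 1.0
        self.register_buffer("Q", Q / Q.sum(dim=1, keepdim=True))

    def forward(self, x):                  # x: (B, 1, T, F) magnitude spectrum
        e = self.conv(x ** 2)              # filtered power spectrum
        score = torch.einsum("bctf,lf->bctl", e, self.Q)  # evidence per F0
        h = torch.softmax(score, dim=-1)   # soft F0 posterior over L combs
        return torch.einsum("bctl,lf->bctf", h, self.Q)   # harmonic map H
```

In the full model the resulting harmonic map is not used on its own but is merged into the UNet skip connections, as described next.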

2.2 Beamforming with multi-scale enhanced spectral information

Because the harmonic distribution extractor is bridged between the encoder and decoder at every level of the UNet, it naturally accommodates a multi-scale design. As shown in Fig. 2, the input spectrum X_i is downsampled by average pooling to X′_i, X″_i, and X‴_i, which are fed into the encoder and the harmonic distribution extractor at their respective scales. First, 2D average pooling is applied to the spectrogram, matching the downsampling steps in the first half of the UNet; after pooling, the noisy spectrogram has the same size as the encoder-downsampled feature maps. Then, the pooled N-channel speech spectrum is concatenated with the C_k feature maps, yielding an expanded set of N + C_k feature maps as the encoder input. Meanwhile, the spectra at each scale are also fed into the harmonic distribution extractor to extract fine-grained information, producing N feature maps; these are concatenated with the encoder outputs via skip connections and fed jointly into the decoder (a minimal sketch of this bridging is given after Fig. 2). The UNet finally outputs an abstract embedding E containing discriminative spectral-spatial feature information. E passes through a beamforming-weight derivation module to obtain the final beamforming weights $\widetilde{M}$, and the enhanced speech is obtained by filtering and summing with the original spectrum. The whole beamforming network can thus perceive the acoustic structure of the target speech at different resolutions and retain the desired spectral structure, which not only enriches the spectral information but also promotes full interaction between spectral and spatial information.
Fig. 2 Internal architecture of the beamforming module
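Under assumed shapes (a noisy spectrogram with N channels, an encoder level with C_k feature maps), one level of this bridging might look as follows; `extractor` is any callable such as the harmonic extractor sketched in Sec. 2.1.

```python
import torch
import torch.nn.functional as F

def bridge_level(spec, enc_feat, extractor):
    """spec: (B, N, T, F) noisy spectrogram; enc_feat: (B, Ck, T', F')."""
    # 2D average pooling so the spectrogram matches this level's feature
    # maps, mirroring the downsampling in the first half of the UNet.
    pooled = F.adaptive_avg_pool2d(spec, enc_feat.shape[-2:])  # (B, N, T', F')
    enc_in = torch.cat([pooled, enc_feat], dim=1)   # N + Ck encoder input maps
    harm = extractor(pooled)                        # fine-grained harmonic maps
    skip = torch.cat([harm, enc_feat], dim=1)       # joins the decoder via skip
    return enc_in, skip
```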

2.3 Residual iterative correction module

Although explicitly introducing speech acoustic structure information in the beamforming module largely improves the model's information fusion and its perception of the prior structure of speech, and thus effectively preserves most of the target speech, a single beamforming pass with additional spectral information can hardly recover the magnitude and phase of speech perfectly. To repair the spectrum further on top of the speech enhanced by the beamforming module, this paper introduces the residual iterative correction module [20]. Specifically, RICN uses the acoustic structure features produced by the acoustic structure extraction module to correct and compensate the pre-enhanced term $\hat{S}_p$ from beamforming. As shown in Fig. 1, the residual iterative correction module consists of multiple interconnected sub-correctors that form a sequential chain. Let $\hat{S}_{f_i}$ denote the output of the i-th corrector R_i. Given the extracted acoustic structure features A and the residual spectrum after the previous (i-1) corrections, $\hat{S}_p-\sum\limits_{j=0}^{i-1} \hat{S}_{f_j}$, R_i can identify harmonic positions through Q and use A to fill in and compensate exactly those harmonic structures that earlier steps did not sufficiently enhance:
$P_i=\operatorname{HI}\left\{\operatorname{BN}\left(\operatorname{Conv}\left(\operatorname{Concat}\left(A, \hat{S}_p-\sum\limits_{j=0}^{i-1} \hat{S}_{f_j}\right)\right)\right), \boldsymbol{Q}\right\} .$
where HI(·), BN(·), and Concat(·) denote harmonic integration [18], batch normalization, and concatenation, respectively, and $\hat{S}_{f_j}$ is the output of the j-th corrector. To strengthen the interaction across the time, frequency, and channel dimensions, frequency-channel recombination (FCR) and a DPRNN [23] further adjust the spectrum, giving the overall flow
$\hat{S}_{f_i}=R_i\left(A, \hat{S}_p-\sum\limits_{j=0}^{i-1} \hat{S}_{f_j}\right)=\operatorname{Conv}\left(\operatorname{DPRNN}\left(\operatorname{FCR}\left(P_i\right)\right)\right) .$
In addition, a subtraction path (S-path) and an addition path (A-path) are introduced as two forward computation paths that realize the residual iteration. The S-path is a residual connection between $\hat{S}_p$ and the $\hat{S}_{f_j}$, performing the subtraction $\hat{S}_p-\sum\limits_{j=0}^{i-1} \hat{S}_{f_j}$; the A-path performs the addition $\hat{S}_p+\sum\limits_{i=1}^{N} \hat{S}_{f_i}$. By filtering the corrected spectral features $\hat{S}_{f_j}$ out of $\hat{S}_p$, the S-path lets each corrector progressively focus on the remaining spectrum that neither $\hat{S}_p$ nor the previous correctors have captured.
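The loop below sketches this chain in PyTorch, following the S-path/A-path description above. Each corrector R_i is treated as a black box (internally Conv, HI, FCR, and DPRNN per the equations); the shapes and the channel-axis concatenation are assumptions.

```python
import torch
import torch.nn as nn

class RICN(nn.Module):
    """Residual iterative corrector chain (sketch)."""
    def __init__(self, correctors):          # e.g. 3 correctors, cf. Sec. 3.2
        super().__init__()
        self.correctors = nn.ModuleList(correctors)

    def forward(self, A, S_p):
        # S-path: each corrector sees what its predecessors left uncorrected.
        residual, corrections = S_p, []
        for R_i in self.correctors:
            S_fi = R_i(torch.cat([A, residual], dim=1))  # R_i(A, residual)
            corrections.append(S_fi)
            residual = residual - S_fi
        # A-path: add all corrections back onto the pre-enhanced spectrum.
        return S_p + sum(corrections)
```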

3 Experimental setup and baselines

3.1 Datasets

Multi-channel noisy-clean speech pairs were generated from the open-source LibriSpeech corpus [24], with train-clean-100, dev-clean, and test-clean used for training, validation, and evaluation, respectively. The noise data come from the DNS-Challenge, from which about 20 000 noise samples were randomly selected for training. Multi-channel room impulse responses (RIRs) were generated with the image method [25] for a six-microphone uniform linear array (ULA) with 0.05 m spacing between adjacent microphones. Room dimensions (length × width × height) vary randomly from 5.00 m × 5.00 m × 3.00 m to 10.00 m × 10.00 m × 4.00 m, and reverberation times range from 0.10 to 0.70 s. The distance from the target and noise sources to the microphones ranges from 0.50 m to 5.00 m in 0.50 m steps, so that the trained model adapts to a wider range of acoustic scenes. In the experimental configuration, the direction of arrival (DOA) quantifies the angular difference between the target signal and the noise source relative to the six-microphone ULA; sources with different DOAs are constrained to an angular difference of at least 5° to strengthen the model's ability to distinguish sources. Training SNRs are drawn randomly from [-6 dB, 6 dB]. 40 000 and 4 000 pairs were generated for training and validation, respectively, each utterance lasting 4.00 s. For the test set, about 50 unseen environmental noises were selected from MUSAN [26], with four SNR conditions (-10, -5, -2, and 0 dB) and 200 test pairs per condition.
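As an illustration of the mixing protocol (not the authors' released code), the helper below scales a noise signal so that the reference channel of the mixture reaches a requested SNR; the signal names and shapes are assumptions, and the reverberant signals are presumed to have been convolved with RIRs beforehand.

```python
import numpy as np

def mix_at_snr(reverb_speech, noise, snr_db, eps=1e-8):
    """reverb_speech, noise: (P, n_samples) multi-channel time signals."""
    p_s = np.sum(reverb_speech[0] ** 2)  # power at the reference mic (mic 0)
    p_n = np.sum(noise[0] ** 2)
    # Scale noise so that 10*log10(p_s / (scale^2 * p_n)) == snr_db.
    scale = np.sqrt(p_s / (10.0 ** (snr_db / 10.0) * p_n + eps))
    return reverb_speech + scale * noise

snr_db = np.random.uniform(-6.0, 6.0)    # training SNR drawn from [-6, 6] dB
```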

3.2 Configuration

In the beamforming module, the kernel sizes of the 2D-GLU and UNet blocks are (1, 3) and (2, 3), respectively, each with 64 channels. The channel counts of the acoustic structure extraction module and of each corrector are {6, 12} and {6, 2}, respectively. The kernel sizes of the convolution module and of the harmonic integration are (5, 2) and (1, 3), respectively, and all convolutions use a stride of (1, 1). Three correctors are used, following the best result in Ref. [20].
All speech data are 6.00 s long and sampled at 16 kHz. A 0.02 s Hann window with 50% frame overlap is used together with a 320-point fast Fourier transform, giving 161-D features along the frequency dimension. The model is trained with the Adam optimizer [27]. The overall loss combines the time-domain scale-invariant SNR (SI-SNR) [28] with the perceptual metric for speech quality evaluation (PMSQE) [29], which is based on human psychoacoustic perception:
$\mathcal{L}=\mathcal{L}_{\mathrm{SI-SNR}}+\mathcal{L}_{\mathrm{PMSQE}} .$
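A common PyTorch formulation of the SI-SNR term is sketched below; the PMSQE term would come from an implementation of Ref. [29] and appears here only as a hypothetical `pmsqe_loss` callable.

```python
import torch

def si_snr_loss(est, ref, eps=1e-8):
    """Negative SI-SNR; est, ref: (B, n_samples) time-domain signals."""
    est = est - est.mean(dim=-1, keepdim=True)  # remove DC for scale invariance
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to get the target component.
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    s_target = dot * ref / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)
    e_noise = est - s_target                    # everything off the target axis
    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    )
    return -si_snr.mean()                       # minimize = maximize SI-SNR

# total = si_snr_loss(est, ref) + pmsqe_loss(est, ref)   # pmsqe_loss assumed
```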
For comparison, other state-of-the-art baselines were reproduced, including NSF [30], FasNet-TAC [11], FT-JNF [13], EaBNet [12], TaEr [31], and Oracle MVDR.

4 Results and analysis

4.1 Method comparison

Compared with recent state-of-the-art MCSE baselines, Harmonic BM performs best; detailed comparisons are given in Table 1. Three evaluation metrics are used: perceptual evaluation of speech quality (PESQ) [32], short-time objective intelligibility (STOI) [33], and scale-invariant signal-to-distortion ratio (SI-SDR) [34]. As Table 1 shows, the metrics of all systems generally rise as the SNR of the mixture increases; low-SNR acoustic environments severely affect the performance of speech enhancement systems. Harmonic BM surpasses all other baselines on every metric. At an SNR of -10 dB, for example, Harmonic BM scores 3.25, 0.93, and 8.41 on PESQ, STOI, and SI-SDR, clearly outperforming the other methods. This comparison shows that, compared with implicitly learning joint spectral-spatial information under adverse acoustic conditions, explicitly introducing prior spectral information of speech significantly enriches the information available to the model and promotes effective interaction between spectral and spatial information.
Table 1  Comparison of Harmonic BM with baseline methods

| Method | Params/10^6 | SNR = -10 dB | SNR = -5 dB | SNR = -2 dB | SNR = 0 dB | Mean |
| --- | --- | --- | --- | --- | --- | --- |
| Noisy | - | 1.41/0.35/-10.59 | 1.51/0.40/-5.65 | 1.60/0.43/-2.11 | 1.60/0.51/-0.42 | 1.53/0.42/-4.69 |
| NSF | 12.96 | 2.66/0.87/4.99 | 2.85/0.89/6.27 | 2.93/0.90/6.84 | 2.97/0.91/7.17 | 2.85/0.89/6.32 |
| FasNet-TAC | 2.76 | 2.37/0.85/5.32 | 2.67/0.89/7.53 | 2.81/0.90/8.40 | 2.90/0.91/8.91 | 2.69/0.89/7.54 |
| FT-JNF | 3.35 | 2.70/0.88/6.11 | 2.82/0.90/7.19 | 2.88/0.91/7.73 | 2.92/0.91/8.81 | 2.83/0.90/7.46 |
| EaBNet | 2.82 | 2.83/0.90/6.95 | 3.30/0.92/8.31 | 3.13/0.93/8.91 | 3.18/0.93/9.24 | 3.11/0.92/8.35 |
| TaEr | 6.03 | 2.84/0.91/7.23 | 3.28/0.92/8.82 | 3.31/0.92/9.45 | 3.40/0.93/9.98 | 3.20/0.92/8.87 |
| Oracle MVDR | - | 2.12/0.69/7.55 | 2.32/0.73/8.10 | 2.47/0.77/9.00 | 2.51/0.83/9.99 | 2.36/0.75/8.66 |
| Harmonic BM | 3.33 | 3.25/0.93/8.41 | 3.40/0.94/9.71 | 3.47/0.94/10.30 | 3.50/0.95/10.61 | 3.41/0.94/9.76 |

Note: entries are given as PESQ/STOI/SI-SDR; higher values indicate better speech quality.

Thanks to the explicit introduction of spectral acoustic structure information (especially harmonic information), Harmonic BM achieves notable gains in PESQ; this effectively raises the model's sensitivity to the salient periodic acoustic structures in speech and thus greatly improves the consistency of the output with human auditory perception.
An ablation study of Harmonic BM compared three settings: removing the correction module (1), removing both the correction module and the harmonic distribution extractor (2), and removing only the harmonic distribution extractor (3); the results are shown in Table 2. Removing only the harmonic distribution extractor degrades all metrics, because when the spectral information is heavily corrupted by noise the network struggles to extract useful spectral information from it, making the fusion of spectral and spatial information difficult and reducing the richness of the joint spectral-spatial information. Although the residual iterative correction module is introduced for spectral compensation, the speech in the beamforming module's output is already distorted, so the vanished acoustic structures cannot be fully recovered and the performance improves only to a limited extent. This finding further validates the importance of explicitly introducing spectral information.
Table 2  Ablation results of Harmonic BM

| Model configuration | Params/10^6 | PESQ | STOI | SI-SDR |
| --- | --- | --- | --- | --- |
| Full model | 3.33 | 3.41 | 0.94 | 9.76 |
| (1) w/o correction module | 2.95 | 3.34 | 0.92 | 9.06 |
| (2) w/o correction module and harmonic distribution extractor | 2.82 | 3.12 | 0.92 | 8.35 |
| (3) w/o harmonic distribution extractor | 3.20 | 3.15 | 0.93 | 8.84 |

4.2 Visualization analysis

To analyze in depth how prior speech acoustic structure information helps neural beamforming, the spectrograms of each baseline and of Harmonic BM were visualized. As Fig. 3 shows, compared with Harmonic BM the other methods all suffer from some degree of noise over-suppression or under-suppression. Although EaBNet performs well in common acoustic environments, at a low SNR of -10 dB it exhibits the artifact marked by the yellow circle in Fig. 3d, where excessive denoising removes the original harmonic-like acoustic structure; the same occurs with NSF. FasNet-TAC preserves the harmonic structure but has very limited denoising ability, which is particularly evident in the blue circle in Fig. 3f. Clearly, the explicit introduction of spectral information not only strengthens the model's ability to acquire spectral-spatial information and to denoise, but also raises its sensitivity to important acoustic structures, striking an ideal balance between denoising and distortion prevention.
Fig. 3 Spectrogram comparison of baseline methods and Harmonic BM

5 Conclusion

This paper proposed a two-stage MCSE network based on the harmonic structure of speech, which explicitly introduces spectral harmonic structure information and significantly improves enhancement in complex acoustic environments. Under adverse conditions such as low SNR, Harmonic BM attains mean scores of 3.41, 0.94, and 9.76 on the three key metrics PESQ, STOI, and SI-SDR, a substantial improvement over the other baselines, and it maintains stable recovery of harmonic structures even in the extremely low-SNR scenario of -10 dB. This result confirms the importance of explicitly modeling speech harmonic information: as the core representation of the periodicity of speech, the harmonic structure not only aligns closely with human auditory perception but also provides the network with a strong prior constraint, alleviating the spectral-spatial entanglement caused by implicit learning in traditional methods. Through the two-stage cooperative design, the network effectively extracts and retains harmonic details in the beamforming stage and further fills in under-processed spectral structures in the residual iterative correction stage, achieving a dynamic balance between denoising and fidelity and offering a more stable enhancement scheme for speech interaction systems in complex scenarios (e.g., far-field speech recognition and smart conferencing devices).
This study still has limitations. First, the harmonic distribution extractor relies on a fixed set of candidate fundamental frequencies and adapts poorly to non-stationary speech or rapidly changing acoustic environments. Second, the model's computational complexity is high, and its real-time performance needs optimization for practical deployment. Future work will focus on the following directions: 1) designing a dynamic fundamental-frequency estimation module that draws on the mechanisms of speech production to model harmonic structures adaptively; and 2) exploring lightweight network architectures and model compression techniques to improve efficiency on edge devices, pushing speech enhancement toward intelligent, ubiquitous deployment.
1
YU J W, WU B, GU R Z, et al. Audio-visual multi-channel recognition of overlapped speech[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA, 2020: 3496-3500.

2
WU B, YU M, CHEN L W, et al. Distortionless multi-channel target speech enhancement for overlapped speech recognition[EB/OL]. (2020-07-03)[2024-09-11]. https://arxiv.org/abs/2007.01566.

3
HEYMANN J, DRUDE L, HAEB-UMBACH R. Neural network based spectral mask estimation for acoustic beamforming[C]//Proceedings of 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016: 196-200.

4
HEYMANN J, DRUDE L, CHINAEV A, et al. BLSTM supported GEV beamformer front-end for the 3RD CHiME challenge[C]//Proceedings of 2015 IEEE Workshop on Automatic Speech Recognition and Understanding. Scottsdale, USA: IEEE, 2015: 444-451.

5
WANG Z Q, WANG P D, WANG D L. Complex spectral mapping for single- and multi-channel speech enhancement and robust ASR[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1778-1787.

6
OCHIAI T, WATANABE S, HORI T, et al. Unified architecture for multichannel end-to-end speech recognition with neural beamforming[J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1274-1288.

7
ERDOGAN H, HERSHEY J R, WATANABE S, et al. Improved MVDR beamforming using single-channel mask prediction networks[C]//Proceedings of the 17th Annual Conference of the International Speech Communication Association. San Francisco, USA: ISCA, 2016: 1981-1985.

8
XIAO X, XU C, ZHANG Z, et al. A study of learning based beamforming methods for speech recognition[C]//Proceedings of the CHiME 2016 Workshop on Speech Processing in Everyday Environments. New York, USA: IEEE, 2016: 26-31.

9
GU R Z, WU J, ZHANG S X, et al. End-to-end multi-channel speech separation[EB/OL]. (2019-05-28)[2024-09-11]. https://arxiv.org/abs/1905.06286.

10
LIU W Z, LI A D, WANG X, et al. A neural beamspace-domain filter for real-time multi-channel speech enhancement[J]. Symmetry, 2022, 14(6): 1081.

11
LUO Y, CHEN Z, MESGARANI N, et al. End-to-end microphone permutation and number invariant multi-channel speech separation[C]//Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE, 2020: 46-50.

12
LI A D, LIU W Z, ZHENG C S, et al. Embedding and beamforming: All-neural causal beamformer for multichannel speech enhancement[C]//Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 6487-6491.

13
LI A D, MENG W X, YU G C, et al. TaylorBeamixer: Learning Taylor-inspired all-neural multi-channel speech enhancement from beam-space dictionary perspective[C]//Proceedings of the 24th Annual Conference of the International Speech Communication Association. Dublin, Ireland: ISCA, 2023: 1055-1059.

14
TESCH K, GERKMANN T. Insights into deep non-linear filters for improved multi-channel speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 563-575.

15
TZIRAKIS P, KUMAR A, DONLEY J. Multi-channel speech enhancement using graph neural networks[C]//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto, Canada: IEEE, 2021: 3415-3419.

16
LI A D, CHEN R L, GU Y, et al. Opine: Leveraging a optimization-inspired deep unfolding method for multi-channel speech enhancement[C]//Proceedings of 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Seoul, South Korea: IEEE, 2024: 11376-11380.

17
LV S B, FU Y H, JV Y K, et al. Spatial-DCCRN: DCCRN equipped with frame-level angle feature and hybrid filtering for multi-channel speech enhancement[C]//Proceedings of 2022 IEEE Spoken Language Technology Workshop. Doha, Qatar: IEEE, 2023: 436-443.

18
WANG T R, ZHU W B, GAO Y Y, et al. HGCN: Harmonic gated compensation network for speech enhancement[C]//Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 371-375.

19
WANG T R, ZHU W B, GAO Y Y, et al. Harmonic attention for monaural speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 2424-2436.

20
CAO R, WANG T R, GE M, et al. VoiCor: A residual iterative voice correction framework for monaural speech enhancement[C]//Proceedings of Interspeech 2024. Kos Island, Greece: ISCA, 2024: 4858-4862.

21
ZHANG Z H, XU Y, YU M, et al. ADL-MVDR: All deep learning MVDR beamformer for target speech separation[C]//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto, Canada: IEEE, 2021: 6089-6093.

22
XU Y, ZHANG Z H, YU M, et al. Generalized spatio-temporal RNN beamformer for target speech separation[C]//Proceedings of the 22nd Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA, 2021: 3076-3080.

23
LUO Y, CHEN Z, YOSHIOKA T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation[C]//Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE, 2020: 46-50.

24
PANAYOTOV V, CHEN G G, POVEY D, et al. Librispeech: An ASR corpus based on public domain audio books[C]//Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. South Brisbane, Australia: IEEE, 2015: 5206-5210.

25
ALLEN J B, BERKLEY D A. Image method for efficiently simulating small-room acoustics[J]. The Journal of the Acoustical Society of America, 1979, 65(4): 943-950.

26
SNYDER D, CHEN G G, POVEY D. MUSAN: A music, speech, and noise corpus[EB/OL]. (2015-10-28)[2024-09-11]. https://arxiv.org/abs/1510.08484.

27
KINGMA D P, BA J. Adam: A method for stochastic optimization[C]//International Conference on Learning Representations (ICLR). San Diego, USA: ICLR, 2015.

28
ISIK Y, LE ROUX J, CHEN Z, et al. Single-channel multi-speaker separation using deep clustering[C]//Proceedings of the 17th Annual Conference of the International Speech Communication Association. San Francisco, USA: ISCA, 2016: 545-549.

29
MARTIN-DOÑAS J M, GOMEZ A M, GONZALEZ J A, et al. A deep learning loss function based on the perceptual evaluation of the speech quality[J]. IEEE Signal Processing Letters, 2018, 25(11): 1680-1684.

30
TAN K, WANG Z Q, WANG D L. Neural spectrospatial filtering[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 605-621.

31
LI A D, YU G C, ZHENG C S, et al. A general unfolding speech enhancement method motivated by Taylor's theorem[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 3629-3646.

32
RIX A W, BEERENDS J G, HOLLIER M P, et al. Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs[C]//Proceedings of 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Salt Lake City, USA: IEEE, 2001: 749-752.

33
JENSEN J, TAAL C H. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(11): 2009-2022.

34
LE ROUX J, WISDOM S, ERDOGAN H, et al. SDR - half-baked or well done?[C]//Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE, 2019: 626-630.
