Multi-channel speech enhancement network based on speech harmonic structures

Alip NURALI, Rui CAO, Jin LI, Meng GE, Tianrui WANG, Haoyu WANG, Zhongjian CUI, Longbiao WANG, Jianwu DANG

Journal of Tsinghua University (Science and Technology), 2025, Vol. 65, Issue 7: 1328-1335. DOI: 10.16511/j.cnki.qhdxxb.2025.26.036

Human-Machine Speech Communication


Abstract

The main goal of multi-channel speech enhancement (MCSE) is to recover clean speech signals by exploiting the spatial relationships and time-frequency spectral information captured by a microphone array. Although existing methods perform well at denoising and reducing distortion, in adverse acoustic environments such as low signal-to-noise ratio (SNR) and high reverberation, the acoustic structure and spatial information of speech can be incomplete or blurred. Methods that rely on a network to implicitly learn spectral-spatial information then struggle to predict beamforming weights accurately, which leads to speech distortion. To address this problem, this paper designs a two-stage MCSE network based on the harmonic structure of speech, which explicitly introduces spectral harmonic-structure information. In the first stage, a speech-information extraction module is integrated into a UNet beamforming network to learn and preserve the harmonic structure; in the second stage, a residual iterative correction module is introduced to further enhance acoustic structures insufficiently processed in the first stage, thereby reducing distortion. Experimental results on a dataset simulated from LibriSpeech show that, in adverse acoustic environments, a beamforming network that explicitly incorporates acoustic-structure information significantly outperforms methods that rely only on implicitly learned spectral-spatial information in terms of distortion resistance. The findings can serve as a reference for multi-channel speech enhancement in extreme acoustic scenarios.
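To make the idea of explicitly supplied harmonic-structure information concrete, the following minimal NumPy sketch builds a soft comb-like mask that emphasizes STFT bins near integer multiples of an assumed fundamental frequency. The function name harmonic_mask, the Gaussian bump shape, and all parameter values are illustrative assumptions; the paper's actual extraction module is learned inside the network rather than hand-crafted.

```python
import numpy as np

def harmonic_mask(f0_hz, n_fft=512, sr=16000, n_harmonics=30, width_hz=40.0):
    """Soft spectral mask emphasizing bins near harmonics of f0_hz.

    Hypothetical helper for illustration only; the paper's module
    learns harmonic cues rather than using a fixed comb filter.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)  # center frequency of each STFT bin
    mask = np.zeros_like(freqs)
    for k in range(1, n_harmonics + 1):
        center = k * f0_hz
        if center > sr / 2:
            break
        # Gaussian bump around each harmonic; width_hz sets tolerance to f0 error
        mask += np.exp(-0.5 * ((freqs - center) / width_hz) ** 2)
    return np.clip(mask, 0.0, 1.0)

# Example: emphasize harmonics of a 200 Hz voice in one noisy STFT frame
mask = harmonic_mask(200.0)
noisy_frame = np.random.randn(len(mask)) + 1j * np.random.randn(len(mask))
emphasized = mask * noisy_frame  # spectrum with inter-harmonic noise attenuated
```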

Abstract

Objective: Multi-channel speech enhancement (MCSE) focuses on recovering clean speech signals by leveraging both the spatial relationships and the temporal-spectral information captured by microphone arrays. MCSE plays a crucial role in applications such as teleconferencing, human-computer interaction, and distant speech recognition, where enhancing speech quality and intelligibility is critical. However, existing approaches often struggle in challenging acoustic environments with low signal-to-noise ratios (SNRs) and high reverberation. While these methods excel at noise reduction and distortion minimization, they often fail to preserve the spatial and acoustic structure of speech in adverse environments. Specifically, methods that rely on networks to implicitly learn joint spectral and spatial information tend to struggle with accurately predicting beamforming weights, which are crucial for noise suppression. This challenge frequently results in speech distortion, such as loss of harmonic structures and decreased speech intelligibility. Methods: To address these issues, this study proposes a two-stage MCSE framework that explicitly incorporates the harmonic structure of speech. The first stage integrates a speech harmonic information extraction module into a UNet-based beamforming network; this module is designed to capture and retain the harmonic structure, which is fundamental to human auditory perception and speech clarity. The second stage introduces a residual iterative correction module to refine the speech signal; this stage focuses on finer acoustic structures that were not fully processed during the initial stage. By progressively reducing residual distortions, this approach improves speech quality even in environments with low SNRs and significant reverberation, conditions under which traditional methods typically fall short. Results: This study evaluated the proposed framework on a dataset derived from LibriSpeech, simulating challenging acoustic environments with varying SNR levels and reverberation conditions. The results demonstrated significant improvements over existing MCSE techniques in noise suppression and speech distortion reduction. Specifically, in low-SNR and highly reverberant environments, the beamforming network with harmonic structure information preserved the spatial and spectral characteristics of the speech signal better than traditional methods. Compared with approaches relying solely on implicitly learned spectral-spatial information, the proposed model was more effective at retaining speech intelligibility and clarity. The inclusion of harmonic structure information allowed the proposed framework to better differentiate between speech and noise, producing more robust and reliable enhancement results under adverse conditions. The residual iterative correction stage further improved model performance by refining unprocessed acoustic structures, thus reducing residual distortions and enhancing spectral richness in the enhanced speech signal. Conclusions: The proposed two-stage MCSE framework addresses the limitations of traditional MCSE methods by explicitly modeling and preserving the harmonic structure of speech. By integrating this information, the framework enhances spatial-spectral processing, enabling superior noise suppression and restoration of speech clarity, particularly in complex acoustic environments.
The findings highlight the significance of harmonic structures in human auditory perception and show how their preservation contributes to improved intelligibility. The residual iterative correction stage ensures that finer acoustic details are addressed, allowing the model to perform reliably in low-SNR, highly reverberant scenarios where conventional approaches often fail. This research underscores the importance of incorporating explicit acoustic structure information into MCSE systems, paving the way for more advanced speech enhancement models that can reliably address acoustic challenges in real-world conditions. The results of this paper can serve as a reference for multi-channel speech enhancement schemes in extreme acoustic scenarios.
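As a reading aid for the data flow described above, the following PyTorch skeleton sketches a two-stage arrangement: a harmonic-aware front end feeding a beamforming estimate, followed by iterative residual correction. The class TwoStageMCSE, the plain convolutions standing in for the UNet beamformer and the correction module, and all tensor shapes are assumptions for illustration, not the authors' published architecture.

```python
import torch
import torch.nn as nn

class TwoStageMCSE(nn.Module):
    """Minimal sketch of the two-stage pipeline described in the abstract.

    All module choices and shapes here are illustrative assumptions.
    """
    def __init__(self, n_mics=4, hidden=64, n_iters=3):
        super().__init__()
        # Stage 1a: extract harmonic cues from the multi-channel STFT.
        self.harmonic_net = nn.Conv2d(2 * n_mics, hidden, kernel_size=3, padding=1)
        # Stage 1b: produce a beamformed estimate (stands in for the UNet).
        self.beam_net = nn.Conv2d(hidden, 2, kernel_size=3, padding=1)
        # Stage 2: residual iterative correction of the beamformed estimate.
        self.corrector = nn.Conv2d(2, 2, kernel_size=3, padding=1)
        self.n_iters = n_iters

    def forward(self, x):
        # x: (batch, 2*n_mics, freq, time) -- real/imag STFTs of each microphone
        h = torch.relu(self.harmonic_net(x))  # harmonic-aware features
        s = self.beam_net(h)                  # beamformed estimate (real/imag)
        for _ in range(self.n_iters):         # iteratively refine residual detail
            s = s + self.corrector(s)
        return s

# Toy 4-microphone input: batch of 1, 257 frequency bins, 100 frames
est = TwoStageMCSE()(torch.randn(1, 8, 257, 100))
```

In the actual system, the convolutional stand-ins would be replaced by the UNet beamformer and the learned residual correction module, with stage 2 refining the acoustic structures that stage 1 left under-processed.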

Keywords

multi-channel speech enhancement / beamforming / spatial filtering


Cite this article

Alip NURALI, Rui CAO, Jin LI, et al. Multi-channel speech enhancement network based on speech harmonic structures[J]. Journal of Tsinghua University (Science and Technology), 2025, 65(7): 1328-1335. https://doi.org/10.16511/j.cnki.qhdxxb.2025.26.036
CLC number: TP393.1


Copyright

All rights reserved. Reproduction without authorization is prohibited.