Multi-channel speech enhancement network based on speech harmonic structures

Alip NURALI, Rui CAO, Jin LI, Meng GE, Tianrui WANG, Haoyu WANG, Zhongjian CUI, Longbiao WANG, Jianwu DANG

Journal of Tsinghua University(Science and Technology), 2025, Vol. 65, Issue (7): 1328-1335. DOI: 10.16511/j.cnki.qhdxxb.2025.26.036

Man-machine Speech Communication


Abstract

Objective: Multi-channel speech enhancement (MCSE) focuses on recovering clean speech signals by leveraging both the spatial relationships and the temporal-spectral information captured by microphone arrays. MCSE plays a crucial role in applications such as teleconferencing, human-computer interaction, and distant speech recognition, where enhancing speech quality and intelligibility is critical. However, existing approaches often struggle in challenging acoustic environments with low signal-to-noise ratios (SNRs) and high reverberation. Although these methods excel at noise reduction and distortion minimization, they often fail to preserve the spatial and acoustic structure of speech in adverse environments. In particular, methods that rely on networks to implicitly learn joint spectral and spatial information tend to predict beamforming weights inaccurately, even though these weights are crucial for noise suppression. This shortcoming frequently results in speech distortion, such as the loss of harmonic structures and decreased speech intelligibility.

Methods: To address these issues, this study proposes a two-stage MCSE framework that explicitly incorporates the harmonic structure of speech. The first stage integrates a speech harmonic information extraction module into a UNet-based beamforming network; this module is designed to capture and retain the harmonic structure, which is fundamental to human auditory perception and speech clarity (a toy sketch of such a comb-like harmonic weighting follows the abstract). The second stage introduces a residual iterative correction module to refine the speech signal; this stage focuses on finer acoustic structures that were not fully processed during the first stage (an illustrative correction loop is also sketched below). By progressively reducing residual distortions, this approach improves speech quality even in environments with low SNRs and significant reverberation, conditions under which traditional methods typically fall short.

Results: This study evaluated the proposed framework on a dataset derived from LibriSpeech, simulating challenging acoustic environments with varying SNR levels and reverberation conditions (a minimal simulation recipe is sketched below). The results demonstrated significant improvements over existing MCSE techniques in noise suppression and speech distortion reduction. Specifically, in low-SNR and highly reverberant environments, the beamforming network with harmonic structure information preserved the spatial and spectral characteristics of the speech signal better than traditional methods. Compared with approaches relying solely on implicitly learned spectral-spatial information, the proposed model was more effective at retaining speech intelligibility and clarity. The inclusion of harmonic structure information allowed the proposed framework to differentiate speech from noise more accurately, producing more robust and reliable enhancement results under adverse conditions. The residual iterative correction stage further improved performance by refining unprocessed acoustic structures, thus reducing residual distortions and enriching the spectral content of the enhanced speech.

Conclusions: The proposed two-stage MCSE framework addresses the limitations of traditional MCSE methods by explicitly modeling and preserving the harmonic structure of speech. By integrating this information, the framework enhances spatial-spectral processing, enabling superior noise suppression and restoration of speech clarity, particularly in complex acoustic environments. The findings highlight the significance of harmonic structures in human auditory perception and show how their preservation contributes to improved intelligibility. The residual iterative correction stage ensures that finer acoustic details are addressed, allowing the model to perform reliably in low-SNR and high-reverberation scenarios where conventional approaches often fail. This research underscores the importance of incorporating explicit acoustic structure information into MCSE systems, paving the way for more advanced speech enhancement models that can reliably handle acoustic challenges in real-world settings. The results of this study can serve as a reference for multi-channel speech enhancement schemes in extreme acoustic scenarios.
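To make the harmonic-structure idea concrete, the sketch below builds a comb-like spectral weighting that emphasizes frequency bins near integer multiples of an estimated fundamental frequency. This is only an illustration of the concept under stated assumptions: the function name `harmonic_mask`, the Gaussian-lobe shape, and the parameter values are hypothetical choices for demonstration, not the paper's learned extraction module.

```python
import numpy as np

def harmonic_mask(f0_hz, n_fft=512, fs=16000, width_hz=40.0):
    """Toy comb-like weighting that emphasizes STFT bins near integer
    multiples of the fundamental frequency f0. Gaussian lobes of width
    `width_hz` are placed at each harmonic up to the Nyquist frequency."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)                # bin center frequencies
    harmonics = np.arange(1, int((fs / 2) // f0_hz) + 1) * f0_hz
    # Distance from every bin to its nearest harmonic.
    dist = np.abs(freqs[:, None] - harmonics[None, :]).min(axis=1)
    return np.exp(-0.5 * (dist / width_hz) ** 2)              # values in (0, 1]

mask = harmonic_mask(f0_hz=180.0)  # e.g., a typical voiced-frame pitch
```

In a learned system the weighting would be predicted per frame from the signal itself; the fixed Gaussian comb here only shows why harmonic bins can be emphasized relative to inter-harmonic noise.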
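The residual iterative correction stage can be pictured as a refinement loop that repeatedly predicts a residual correction from the noisy input and the current estimate, then adds it back. The sketch below, including the `iterative_correction` helper and the dummy corrector, is a hypothetical stand-in for the paper's trained correction network, shown only to illustrate the control flow.

```python
import numpy as np

def iterative_correction(noisy_mag, init_est, corrector, n_iters=3):
    """Illustrative residual refinement: each pass predicts a residual
    from the noisy input and the current estimate, then adds it back."""
    est = init_est
    for _ in range(n_iters):
        est = est + corrector(noisy_mag, est)
    return est

def dummy_corrector(noisy_mag, est):
    # Stand-in for a trained network: nudges the estimate toward the noisy
    # magnitude. A real corrector would be trained to target clean speech.
    return 0.1 * (noisy_mag - est)

noisy = np.abs(np.random.randn(257))   # fake magnitude spectrum
first_pass = 0.8 * noisy               # fake stage-1 (beamformer) output
refined = iterative_correction(noisy, first_pass, dummy_corrector)
```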
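Evaluation data of the kind described above (reverberant multi-channel mixtures at controlled SNRs) can be reproduced in spirit with an image-method room simulator. Below is a minimal recipe, assuming the pyroomacoustics package and a synthetic stand-in signal; the room geometry, array layout, absorption value, and SNR are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
import pyroomacoustics as pra  # image-method room simulator (assumed installed)

fs = 16000
t = np.arange(fs * 2) / fs
# Synthetic "speech": a 180 Hz harmonic complex standing in for an utterance.
speech = sum(np.sin(2 * np.pi * 180 * k * t) / k for k in range(1, 6))

# Shoebox room via the image method; absorption and reflection order
# control the amount of reverberation.
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                   materials=pra.Material(energy_absorption=0.3), max_order=17)
room.add_source([2.0, 3.5, 1.5], signal=speech)

# 4-microphone linear array with 5 cm spacing; positions are (3, M).
mic_xyz = np.array([[3.0 + 0.05 * m, 1.5, 1.2] for m in range(4)]).T
room.add_microphone_array(pra.MicrophoneArray(mic_xyz, fs))
room.simulate()
clean_mc = room.mic_array.signals                  # (M, N) reverberant speech

# Mix in white noise at a target SNR (e.g., 0 dB averaged over channels).
snr_db = 0.0
noise = np.random.randn(*clean_mc.shape)
scale = np.sqrt((clean_mc ** 2).mean()
                / ((noise ** 2).mean() * 10 ** (snr_db / 10)))
noisy_mc = clean_mc + scale * noise
```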
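This abstract does not list the exact evaluation metrics, but improvements in noise suppression and distortion are commonly quantified with measures such as the scale-invariant signal-to-distortion ratio (SI-SDR). The helper below is a standard formulation of that metric, included for reference rather than as the paper's protocol.

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    ref = ref - ref.mean()
    est = est - est.mean()
    target = (est @ ref) / (ref @ ref + eps) * ref   # projection onto reference
    noise = est - target
    return 10 * np.log10((target @ target) / (noise @ noise + eps))
```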

Key words

multi-channel speech enhancement / beamforming / spatial filtering

Cite this article

Alip NURALI, Rui CAO, Jin LI, et al. Multi-channel speech enhancement network based on speech harmonic structures[J]. Journal of Tsinghua University(Science and Technology), 2025, 65(7): 1328-1335. https://doi.org/10.16511/j.cnki.qhdxxb.2025.26.036


RIGHTS & PERMISSIONS

All rights reserved. Unauthorized reproduction is prohibited.