e-Article

Cascaded Speech Separation Denoising and Dereverberation Using Attention and TCN-WPE Networks for Speech Devices
Document Type
Periodical
Source
IEEE Internet of Things Journal (IEEE Internet Things J.). 11(10):18047-18058 May, 2024
Subject
Computing and Processing
Communication, Networking and Broadcast Technologies
Noise reduction
Reverberation
Speech recognition
Time-domain analysis
Microphones
Training
Time-frequency analysis
Dereverberation
sound source separation and denoising
TCN-WPE
two-step network speech enhancement
Language
ISSN
2327-4662
2372-2541
Abstract
In real indoor acoustic environments, extracting the low-noise, low-reverberation speech signal of a particular speaker from mixed audio signals is crucial for back-end speech recognition, speech emotion perception, voiceprint recognition, and other artificial intelligence systems that can be used for IoT connectivity. This article proposes a method that addresses speaker source separation in the presence of both noise and reverberation by using two-step networks for separation, denoising, and dereverberation. To ensure a high-quality input signal for the dereverberation stage, several separation networks, such as Sepformer, were trained with different signal types as training targets; they showed good convergence on the training set, a clear separation effect on the validation set, and good generalization. An improved dereverberation method based on a temporal convolutional network (TCN-WPE) is proposed. It applies several improvement strategies: using the scale-invariant signal-to-distortion ratio (SISDR) as the network loss function, adopting a transposition mechanism for the input signal, and adding a residual mechanism to the network unit. Together these significantly improve dereverberation compared with traditional WPE and DNN-WPE. The Sepformer with the best separation and denoising performance was cascaded with the TCN-transposed-residual network. Experiments confirmed that the proposed method achieves high-quality speaker speech separation and enhancement with a limited corpus, which makes it suitable for IoT-oriented applications such as automatic speech recognition systems.
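As an illustration of the SISDR loss mentioned in the abstract, the following is a minimal sketch of the commonly used definition of the scale-invariant signal-to-distortion ratio, written in plain Python. The function names `si_sdr` and `si_sdr_loss`, the mean removal step, and the `eps` stabilizer are assumptions made for this example; the abstract does not describe the authors' implementation, which may differ in these details.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")

    # Remove the mean of each signal (a common convention, assumed here).
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to find the best scaling factor.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    alpha = dot / (target_energy + eps)

    # Split the estimate into a scaled-target part and a residual part.
    scaled_target = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled_target)]

    signal_energy = sum(s * s for s in scaled_target)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_loss(estimate, target):
    """Negative SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, target)


if __name__ == "__main__":
    reference = [1.0, -1.0, 2.0, -2.0]
    noisy = [1.1, -0.9, 1.8, -2.1]
    print("SI-SDR (dB):", si_sdr(noisy, reference))
    print("loss:", si_sdr_loss(noisy, reference))
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the property that distinguishes this measure from a plain signal-to-distortion ratio. In network training the same computation would normally be carried out on batched tensors in a deep learning framework.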