Academic Paper

U-NET: A Supervised Approach for Monaural Source Separation
Document Type
Original Paper
Source
Arabian Journal for Science and Engineering, pp. 1-13
Subject
Speech separation
Source separation
Short-time Fourier transform (STFT)
U-NET
Language
English
ISSN
2193-567X
2191-4281
Abstract
Speech separation is a challenging research area, especially when the desired source must be recovered from a mixture. Deep learning has emerged as a promising solution, surpassing traditional methods. While prior research has focused mainly on the magnitude, the log-magnitude, or a combination of the magnitude and phase, a new approach is proposed that uses the Short-time Fourier Transform (STFT) together with a deep convolutional neural network named U-NET. Unlike other methods, it considers both the real and imaginary components for decomposition. During training, the mixed time-domain signal is transformed into a frequency-domain signal via the STFT, producing a mixed complex spectrogram. The spectrogram's real and imaginary parts are then separated and concatenated into a single matrix, which is fed through U-NET to extract the source components. The same process is repeated at testing: the concatenated matrix for the mixed test signal is passed through the saved model to generate an enhanced concatenated matrix for each source. These matrices are then transformed back into time-domain signals by extracting the magnitude and phase and applying the inverse STFT. The proposed approach was evaluated on the GRID audiovisual corpus, and objective measurement metrics show improved quality and intelligibility compared with existing methods.
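The pre- and post-processing described in the abstract can be sketched as follows. This is a minimal illustration using NumPy and SciPy, not the authors' code: all names, the STFT window length, and the identity stand-in for the U-NET are assumptions made for the example.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
mixture = rng.standard_normal(fs)  # placeholder 1-second mixed signal

# Forward STFT: mixed time-domain signal -> mixed complex spectrogram
_, _, spec = stft(mixture, fs=fs, nperseg=512)

# Split the real and imaginary parts and stack them into one matrix,
# which is what would be fed to the U-NET.
net_input = np.concatenate([spec.real, spec.imag], axis=0)

# The trained U-NET would map net_input to one enhanced matrix per source;
# here it is replaced by the identity so the pipeline round-trips.
net_output = net_input

# Recombine into a complex spectrogram and invert with the inverse STFT
n_freq = spec.shape[0]
est_spec = net_output[:n_freq] + 1j * net_output[n_freq:]
_, estimate = istft(est_spec, fs=fs, nperseg=512)
```

With the identity in place of the network, the inverse STFT recovers the original mixture (up to trailing zero-padding), which is a quick sanity check that the real/imaginary split and recombination are consistent.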