Academic Paper

Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion

Document Type

Working Paper

Author

Chien, Yung-Lun; Chen, Hsin-Hao; Yen, Ming-Chi; Tsai, Shu-Wei; Wang, Hsin-Min; Tsao, Yu; Chi, Tai-Shih

Source

Subject

Computer Science - Sound
Electrical Engineering and Systems Science - Audio and Speech Processing

Language

Abstract

The electrolarynx is a commonly used assistive device that helps patients whose vocal cords have been removed regain the ability to speak. Although the electrolarynx can generate excitation signals like the vocal cords, the naturalness and intelligibility of electrolaryngeal (EL) speech are very different from those of natural (NL) speech. Many deep-learning-based models have been applied to electrolaryngeal speech voice conversion (ELVC) to convert EL speech to NL speech. In this study, we propose a multimodal voice conversion (VC) model that integrates acoustic and visual information into a unified network. We compared different pre-trained models as visual feature extractors and evaluated the effectiveness of these features in the ELVC task. The experimental results demonstrate that the proposed multimodal VC model outperforms single-modal models in both objective and subjective metrics, suggesting that the integration of visual information can significantly improve the quality of ELVC.
Comment: Accepted to INTERSPEECH 2023

Online Access

Open Access (arXiv)
