Academic Paper

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Document Type

Working Paper

Author

Zhou, Kun; Zhao, Shengkui; Ma, Yukun; Zhang, Chong; Wang, Hao; Ng, Dianwen; Ni, Chongjia; Hieu, Nguyen Trung; Yip, Jia Qi; Ma, Bin

Source

Subject

Electrical Engineering and Systems Science - Audio and Speech Processing
Computer Science - Computation and Language
Computer Science - Sound

Language

Abstract

Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codecs that contain fine-grained acoustic details. The TTS model focuses solely on linguistic modeling during autoregressive training, thereby reducing the error propagation that occurs in non-autoregressive training. Both objective and subjective evaluations validate the effectiveness of our proposed method.
Comment: Accepted by Interspeech 2024

Online Access

Open Access (arXiv)
