학술논문

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Document Type

Working Paper

Author

Ju, Zeqian; Wang, Yuancheng; Shen, Kai; Tan, Xu; Xin, Detai; Yang, Dongchao; Liu, Yanqing; Leng, Yichong; Song, Kaitao; Tang, Siliang; Wu, Zhizheng; Qin, Tao; Li, Xiang-Yang; Ye, Wei; Zhang, Shikun; Bian, Jiang; He, Lei; Li, Jinyu; Zhao, Sheng

Source

Subject

Electrical Engineering and Systems Science - Audio and Speech Processing
Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Machine Learning
Computer Science - Sound

Language

Abstract

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility, and achieves on-par quality with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
Comment: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way

Online Access

Open Access (Arxiv) Find it@PNU

이메일

부산대학교 도서관

Online Access

메일 발송