Academic Paper

Improving Speech Prosody of Audiobook Text-To-Speech Synthesis with Acoustic and Textual Contexts

Document Type

Conference

Author

Xin, Detai; Adavanne, Sharath; Ang, Federico; Kulkarni, Ashish; Takamichi, Shinnosuke; Saruwatari, Hiroshi

Source

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, Jun 2023

Subject

Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Aggregates
Predictive models
Acoustics
Speech synthesis
Feeds
Signal resolution
Context modeling
text-to-speech synthesis
TTS
audiobook
speech prosody
context modeling

Language

ISSN

2379-190X

Abstract

We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information, namely preceding acoustic context and bilateral textual context, to improve the prosody of synthetic speech. Previous work uses either unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feeds it to the TTS model, which enables the model to predict context-dependent prosody. We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset. Experimental results demonstrate that the proposed method significantly outperforms two previous works. Additionally, we present insights about the different choices of context (modalities, lateral information, and length) for audiobook TTS that have never been discussed in the literature before.
