Academic Paper

Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

Document Type

Working Paper

Author

Abbas, Ammar; Bollepalli, Bajibabu; Moinet, Alexis; Joly, Arnaud; Karanasou, Penny; Makarov, Peter; Slangens, Simon; Karlapati, Sri; Drugman, Thomas

Source

Subject

Electrical Engineering and Systems Science - Audio and Speech Processing
Computer Science - Machine Learning
Computer Science - Sound

Language

Abstract

We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with improved coarse- and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism in which the system first predicts coarser-scale mel-spectrograms that capture the suprasegmental information in speech, and then uses these coarser-scale mel-spectrograms to predict finer-scale mel-spectrograms that capture fine-grained prosody. We present details for two specific versions of MSS, called Word-level MSS and Sentence-level MSS, in which the scales are motivated by linguistic units. Word-level MSS models word-, phoneme-, and frame-level spectrograms, while Sentence-level MSS additionally models a sentence-level spectrogram. Subjective evaluations show that Word-level MSS performs statistically significantly better than the baseline on two voices.
Comment: Accepted for the 11th ISCA Speech Synthesis Workshop (SSW11)
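
The scale hierarchy described in the abstract can be illustrated with a minimal sketch. It only shows how coarser-scale spectrograms could be derived from a frame-level mel-spectrogram, given segment boundaries. It does not implement the prediction model. Averaging frames within each unit is an assumption made for illustration, since the abstract does not state how the coarser scales are computed. The names `pool_by_segments`, `build_scales`, `phoneme_frames`, and `word_phonemes` are hypothetical.

```python
from typing import Dict, List, Sequence

Frame = List[float]  # one mel-spectrogram frame (one value per mel bin)


def pool_by_segments(frames: Sequence[Frame], lengths: Sequence[int]) -> List[Frame]:
    """Average consecutive frames inside each segment.

    Assumption: mean pooling stands in for whatever aggregation the
    authors actually use; the abstract does not specify it.
    """
    if any(n <= 0 for n in lengths):
        raise ValueError("every segment must contain at least one frame")
    if sum(lengths) != len(frames):
        raise ValueError("segment lengths must cover all frames exactly")
    pooled: List[Frame] = []
    start = 0
    for n in lengths:
        segment = frames[start:start + n]
        n_bins = len(segment[0])
        pooled.append([sum(f[b] for f in segment) / n for b in range(n_bins)])
        start += n
    return pooled


def build_scales(
    frames: Sequence[Frame],
    phoneme_frames: Sequence[int],  # number of frames in each phoneme
    word_phonemes: Sequence[int],   # number of phonemes in each word
) -> Dict[str, List[Frame]]:
    """Return sentence-, word-, phoneme-, and frame-level spectrograms."""
    if sum(word_phonemes) != len(phoneme_frames):
        raise ValueError("word_phonemes must cover all phonemes exactly")
    # Convert phonemes-per-word into frames-per-word.
    word_frames: List[int] = []
    i = 0
    for k in word_phonemes:
        word_frames.append(sum(phoneme_frames[i:i + k]))
        i += k
    return {
        "sentence": pool_by_segments(frames, [len(frames)]),
        "word": pool_by_segments(frames, word_frames),
        "phoneme": pool_by_segments(frames, phoneme_frames),
        "frame": [list(f) for f in frames],
    }


if __name__ == "__main__":
    # Toy input: 4 frames with 2 mel bins, 2 phonemes forming 1 word.
    toy = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
    scales = build_scales(toy, phoneme_frames=[1, 3], word_phonemes=[2])
    for name, spec in scales.items():
        print(name, spec)
```

In this reading, Word-level MSS would use the word, phoneme, and frame entries, and Sentence-level MSS would add the sentence entry as the coarsest scale.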

Online Access

Open Access (arXiv)
