Academic Paper

SQuId: Measuring Speech Naturalness in Many Languages

Document Type

Conference

Author

Sellam, Thibault; Bapna, Ankur; Camp, Joshua; Mackinnon, Diana; Parikh, Ankur P.; Riesa, Jason

Source

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, Jun. 2023

Subject

Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Training
Costs
Signal processing
Predictive models
Acoustics
Speech synthesis
Task analysis

Language

ISSN

2379-190X

Abstract

Much of text-to-speech research relies on human evaluation. This incurs heavy costs and slows down the development process, especially in heavily multilingual applications where recruiting and polling annotators can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales—the largest effort of this type to date. The main insight is that training one model on many locales consistently surpasses mono-locale baselines. We show that the model outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, for which there is no fine-tuning data. We highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we present the effect of model size and pre-training diversity with ablation experiments.

Online Access

Full Text (IEEE) Find it@PNU
