Academic Paper

COCO-NUT: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-Based Control

Document Type

Conference

Author

Watanabe, Aya; Takamichi, Shinnosuke; Saito, Yuki; Nakata, Wataru; Xin, Detai; Saruwatari, Hiroshi

Source

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1-8, Dec. 2023

Subject

Signal Processing and Analysis
Crowdsourcing
Quality assurance
Annotations
Conferences
Manuals
Benchmark testing
Speech synthesis
speech dataset
voice characteristics
text prompt
crowdsourcing

Language

Abstract

In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive and complicated control of voice characteristics. A sufficiently large corpus of high-quality and diverse voice samples with corresponding free-form descriptions can advance such control research. However, neither an open corpus nor a scalable method is currently available. To this end, we develop Coco-Nut, a new corpus including diverse Japanese utterances, along with text transcriptions and free-form voice characteristics descriptions. Our methodology to construct this corpus consists of 1) automatic collection of voice-related audio data from the Internet, 2) quality assurance, and 3) manual annotation using crowdsourcing. Additionally, we benchmark our corpus on the prompt embedding model trained by contrastive speech-text learning.
