Academic Article

Fast Neural Speech Waveform Generative Models With Fully-Connected Layer-Based Upsampling
Document Type
Periodical
Source
IEEE Access, 12:31409-31421, 2024
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Vocoders
Computational modeling
Training
Acoustics
Redundancy
Text detection
Speech recognition
Generative adversarial networks
Text-to-speech
Audio-visual systems
Assistive technologies
Digital TV
End-to-end text-to-speech
fully-connected layer-based upsampling
iSTFTNet
multi-stream HiFi-GAN
neural vocoder
Language
ISSN
2169-3536
Abstract
Although end-to-end (E2E) text-to-speech (TTS) models with a HiFi-GAN-based neural vocoder (e.g., VITS and JETS) can achieve human-like speech quality with fast inference, there is still room to improve CPU inference speed for practical implementations because the HiFi-GAN-based neural vocoder unit is the bottleneck. Additionally, HiFi-GAN is widely used not only for TTS but also for many other speech and audio applications. To accelerate HiFi-GAN while maintaining synthesis quality, multi-stream (MS)-HiFi-GAN, iSTFTNet, and MS-iSTFT-HiFi-GAN have been proposed. Although inverse short-time Fourier transform (iSTFT)-based fast upsampling is introduced in iSTFTNet and MS-iSTFT-HiFi-GAN, we first find that the predicted intermediate features input to the iSTFT layer are completely different from the original STFT spectra because of the redundancy of the overlap-add operation in the iSTFT. To further improve synthesis quality and inference speed, we propose FC-HiFi-GAN and MS-FC-HiFi-GAN, which replace the iSTFT layer with trainable fully-connected (FC) layer-based fast upsampling that requires no overlap-add operation. Experimental results under unseen-speaker synthesis and E2E TTS conditions show that the proposed methods slightly accelerate inference and significantly improve synthesis quality in JETS-based E2E TTS compared with iSTFTNet and MS-iSTFT-HiFi-GAN. Therefore, the iSTFT layer in HiFi-GAN-based neural vocoders can be replaced by the proposed trainable FC layer-based upsampling without an overlap-add operation.
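The core idea in the abstract — mapping each frame of intermediate features directly to waveform samples with a trainable fully-connected layer and concatenating frames without overlap-add — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: all names (`features`, `W`, `b`, `hop_length`) are hypothetical, the FC weights are random here but would be learned in practice, and the frame size is chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: C channels of generator output per frame,
# hop_length waveform samples produced per frame.
C, hop_length, n_frames = 64, 256, 10
features = rng.standard_normal((n_frames, C))  # intermediate features from the generator

# Trainable FC (linear) layer parameters (random stand-ins for learned weights).
W = rng.standard_normal((C, hop_length)) * 0.01
b = np.zeros(hop_length)

# Each frame is projected to hop_length samples; frames are simply
# concatenated, so no overlap-add (and none of its redundancy) is involved.
frames = features @ W + b          # shape: (n_frames, hop_length)
waveform = frames.reshape(-1)      # shape: (n_frames * hop_length,)

print(waveform.shape)  # (2560,)
```

By contrast, an iSTFT-based upsampler would window each frame and sum overlapping segments, which is the redundancy the abstract identifies; the FC projection sidesteps that step entirely.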