Academic Paper

VoiceGrad: Non-Parallel Any-to-Many Voice Conversion With Annealed Langevin Dynamics
Document Type
Periodical
Source
IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM Trans. Audio Speech Lang. Process.), 32:2213-2226, 2024
Subject
Signal Processing and Analysis
Computing and Processing
Communication, Networking and Broadcast Technologies
General Topics for Engineers
Decoding
Training
Data models
Generators
Diffusion processes
Jacobian matrices
Generative adversarial networks
Voice conversion (VC)
non-parallel VC
any-to-many VC
score matching
Langevin dynamics
diffusion models
Language
ISSN
2329-9290
2329-9304
Abstract
In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced waveform generation method, VoiceGrad is based on the concepts of score matching, Langevin dynamics, and diffusion models. The idea is to train a score approximator, a fully convolutional network with a U-Net structure, to predict the gradient of the log density of the speech feature sequences of multiple speakers. The trained score approximator can then perform VC by using annealed Langevin dynamics or a reverse diffusion process to iteratively update an input feature sequence towards the nearest stationary point of the target distribution. By construction, VoiceGrad enables any-to-many VC, a scenario in which the speaker of the input speech can be arbitrary, and allows non-parallel training, which requires no parallel utterances.
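The iterative update described in the abstract can be illustrated with a minimal sketch of annealed Langevin dynamics. This is not the VoiceGrad implementation: the trained U-Net score approximator is replaced by a hypothetical `toy_score` function (the exact score of a one-dimensional Gaussian smoothed by noise of level `sigma`), and the step-size schedule `alpha = eps * (sigma / sigmas[-1])**2` is an assumption taken from the commonly used annealed Langevin formulation, not from this record. All names and parameter values (`annealed_langevin`, `eps`, `steps_per_level`, the target mean and spread) are illustrative.

```python
import numpy as np


def toy_score(x, sigma, mean=2.0, data_std=0.5):
    """Stand-in for a trained score approximator (assumption, not the U-Net).

    Returns the gradient of the log density of N(mean, data_std^2 + sigma^2),
    i.e. a toy "target distribution" perturbed by noise of level sigma.
    """
    return -(x - mean) / (data_std**2 + sigma**2)


def annealed_langevin(x0, score_fn, sigmas, steps_per_level=50, eps=2e-5, seed=0):
    """Move x0 towards the target distribution using a decreasing noise schedule."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for sigma in sigmas:
        # Larger steps at high noise levels, smaller steps as sigma shrinks.
        alpha = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            # Langevin update: gradient ascent on the log density plus noise.
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x


# Geometric noise schedule from coarse (1.0) to fine (0.01).
sigmas = np.geomspace(1.0, 0.01, 10)

# "Input features": 1000 scalars far from the target distribution.
x_in = np.full(1000, -3.0)
x_out = annealed_langevin(x_in, toy_score, sigmas)

print("mean before:", x_in.mean(), "mean after:", x_out.mean())
```

In this toy setting the input values start at -3.0 and drift towards the region around the target mean of 2.0. In VC the same loop would run over a speech feature sequence, with the score supplied by the trained network conditioned on the target speaker.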