Academic Paper

Wav2Seq: Pre-Training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
Document Type
Conference
Source
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, Jun. 2023
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Training
Text recognition
Self-supervised learning
Signal processing
Data models
Acoustics
Task analysis
Self-supervision
Encoder-decoder
Speech recognition
Spoken named entity recognition
Speech translation
Language
English
ISSN
2379-190X
Abstract
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task: transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 8 language pairs for speech-to-text translation, even when competing methods use additional text data for training. On ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and performs comparably to highly optimized recent methods.
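To make the pseudo-language idea concrete, the sketch below shows one way such discrete pseudo subwords could be induced: quantize frame-level self-supervised speech features into unit sequences, collapse consecutive repeats, and learn a subword (BPE) vocabulary over the unit strings. This is a minimal illustration assuming pre-extracted features (e.g., from a HuBERT-style encoder); the cluster count, vocabulary size, and file names are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal sketch of pseudo-language induction, assuming pre-extracted
# frame-level self-supervised features (e.g., from a HuBERT-style encoder).
# Cluster count, vocabulary size, and file names are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans
import sentencepiece as spm

def deduplicate(units: list[int]) -> list[int]:
    """Collapse consecutive repeated units (run-length dedup) to shorten targets."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

def units_to_chars(units: list[int], base: int = 0x4E00) -> str:
    """Map each discrete unit id to a unique character so BPE tooling can consume it."""
    return "".join(chr(base + u) for u in units)

# Stand-in for real features: one (frames x dims) array per utterance.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(200, 16)) for _ in range(20)]

# 1) Quantize frames into discrete units with k-means fit on the whole corpus.
km = KMeans(n_clusters=25, n_init=10, random_state=0).fit(np.concatenate(feats))
corpus = [units_to_chars(deduplicate(km.predict(f).tolist())) for f in feats]

# 2) Train a BPE model over the unit strings to obtain pseudo subwords.
with open("pseudo_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(corpus))
spm.SentencePieceTrainer.train(
    input="pseudo_corpus.txt", model_prefix="pseudo_bpe",
    vocab_size=64, model_type="bpe", character_coverage=1.0)

# 3) Pseudo subword sequences become transcription targets for pre-training.
sp = spm.SentencePieceProcessor(model_file="pseudo_bpe.model")
print(sp.encode(corpus[0], out_type=str))
```

Under this reading of the abstract, an encoder-decoder model would then be pre-trained to transcribe raw audio into these pseudo subword sequences, so that both the encoder and the decoder receive a pre-training signal before fine-tuning on ASR, spoken named entity recognition, or speech translation.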