Academic Paper

Modular Conformer Training for Flexible End-to-End ASR
Document Type
Conference
Source
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, Jun. 2023
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Training
Convolution
Error analysis
Acoustics
Decoding
Speech processing
Standards
Automatic speech recognition
self-attention
submodels
Language
English
ISSN
2379-190X
Abstract
The state-of-the-art Conformer used in automatic speech recognition combines feed-forward, convolution, and multi-headed self-attention layers in a single model that is trained end-to-end with a decoder network. While this end-to-end training is simple and beneficial for word error rate (WER), it restricts the ability to run inference with the model at different operating points of WER and latency. Existing approaches to overcome this limitation include cascaded encoders and variable attention context models. We propose an alternative approach, called Modular Conformer training, which splits the Conformer model into a convolutional backbone model and attention submodels that are added at each layer. We conduct experiments with several training techniques on the LibriSpeech and Libri-Light corpora. We show that dropping out the attention layers during training of the backbone model allows for the largest WER improvements upon adding fine-tuned attention submodels, without impacting the WER of the backbone model itself.
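
To make the layer split described in the abstract concrete, below is a minimal PyTorch sketch of one block with a convolutional backbone and an optional attention submodel that is randomly dropped during backbone training. This is a hedged illustration, not the authors' code: the class name ModularConformerBlock, the parameter p_attn_drop, the pre-norm residual layout, and all layer sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ModularConformerBlock(nn.Module):
    """Hypothetical sketch of one modular Conformer block:
    a backbone (feed-forward + depthwise convolution) that always runs,
    plus an attention submodel that can be skipped at inference time."""

    def __init__(self, dim: int, num_heads: int = 4,
                 kernel_size: int = 15, p_attn_drop: float = 0.5):
        super().__init__()
        # Backbone components (always active).
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
            nn.SiLU(), nn.Linear(4 * dim, dim))
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        # Attention submodel, added on top of the backbone at each layer.
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.p_attn_drop = p_attn_drop  # assumed layer-dropout probability

    def forward(self, x: torch.Tensor, use_attention: bool = True) -> torch.Tensor:
        # x: (batch, time, dim)
        x = x + self.ffn(x)
        y = self.conv_norm(x).transpose(1, 2)      # (batch, dim, time)
        x = x + self.conv(y).transpose(1, 2)
        # Layer dropout during backbone training: randomly skip the
        # attention submodel so the backbone remains usable on its own.
        drop_attn = self.training and torch.rand(1).item() < self.p_attn_drop
        if use_attention and not drop_attn:
            a = self.attn_norm(x)
            x = x + self.attn(a, a, a, need_weights=False)[0]
        return x
```

Under these assumptions, calling `block(x, use_attention=False)` would run the backbone alone for a low-latency operating point, while `block(x)` adds the (fine-tuned) attention submodel for better WER, matching the flexible-inference idea the abstract describes.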