Academic Paper

Understanding Shared Speech-Text Representations
Document Type
Conference
Source
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, Jun. 2023
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Adaptation models
Visualization
Simultaneous localization and mapping
Inspection
Signal processing
Loss measurement
Task analysis
Speech-Text Representation Learning
Text Injection
Language
ISSN
2379-190X
Abstract
Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and speech translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First, we examine the limits of speech-free domain adaptation, finding that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation. Second, we inspect the similarities between activations of unimodal (speech or text) encoders as compared to the activations of a shared encoder. We find that the shared encoder learns a more compact and overlapping speech-text representation than the unimodal encoders. We hypothesize that this partially explains the effectiveness of the Maestro shared speech-text representations.
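The abstract's second analysis compares activations of unimodal encoders against those of a shared encoder. As a rough illustration of how such a comparison can be quantified, the sketch below computes linear centered kernel alignment (CKA) between two activation matrices; note that CKA is an assumption here for illustration, not necessarily the similarity measure used in the paper, and the activation data is synthetic.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape
    (n_examples, n_features); 1.0 means identical up to
    orthogonal transform and isotropic scaling."""
    # Center each feature dimension.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # HSIC-based ratio: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Hypothetical activations for paired utterances: a speech encoding
# and a (synthetically) linearly related text encoding.
rng = np.random.default_rng(0)
speech_acts = rng.normal(size=(128, 64))
text_acts = speech_acts @ rng.normal(size=(64, 64))

print(linear_cka(speech_acts, text_acts))
```

A higher CKA between speech and text activations at a given layer would indicate a more overlapping cross-modal representation, which is the kind of evidence the abstract appeals to when contrasting the shared encoder with the unimodal ones.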