Academic Paper

Understanding Shared Speech-Text Representations
Document Type
Conference
Source
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, Jun. 2023
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Adaptation models
Visualization
Simultaneous localization and mapping
Inspection
Signal processing
Loss measurement
Task analysis
Speech-Text Representation Learning
Text Injection
Language
ISSN
2379-190X
Abstract
Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and speech translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First, we examine the limits of speech-free domain adaptation, finding that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation. Second, we inspect the similarities between activations of unimodal (speech or text) encoders as compared to the activations of a shared encoder. We find that the shared encoder learns a more compact and overlapping speech-text representation than the unimodal encoders. We hypothesize that this partially explains the effectiveness of the Maestro shared speech-text representations.
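The abstract's second analysis compares activations of unimodal encoders against those of a shared encoder. As a rough illustration of how such a comparison can be quantified, the sketch below computes linear centered kernel alignment (CKA) between two activation matrices; note that CKA is an assumption here for illustration, not necessarily the similarity measure used in the paper, and the activation data is synthetic.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape
    (n_examples, n_features); 1.0 means identical up to
    orthogonal transform and isotropic scaling."""
    # Center each feature dimension.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # HSIC-based ratio: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Hypothetical activations for paired utterances: a speech encoding
# and a (synthetically) linearly related text encoding.
rng = np.random.default_rng(0)
speech_acts = rng.normal(size=(128, 64))
text_acts = speech_acts @ rng.normal(size=(64, 64))

print(linear_cka(speech_acts, text_acts))
```

A higher CKA between speech and text activations at a given layer would indicate a more overlapping cross-modal representation, which is the kind of evidence the abstract appeals to when contrasting the shared encoder with the unimodal ones.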