Academic Paper

CMAST: Efficient Speech-Text Joint Training Method to Enhance Linguistic Features Learning of Speech Representations
Document Type
Conference
Source
2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 656-660, Nov. 2024
Subject
Computing and Processing
Signal Processing and Analysis
Training
Representation learning
Limiting
Intent recognition
Training data
Linguistics
Speech enhancement
Data models
Automatic speech recognition
self-supervised learning
speech-text joint training
speech representation linguistic feature learning
Language
English
Abstract
Self-supervised pre-training on the speech modality has exhibited promising performance in diverse domains. However, self-supervised speech representations often fail to sufficiently highlight linguistic content, limiting their effectiveness in situations that rely heavily on linguistic information, such as intent classification and automatic speech recognition. To address this limitation, we introduce an efficient speech-text joint training method, referred to as CMAST, which leverages cross-modal alignment between speech and text pre-trained models. With only a limited amount of paired speech and text data, CMAST significantly enhances the pre-trained model's ability to capture linguistic content. We conduct extensive evaluations on the SUPERB platform to verify the effectiveness of our approach. The results show that CMAST achieves superior performance over previous speech pre-trained models on a series of downstream tasks. Compared with state-of-the-art speech-text joint pre-trained models, CMAST delivers comparable performance while requiring fewer parameters and less training data.
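
Note: the record describes CMAST only at a high level, as cross-modal alignment between pre-trained speech and text models trained on a small amount of paired data; the paper's actual objective is not reproduced here. The PyTorch sketch below shows one common way such an alignment can be set up, using a symmetric InfoNCE-style contrastive loss over paired utterance-level embeddings. Every name in it (alignment_loss, the 768-dimensional embeddings, the temperature value) is an illustrative assumption, not CMAST's published method.

    # Hypothetical sketch of speech-text cross-modal alignment.
    # The loss form and all names are assumptions; the record does not
    # specify CMAST's actual training objective.
    import torch
    import torch.nn.functional as F

    def alignment_loss(speech_emb: torch.Tensor,
                       text_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of paired (speech, text)
        utterance-level embeddings of shape (batch, dim), e.g. mean-pooled
        outputs of pre-trained encoders plus a small projection head."""
        s = F.normalize(speech_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = s @ t.T / temperature              # (batch, batch) similarities
        targets = torch.arange(s.size(0), device=s.device)
        # Matched pairs sit on the diagonal: pull them together,
        # push apart the mismatched pairs within the batch.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

    # Usage with random stand-ins for pooled encoder outputs:
    speech = torch.randn(8, 768, requires_grad=True)
    text = torch.randn(8, 768, requires_grad=True)
    loss = alignment_loss(speech, text)
    loss.backward()

Under this kind of objective, the text encoder supplies the linguistic signal that the speech representations are pulled toward, which is consistent with the abstract's claim that a limited amount of paired data suffices to strengthen the linguistic content of the speech model.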