Academic Paper

Efficient Domain Adaptation for Speech Foundation Models

Document Type

Conference

Author

Li, Bo; Hwang, Dongseong; Huo, Zhouyuan; Bai, Junwen; Prakash, Guru; Sainath, Tara N.; Sim, Khe Chai; Zhang, Yu; Han, Wei; Strohman, Trevor; Beaufays, Francoise

Source

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, Jun. 2023

Subject

Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Training
Adaptation models
Frequency modulation
Video on demand
Soft sensors
Speech recognition
Data models
foundation models
domain adaptation

Language

ISSN

2379-190X

Abstract

Foundation models (FMs), which are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have attracted great interest in the research community. Benefiting from diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and extend the joint training strategy JUST Hydra for finetuning using both source and unsupervised target domain data. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M model trained from scratch on an additional 300M supervised in-domain data.

