Academic Paper
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
Document Type
Working Paper
Author
Seamless Communication; Barrault, Loïc; Chung, Yu-An; Meglioli, Mariano Cora; Dale, David; Dong, Ning; Duquenne, Paul-Ambroise; Elsahar, Hady; Gong, Hongyu; Heffernan, Kevin; Hoffman, John; Klaiber, Christopher; Li, Pengwei; Licht, Daniel; Maillard, Jean; Rakotoarison, Alice; Sadagopan, Kaushik Ram; Wenzek, Guillaume; Ye, Ethan; Akula, Bapi; Chen, Peng-Jen; Hachem, Naji El; Ellis, Brian; Gonzalez, Gabriel Mejia; Haaheim, Justin; Hansanti, Prangthip; Howes, Russ; Huang, Bernie; Hwang, Min-Jae; Inaguma, Hirofumi; Jain, Somya; Kalbassi, Elahe; Kallet, Amanda; Kulikov, Ilia; Lam, Janice; Li, Daniel; Ma, Xutai; Mavlyutov, Ruslan; Peloquin, Benjamin; Ramadan, Mohamed; Ramakrishnan, Abinesh; Sun, Anna; Tran, Kevin; Tran, Tuan; Tufanov, Igor; Vogeti, Vish; Wood, Carleigh; Yang, Yilin; Yu, Bokai; Andrews, Pierre; Balioglu, Can; Costa-jussà, Marta R.; Celebi, Onur; Elbayad, Maha; Gao, Cynthia; Guzmán, Francisco; Kao, Justine; Lee, Ann; Mourachko, Alexandre; Pino, Juan; Popuri, Sravya; Ropers, Christophe; Saleem, Safiyyah; Schwenk, Holger; Tomasello, Paden; Wang, Changhan; Wang, Jeff; Wang, Skyler
Abstract
What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. After filtering this corpus and combining it with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noise and speaker variations in speech-to-text tasks than the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication