Academic Paper
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Document Type
Working Paper
Author
Laurençon, Hugo; Saulnier, Lucile; Wang, Thomas; Akiki, Christopher; del Moral, Albert Villanova; Scao, Teven Le; Von Werra, Leandro; Mou, Chenghao; Ponferrada, Eduardo González; Nguyen, Huu; Frohberg, Jörg; Šaško, Mario; Lhoest, Quentin; McMillan-Major, Angelina; Dupont, Gerard; Biderman, Stella; Rogers, Anna; Allal, Loubna Ben; De Toni, Francesco; Pistilli, Giada; Nguyen, Olivier; Nikpoor, Somaieh; Masoud, Maraim; Colombo, Pierre; de la Rosa, Javier; Villegas, Paulo; Thrush, Tristan; Longpre, Shayne; Nagel, Sebastian; Weber, Leon; Muñoz, Manuel; Zhu, Jian; Van Strien, Daniel; Alyafeai, Zaid; Almubarak, Khalid; Vu, Minh Chien; Gonzalez-Dios, Itziar; Soroa, Aitor; Lo, Kyle; Dey, Manan; Suarez, Pedro Ortiz; Gokaslan, Aaron; Bose, Shamik; Adelani, David; Phan, Long; Tran, Hieu; Yu, Ian; Pai, Suhas; Chim, Jenny; Lepercq, Violette; Ilic, Suzana; Mitchell, Margaret; Luccioni, Sasha Alexandra; Jernite, Yacine
Abstract
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
Comment: NeurIPS 2022, Datasets and Benchmarks Track