Academic Paper

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Document Type

Working Paper

Author

Gehrmann, Sebastian; Bhattacharjee, Abhik; Mahendiran, Abinaya; Wang, Alex; Papangelis, Alexandros; Madaan, Aman; McMillan-Major, Angelina; Shvets, Anna; Upadhyay, Ashish; Yao, Bingsheng; Wilie, Bryan; Bhagavatula, Chandra; You, Chaobin; Thomson, Craig; Garbacea, Cristina; Wang, Dakuo; Deutsch, Daniel; Xiong, Deyi; Jin, Di; Gkatzia, Dimitra; Radev, Dragomir; Clark, Elizabeth; Durmus, Esin; Ladhak, Faisal; Ginter, Filip; Winata, Genta Indra; Strobelt, Hendrik; Hayashi, Hiroaki; Novikova, Jekaterina; Kanerva, Jenna; Chim, Jenny; Zhou, Jiawei; Clive, Jordan; Maynez, Joshua; Sedoc, João; Juraska, Juraj; Dhole, Kaustubh; Chandu, Khyathi Raghavi; Perez-Beltrachini, Laura; Ribeiro, Leonardo F. R.; Tunstall, Lewis; Zhang, Li; Pushkarna, Mahima; Creutz, Mathias; White, Michael; Kale, Mihir Sanjay; Eddine, Moussa Kamal; Daheim, Nico; Subramani, Nishant; Dusek, Ondrej; Liang, Paul Pu; Ammanamanchi, Pawan Sasanka; Zhu, Qi; Puduppully, Ratish; Kriz, Reno; Shahriyar, Rifat; Cardenas, Ronald; Mahamood, Saad; Osei, Salomey; Cahyawijaya, Samuel; Štajner, Sanja; Montella, Sebastien; Shailza; Jolly, Shailza; Mille, Simon; Hasan, Tahmid; Shen, Tianhao; Adewumi, Tosin; Raunak, Vikas; Raheja, Vipul; Nikolaev, Vitaly; Tsai, Vivian; Jernite, Yacine; Xu, Ying; Sang, Yisi; Liu, Yixin; Hou, Yufang

Subject

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Abstract

Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make it easier to follow best practices in model evaluation, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.

Online Access

Open Access (arXiv)
