Academic Paper
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Document Type
Working Paper
Author
Botev, Aleksandar; De, Soham; Smith, Samuel L; Fernando, Anushan; Muraru, George-Cristian; Haroun, Ruba; Berrada, Leonard; Pascanu, Razvan; Sessa, Pier Giuseppe; Dadashi, Robert; Hussenot, Léonard; Ferret, Johan; Girgin, Sertan; Bachem, Olivier; Andreev, Alek; Kenealy, Kathleen; Mesnard, Thomas; Hardin, Cassidy; Bhupatiraju, Surya; Pathak, Shreya; Sifre, Laurent; Rivière, Morgane; Kale, Mihir Sanjay; Love, Juliette; Tafti, Pouya; Joulin, Armand; Fiedel, Noah; Senter, Evan; Chen, Yutian; Srinivasan, Srivatsan; Desjardins, Guillaume; Budden, David; Doucet, Arnaud; Vikram, Sharad; Paszke, Adam; Gale, Trevor; Borgeaud, Sebastian; Chen, Charlie; Brock, Andy; Paterson, Antonia; Brennan, Jenny; Risdal, Meg; Gundluru, Raj; Devanathan, Nesh; Mooney, Paul; Chauhan, Nilay; Culliton, Phil; Martins, Luiz Gustavo; Bandy, Elisa; Huntsperger, David; Cameron, Glenn; Zucker, Arthur; Warkentin, Tris; Peran, Ludovic; Giang, Minh; Ghahramani, Zoubin; Farabet, Clément; Kavukcuoglu, Koray; Hassabis, Demis; Hadsell, Raia; Teh, Yee Whye; de Freitas, Nando
Source
Subject
Language
English
Abstract
We introduce RecurrentGemma, a family of open language models which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction-tuned variants for both. Our models achieve comparable performance to similarly sized Gemma baselines despite being trained on fewer tokens.
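As a rough illustration of the abstract's point that a fixed-size recurrent state keeps memory flat during long-sequence inference, the sketch below implements a per-channel gated linear recurrence in NumPy. This is not the RecurrentGemma or Griffin code; the decay gate, dimensions, and function name are invented for the example, and real Griffin blocks additionally interleave such recurrences with local attention and learned gating.

import numpy as np

def linear_recurrence(x, a):
    # Scan h_t = a * h_{t-1} + (1 - a) * x_t over time.
    # x: (seq_len, dim) input sequence.
    # a: (dim,) per-channel decay in (0, 1); held constant here purely for illustration.
    # The recurrent state h always has shape (dim,), unlike a transformer KV cache,
    # which grows linearly with sequence length.
    h = np.zeros(x.shape[1])          # fixed-size recurrent state
    outputs = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + (1.0 - a) * x[t]  # same state size at every step
        outputs[t] = h
    return outputs

# Usage: processing a long sequence still needs only a (dim,)-sized state.
rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 8))        # seq_len = 4096, dim = 8
a = np.full(8, 0.9)                   # constant per-channel decay (illustrative)
y = linear_recurrence(x, a)
print(y.shape)                        # (4096, 8)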