Academic Paper
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Document Type
Working Paper
Author
Anwar, Usman; Saparov, Abulhair; Rando, Javier; Paleka, Daniel; Turpin, Miles; Hase, Peter; Lubana, Ekdeep Singh; Jenner, Erik; Casper, Stephen; Sourbut, Oliver; Edelman, Benjamin L.; Zhang, Zhaowei; Günther, Mario; Korinek, Anton; Hernandez-Orallo, Jose; Hammond, Lewis; Bigelow, Eric; Pan, Alexander; Langosco, Lauro; Korbak, Tomasz; Zhang, Heidi; Zhong, Ruiqi; Ó hÉigeartaigh, Seán; Recchia, Gabriel; Corsi, Giulio; Chan, Alan; Anderljung, Markus; Edwards, Lilian; Petrov, Aleksandar; de Witt, Christian Schroeder; Motwani, Sumeet Ramesh; Bengio, Yoshua; Chen, Danqi; Torr, Philip H. S.; Albanie, Samuel; Maharaj, Tegan; Foerster, Jakob; Tramer, Florian; He, He; Kasirzadeh, Atoosa; Choi, Yejin; Krueger, David
Abstract
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.