학술논문

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
Document Type
Working Paper
Source
Subject
Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Language
Abstract
Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates decision-making capabilities of LLMs through the lens of Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We introduce GAMA($\gamma$)-Bench, which evaluates LLMs' Gaming Ability in Multi-Agent environments. $\gamma$-Bench includes eight classical multi-agent games and a scoring scheme specially designed to quantitatively assess LLMs' performance. Leveraging $\gamma$-Bench, we investigate LLMs' robustness, generalizability, and strategies for enhancement. Results reveal that while GPT-3.5 shows satisfying robustness, its generalizability is relatively limited. However, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we evaluate twelve versions from six models, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. We find that Gemini-1.5-Pro outperforms other models with a score of $63.8$ out of $100$, followed by LLaMA-3.1-70B and GPT-4 with scores of $60.9$ and $60.5$, respectively. The code and experimental results are made publicly available via https://github.com/CUHK-ARISE/GAMABench.
Comment: 11 pages of main text. 20 pages of appendices. 12 figures, 9 tables. Added models: Gemini-1.5-Pro, LLaMA-3.1-{7, 70, 405}B, Mixtral-8x{7, 22}B, Qwen-2-72B