
AgentBench

paper

Authors

Xiao Liu·Hao Yu·Hanchen Zhang·Yifan Xu·Xuanyu Lei·Hanyu Lai·Yu Gu·Hangliang Ding·Kaiwen Men·Kejuan Yang·Shudan Zhang·Xiang Deng·Aohan Zeng·Zhengxiao Du·Chenhui Zhang·Sheng Shen·Tianjun Zhang·Yu Su·Huan Sun·Minlie Huang·Yuxiao Dong·Jie Tang

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

AgentBench is a multi-dimensional benchmark for evaluating large language models as autonomous agents across 8 interactive environments, directly relevant to AI safety research on agent reliability, reasoning robustness, and decision-making capabilities in complex tasks.

Paper Details

Citations: 652 (43 influential)
Year: 2023
Methodology: peer-reviewed
Categories: Proceedings of the AAAI Conference on Artificial Intelligence

Metadata

arXiv preprint · primary source

Abstract

The potential of Large Language Models (LLMs) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively *evaluate LLMs as agents* on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant performance disparity between them and many OSS competitors that are no larger than 70B. We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents. Improving instruction following and training on high-quality multi-round alignment data could improve agent performance. Moreover, contrary to existing assumptions, training on code has mixed impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

Summary

AgentBench is a comprehensive multi-dimensional benchmark designed to evaluate Large Language Models (LLMs) as autonomous agents across 8 distinct interactive environments. The study evaluates both API-based and open-source LLMs, revealing significant performance gaps between top commercial models and open-source alternatives up to 70B parameters. The research identifies key failure modes—poor long-term reasoning, weak decision-making, and inadequate instruction following—and proposes that improvements in instruction following and high-quality multi-round alignment training could enhance agent performance. Notably, the findings challenge conventional assumptions about code training's universal benefits for agent tasks.
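
The multi-turn, open-ended setting the benchmark evaluates can be pictured as a simple agent-environment loop: the harness sends an observation, the model replies with an action, and the environment scores the episode. The sketch below is illustrative only, not AgentBench's actual API; the `Environment` class, `query_llm` stub, and reward logic are hypothetical stand-ins for the released evaluation package.

```python
# Minimal sketch of a multi-turn LLM-as-Agent evaluation loop of the kind
# benchmarks like AgentBench formalize. All names here are hypothetical
# stand-ins, not the actual AgentBench API.
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy interactive task: the agent succeeds by emitting 'submit'
    within a fixed budget of turns."""
    max_turns: int = 10
    history: list = field(default_factory=list)

    def observe(self) -> str:
        # A real environment would render task state (OS shell, DB, web page, ...).
        return f"Turn {len(self.history) + 1}: task state unchanged."

    def step(self, action: str) -> tuple[bool, float]:
        # Record the action, then report (done, reward) for this turn.
        self.history.append(action)
        success = action.strip() == "submit"
        done = success or len(self.history) >= self.max_turns
        return done, 1.0 if success else 0.0


def query_llm(prompt: str) -> str:
    """Stub for a model call; a real harness would hit an LLM API here."""
    return "submit"  # placeholder policy for the sketch


def evaluate(env: Environment) -> float:
    """Run one episode: alternate observations and model actions,
    returning the final reward (task success)."""
    done, reward = False, 0.0
    while not done:
        observation = env.observe()
        action = query_llm(observation)
        done, reward = env.step(action)
    return reward


if __name__ == "__main__":
    print("episode reward:", evaluate(Environment()))
```

A real harness would additionally log per-turn transcripts and aggregate success rates across all 8 environments, which is what the paper's headline comparisons report.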

Cited by 2 pages

| Page | Type | Quality |
|------|------|---------|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| Minimal Scaffolding | Capability | 52.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB

# AgentBench: Evaluating LLMs as Agents

Xiao Liu¹, Hao Yu¹, Hanchen Zhang¹, Yifan Xu¹, Xuanyu Lei¹, Hanyu Lai¹, Yu Gu², Hangliang Ding¹, Kaiwen Men¹, Kejuan Yang¹, Shudan Zhang¹, Xiang Deng², Aohan Zeng¹, Zhengxiao Du¹, Chenhui Zhang¹, Sheng Shen³, Tianjun Zhang³, Yu Su², Huan Sun², Minlie Huang¹, Yuxiao Dong¹, Jie Tang¹

¹Tsinghua University · ²The Ohio State University · ³UC Berkeley

###### Abstract

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments.
We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent’s reasoning and decision-making abilities in a multi-turn open-ended generation setting.
Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant performance disparity between them and OSS competitors.
We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents.
Training on code and high-quality multi-turn alignment data could improve agent performance.
Datasets, environments, and an integrated evaluation package for AgentBench are released at [https://github.com/THUDM/AgentBench](https://github.com/THUDM/AgentBench "").

Footnotes:
1. XL and HY are lead authors that contributed equally. Email: {shawliu9,longinyh}@gmail.com
2. Work partially done when HY, YG visited Tsinghua University.
3. Website for AgentBench leaderboard & demos: [https://llmbench.ai/agent](https://llmbench.ai/agent "")

![Figure 1](https://ar5iv.labs.arxiv.org/html/2308.03688/assets/x1.png)
Figure 1: An overview of LLMs on AgentBench. While LLMs begin to manifest their proficiency in LLM-as-Agent, gaps between models and the distance toward practical usability are significant.

### 1 Introduction

Intelligent agents and autonomous entities (Searle, [1970](https://ar5iv.labs.arxiv.org/html/2308.03688#bib.bib63 ""); Maes, [1994](https://ar5iv.labs.arxiv.org/html/2308.03688#bib.bib47 ""); Wooldridge & Jennings, [1995](https://ar5iv.labs.arxiv.org/html/2308.03688#bib.bib86 "")) that are capable of decision-making and action execution in particular environments have been key conc

... (truncated, 98 KB total)
Resource ID: d234ade2718a748e | Stable ID: N2M4MzIyMj