WebArena: A Realistic Web Environment for Agentic AI Evaluation
web · webarena.dev · webarena.dev/
WebArena is a widely used benchmark for testing LLM-based web agents; relevant to AI safety researchers studying agentic behavior, goal stability, and the risks of autonomous AI systems operating in real digital environments.
Metadata
Importance: 62/100 · tool page · tool
Summary
WebArena is a benchmark environment for evaluating autonomous web-browsing AI agents on realistic, long-horizon tasks across functional websites (e-commerce, forums, code repos, etc.). It tests agents' ability to complete complex multi-step goals requiring planning, navigation, and tool use in a self-hosted web ecosystem. The benchmark helps measure progress and identify limitations in agentic AI systems operating in realistic digital environments.
Key Points
- Provides a self-hosted, realistic web environment with functional sites (e-commerce, Reddit-like forums, GitLab, maps) for agent evaluation
- Tasks require multi-step planning, web navigation, and form interaction, testing practical agentic capabilities beyond single-turn QA
- Includes ~800 long-horizon tasks with diverse goal types, enabling rigorous benchmarking of LLM-based agents
- Highlights significant performance gaps between current AI agents and human baselines, revealing capability limitations
- Relevant to AI safety research on goal-directed agents, alignment under ambiguity, and preventing unintended actions in open-ended web tasks
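The gap between agents and human baselines mentioned above is usually expressed as a task success rate. The following is a minimal sketch of that aggregation, assuming a hypothetical record format (`site`, `success`) that is not taken from the WebArena codebase; the sample records are made up for illustration and are not benchmark results.

```python
from collections import defaultdict


def success_rates(results):
    """Aggregate per-task pass/fail records into per-site and overall rates.

    `results` is a list of dicts with hypothetical keys:
      "site"    -> name of the website the task ran on
      "success" -> True if the task's checker passed
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in results:
        totals[record["site"]] += 1
        passed[record["site"]] += int(record["success"])

    per_site = {site: passed[site] / totals[site] for site in totals}
    # Guard against an empty run so we never divide by zero.
    overall = sum(passed.values()) / sum(totals.values()) if results else 0.0
    return per_site, overall


# Illustrative records only (not real WebArena outcomes).
sample = [
    {"site": "shopping", "success": True},
    {"site": "shopping", "success": False},
    {"site": "gitlab", "success": True},
]
per_site, overall = success_rates(sample)
```

Reporting the per-site breakdown alongside the overall figure makes it easier to see whether an agent fails uniformly or only on particular kinds of sites.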
Cited by 6 pages
| Page | Type | Quality |
|---|---|---|
| Long-Horizon Autonomous Tasks | Capability | 65.0 |
| Tool Use and Computer Use | Capability | 67.0 |
| Heavy Scaffolding / Agentic Systems | Concept | 57.0 |
| Light Scaffolding | Capability | 53.0 |
| Minimal Scaffolding | Capability | 52.0 |
| Eval Saturation & The Evals Gap | Approach | 65.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# WebArena: A Realistic Web Environment for Building Autonomous Agents
[Shuyan Zhou](https://shuyanzhou.github.io/) 1\*,[Frank F. Xu](https://frankxfz.me/) 1\*,
[Hao Zhu](https://www.zhuhao.me/) 1+,
[Xuhui Zhou](https://xuhuiz.com/) 1+,
[Robert Lo](https://robertlo.tech/) 1+,
[Abishek Sridhar](https://www.linkedin.com/in/abishek-sridhar5/) 1+,
[Xianyi Cheng](https://xianyicheng.github.io/) 1,
[Tianyue Ou](https://xianyicheng.github.io/) 1,
[Yonatan Bisk](https://yonatanbisk.com/) 1,
[Daniel Fried](https://dpfried.github.io/) 1,
[Uri Alon](https://urialon.ml/) 1,
[Graham Neubig](http://www.phontron.com/) 1,2.
1 Carnegie Mellon University, 2 Inspired Cognition
\*Lead contributors. +Equal contribution.
{shuyanzh,fangzhex,gneubig}@cs.cmu.edu
[Paper](https://arxiv.org/pdf/2307.13854.pdf) [Code](https://github.com/web-arena-x/webarena) [Data](https://github.com/web-arena-x/webarena/blob/main/config_files/test.raw.json) [Docker Environment](https://github.com/web-arena-x/webarena/tree/main/environment_docker) [Leaderboard](https://docs.google.com/spreadsheets/d/1M801lEpBbKSNwP-vDBkC_pF7LdyGU1f_ufZb_NWNBZQ/edit?usp=sharing)
[Our new benchmark TheAgentCompany!](https://the-agent-company.com/)

WebArena is a standalone, self-hostable web environment for building autonomous agents. WebArena creates websites from four popular categories with functionality and data mimicking their real-world equivalents. To emulate human problem-solving, WebArena also embeds tools and knowledge resources as independent websites. WebArena introduces a benchmark on interpreting high-level realistic natural language commands into concrete web-based interactions. We provide annotated programs designed to programmatically validate the functional correctness of each task.
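To make the idea of programmatically validating functional correctness concrete, here is a minimal sketch. It is not the WebArena evaluator: the `Task` class, the `repo_exists` and `answer_contains` helpers, and the shape of the state dictionary are all hypothetical names chosen for illustration. The point is that a task is judged by the outcome (final environment state or returned answer), not by the exact sequence of clicks the agent took.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """A natural-language intent paired with a hypothetical outcome checker."""
    intent: str
    # Checker receives the final environment state and the agent's answer.
    check: Callable[[dict, str], bool]


def repo_exists(name: str) -> Callable[[dict, str], bool]:
    """Pass if a repository with this name exists in the final state."""
    def check(state: dict, answer: str) -> bool:
        return name in state.get("repositories", [])
    return check


def answer_contains(*phrases: str) -> Callable[[dict, str], bool]:
    """Pass if the agent's answer mentions every required phrase."""
    def check(state: dict, answer: str) -> bool:
        text = answer.lower()
        return all(phrase.lower() in text for phrase in phrases)
    return check


def evaluate(task: Task, final_state: dict, answer: str = "") -> bool:
    """Run the task's checker against the observed outcome."""
    return task.check(final_state, answer)


# State-changing task: judged by what exists afterwards.
create_repo = Task(
    intent="Set up a new, empty repository with the name awesome_llm_reading",
    check=repo_exists("awesome_llm_reading"),
)

# Information-seeking task: the expected phrase here is invented for the example.
order_status = Task(
    intent="Tell me the status of my latest order and when will it arrive",
    check=answer_contains("processing"),
)
```

Under these assumptions, `evaluate(create_repo, {"repositories": ["awesome_llm_reading"]})` passes regardless of which pages the agent visited, which is what lets different action sequences count as equally correct.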
## WebArena Website Demos
The videos demonstrate various tasks that can be performed in WebArena.
## Try it yourself:
[Social Forum](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:9999/forums/all) [Online Shopping](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:7770/) [Content Management (u: admin, p: admin1234)](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:7780/admin) [Collaborative Software Development](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:8023/explore) [Map](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:3000/) [Wiki](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing) [Calculator](http://metis.lti.cs.cmu.edu:4399/calculator.html) [Scratchpad](http://metis.lti.cs.cmu.edu:4399/scratchpad.html)
## Agent on Gitlab
"Set up a new, empty repository with the name awesome\_llm\_reading"
## Agent on Shopping Website
"Tell me the status of my latest order and when will it arrive"
## Realistic Tasks on WebArena

A high-level task that can be fully executed in WebArena.
Comple
... (truncated, 98 KB total)
Resource ID: c2614357fa198ba4 | Stable ID: YTI1MGY0Nj