Longterm Wiki

WebArena: A Realistic Web Environment for Agentic AI Evaluation

web
webarena.dev · webarena.dev/

WebArena is a widely used benchmark for testing LLM-based web agents. It is relevant to AI safety researchers studying agentic behavior, goal stability, and the risks of autonomous AI systems operating in real digital environments.

Metadata

Importance: 62/100 · tool page · tool

Summary

WebArena is a benchmark environment for evaluating autonomous web-browsing AI agents on realistic, long-horizon tasks across functional websites (e-commerce, forums, code repos, etc.). It tests agents' ability to complete complex multi-step goals requiring planning, navigation, and tool use in a self-hosted web ecosystem. The benchmark helps measure progress and identify limitations in agentic AI systems operating in realistic digital environments.

Key Points

  • Provides a self-hosted, realistic web environment with functional sites (e-commerce, Reddit-like forums, GitLab, maps) for agent evaluation
  • Tasks require multi-step planning, web navigation, and form interaction, testing practical agentic capabilities beyond single-turn QA
  • Includes 812 long-horizon tasks with diverse goal types, enabling rigorous benchmarking of LLM-based agents
  • Highlights a large gap between AI agents and human baselines: the paper reports a 14.41% end-to-end success rate for its best GPT-4-based agent versus 78.24% for humans
  • Relevant to AI safety research on goal-directed agents, alignment under ambiguity, and preventing unintended actions in open-ended web tasks

Cited by 6 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# WebArena: A Realistic Web Environment for Building Autonomous Agents

[Shuyan Zhou](https://shuyanzhou.github.io/) 1\*,[Frank F. Xu](https://frankxfz.me/) 1\*,

[Hao Zhu](https://www.zhuhao.me/) 1+,
[Xuhui Zhou](https://xuhuiz.com/) 1+,
[Robert Lo](https://robertlo.tech/) 1+,
[Abishek Sridhar](https://www.linkedin.com/in/abishek-sridhar5/) 1+,


[Xianyi Cheng](https://xianyicheng.github.io/) 1,
[Tianyue Ou](https://xianyicheng.github.io/) 1,
[Yonatan Bisk](https://yonatanbisk.com/) 1,
[Daniel Fried](https://dpfried.github.io/) 1,
[Uri Alon](https://urialon.ml/) 1,
[Graham Neubig](http://www.phontron.com/) 1,2.


1 Carnegie Mellon University, 2 Inspired Cognition

\* Lead contributors. + Equal contribution.

{shuyanzh,fangzhex,gneubig}@cs.cmu.edu

[Paper](https://arxiv.org/pdf/2307.13854.pdf) · [Code](https://github.com/web-arena-x/webarena) · [Data](https://github.com/web-arena-x/webarena/blob/main/config_files/test.raw.json) · [Docker Environment](https://github.com/web-arena-x/webarena/tree/main/environment_docker) · [Leaderboard](https://docs.google.com/spreadsheets/d/1M801lEpBbKSNwP-vDBkC_pF7LdyGU1f_ufZb_NWNBZQ/edit?usp=sharing)

[Our new benchmark TheAgentCompany!](https://the-agent-company.com/)

![](https://webarena.dev/static/images/overview.png)

WebArena is a standalone, self-hostable web environment for building autonomous agents. WebArena creates websites from four popular categories with functionality and data mimicking their real-world equivalents. To emulate human problem-solving, WebArena also embeds tools and knowledge resources as independent websites. WebArena introduces a benchmark on interpreting high-level, realistic natural language commands into concrete web-based interactions. We provide annotated programs designed to programmatically validate the functional correctness of each task.
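The idea of programmatically validating functional correctness can be illustrated with a minimal sketch. It judges a task by its outcome (the agent's final answer and the resulting site state) rather than by the exact sequence of clicks. The `Task` record, the `final_state` dictionary, and the function names are hypothetical and chosen only for illustration; they are not taken from the WebArena codebase or its task schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    """Hypothetical task record (an assumption, not WebArena's schema)."""
    intent: str
    # Phrases that must all appear in the agent's final answer.
    must_include: List[str] = field(default_factory=list)
    # Optional predicate over the final site state, e.g. "does the repo exist?"
    state_check: Optional[Callable[[Dict], bool]] = None


def is_success(task: Task, answer: str, final_state: Dict) -> bool:
    """Judge the outcome, not the action sequence that produced it."""
    normalized = answer.lower()
    answer_ok = all(p.lower() in normalized for p in task.must_include)
    state_ok = task.state_check is None or task.state_check(final_state)
    return answer_ok and state_ok


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks judged successful (0.0 for an empty list)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# The repository task quoted on the page, checked against a made-up state dict.
repo_task = Task(
    intent="Set up a new, empty repository with the name awesome_llm_reading",
    state_check=lambda s: "awesome_llm_reading" in s.get("repositories", []),
)

outcomes = [
    is_success(repo_task, "", {"repositories": ["awesome_llm_reading"]}),
    is_success(repo_task, "", {"repositories": []}),
]
print(success_rate(outcomes))
```

Checking the end state means any valid path to the goal counts as a success, which matters when a task can be completed through several different navigation routes.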

## WebArena Website Demos

The videos demonstrate various tasks that can be performed in WebArena.


## Try it yourself:

- [Social Forum](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:9999/forums/all)
- [Online Shopping](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:7770/)
- [Content Management (u: admin, p: admin1234)](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:7780/admin)
- [Collaborative Software Development](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:8023/explore)
- [Map](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:3000/)
- [Wiki](http://ec2-3-131-244-37.us-east-2.compute.amazonaws.com:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing)
- [Calculator](http://metis.lti.cs.cmu.edu:4399/calculator.html)
- [Scratchpad](http://metis.lti.cs.cmu.edu:4399/scratchpad.html)

## Agent on Gitlab

"Set up a new, empty repository with the name awesome\_llm\_reading"

## Agent on Shopping Website

"Tell me the status of my latest order and when will it arrive"

## Realistic Tasks on WebArena

![](https://webarena.dev/static/images/tasks.png)

A high-level task that can be fully executed in WebArena.
Comple

... (truncated, 98 KB total)
Resource ID: c2614357fa198ba4 | Stable ID: YTI1MGY0Nj