WebArena

Agentic

A realistic web-agent benchmark in which an AI agent must complete tasks on self-hosted replicas of real websites (Reddit, GitLab, shopping, maps). It tests multi-step web navigation and interaction.

Models Tested

2

Best Score

53%

Median Score

33.7%

Scoring: percentage

Introduced: 2023-07

Maintainer: CMU

Leaderboard: 2 models

#	Model	Developer	Score
🥇	GPT-4o	OpenAI	53%
🥈	GPT-4	OpenAI	14.4%
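
The summary figures on this page (best score and median score) follow directly from the leaderboard rows. As a minimal sketch, assuming the scores are task success rates in percent and that the median is taken over all listed models, they can be recomputed like this. The variable names are illustrative only and are not part of any WebArena tooling.

```python
from statistics import median

# Scores copied from the leaderboard rows on this page (percent).
scores = {"GPT-4o": 53.0, "GPT-4": 14.4}

# Best score: the highest value on the leaderboard.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# Median score: with an even number of entries, statistics.median
# returns the mean of the two middle values.
median_score = round(median(scores.values()), 1)

print(f"Best: {best_model} at {best_score}%")
print(f"Median: {median_score}%")
```

With only two models listed, the median is simply the average of the two scores, which is why it sits midway between them.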