Longterm Wiki

WebArena

Agentic

A realistic web agent benchmark in which an AI agent must complete tasks on self-hosted replicas of real websites (Reddit, GitLab, shopping, maps). It tests multi-step web navigation and interaction.

Models Tested: 2
Best Score: 53%
Median Score: 33.7%
Scoring: percentage
Introduced: 2023-07
Maintainer: CMU

Leaderboard (2 models)

#    Model    Developer    Score
🥇   GPT-4o   OpenAI       53%
🥈   GPT-4    OpenAI       14.4%
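
The best and median scores reported on this page can be reproduced from the two leaderboard entries. The following is a minimal sketch, assuming the page uses the standard median (the mean of the two middle values when the number of entries is even); the variable names are illustrative and not part of any WebArena tooling.

```python
from statistics import median

# Scores in percent, copied from the leaderboard entries on this page.
scores = {"GPT-4o": 53.0, "GPT-4": 14.4}

# Best score: the highest-scoring model.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# Median score: with two entries this is the mean of both values
# (assumption: the page uses the standard median definition).
median_score = median(scores.values())

print(f"Models tested: {len(scores)}")
print(f"Best score: {best_score:g}% ({best_model})")
print(f"Median score: {median_score:.1f}%")
```

With only two models, the median is simply the midpoint between the two scores, so it will shift noticeably as more models are added.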