Longterm Wiki

OSWorld

Agentic
A benchmark for multimodal agents on real-world computer tasks across operating systems, testing GUI interaction and task completion.
Models Tested: 3
Best Score: 72.7%
Median Score: 61.4%
Scoring: percentage
Introduced: 2024-04
Maintainer: CMU / HKU

Leaderboard (3 models)

#   Model              Developer   Score
🥇  Claude Opus 4.6    Anthropic   72.7%
🥈  Claude Sonnet 4.5  Anthropic   61.4%
🥉  Claude Haiku 4.5   Anthropic   50.7%