Longterm Wiki

OSWorld

Agentic

A benchmark that evaluates multimodal agents on real-world computer tasks across operating systems, testing both GUI interaction and end-to-end task completion.

Models Tested: 2
Best Score: 72.7%
Median Score: 67.05%
Scoring: percentage
Introduced: 2024-04
Maintainer: CMU / HKU

Leaderboard (2 models)

#    Model               Developer    Score
🥇   Claude Opus 4.6     Anthropic    72.7%
🥈   Claude Sonnet 4.5   Anthropic    61.4%
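
The summary figures (models tested, best score, median score) can be re-derived from the leaderboard scores. The following is a minimal Python sketch, not the wiki's actual computation: it assumes the median is the conventional one (the mean of the two middle values when the count is even), and the variable names are illustrative only.

```python
from statistics import median

# Scores (in percent) copied from the leaderboard rows above.
scores = {
    "Claude Opus 4.6": 72.7,
    "Claude Sonnet 4.5": 61.4,
}

models_tested = len(scores)

# Best score: the highest value on the leaderboard.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# Median: with an even number of entries, statistics.median
# averages the two middle values (assumed convention).
median_score = median(scores.values())

print(f"Models Tested: {models_tested}")
print(f"Best Score: {best_score:.1f}% ({best_model})")
print(f"Median Score: {median_score:.2f}%")
```

With only two entries, the median is simply the average of the two scores, so it will move substantially as more models are added to the leaderboard.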