OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
web · os-world.github.io/
OSWorld is a prominent benchmark for autonomous computer-use agents; relevant to AI safety discussions around agentic systems, capability evaluation, and the gap between current AI performance and human-level task completion in real-world environments.
Metadata
Importance: 72/100 · tool page · dataset
Summary
OSWorld is a benchmark for evaluating multimodal AI agents on open-ended tasks in real computer environments across multiple operating systems. It tests an agent's ability to operate GUIs, execute code, and work with real applications such as browsers, file managers, and productivity tools. In the original paper's results, the best model completed about 12% of tasks versus about 72% for humans, a large capability gap.
Key Points
- Provides a scalable, real-computer environment supporting Windows, macOS, and Linux for evaluating multimodal agents on complex tasks
- Includes 369 real-world computer tasks spanning web browsing, file management, coding, and multi-app workflows
- In the original paper, the best model achieved only ~12% success versus ~72% for humans; updated results are reported for OSWorld-Verified (July 2025)
- Evaluates agents using both GUI-based (screenshot) and accessibility-tree observation modes
- Serves as a key benchmark for measuring progress toward autonomous computer-use AI agents
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Tool Use and Computer Use | Capability | 67.0 |
| Eval Saturation & The Evals Gap | Approach | 65.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 68 KB
# OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
[Tianbao Xie](https://tianbaoxie.com/)¹, [Danyang Zhang](https://zdy023.github.io/)¹, [Jixuan Chen](https://chenjix.github.io/)¹, [Xiaochuan Li](https://xiaochuanli.com/)¹, [Siheng Zhao](https://sihengz02.github.io/)¹, [Ruisheng Cao](https://scholar.google.com/citations?user=NdK881sAAAAJ&hl=zh-CN)¹, [Toh Jing Hua](https://me.tjh.sg/)¹, [Zhoujun Cheng](https://blankcheng.github.io/)¹, [Dongchan Shin](https://www.linkedin.com/in/dongchan-shin-2a4890275/?trk=public_profile_samename-profile&originalSubdomain=hk)¹, [Fangyu Lei](https://lfy79001.github.io/)¹, [Yitao Liu](https://yitaoliu17.com/)¹, [Yiheng Xu](https://yihengxu.com/)¹, [Shuyan Zhou](https://shuyanzhou.github.io/)³, [Silvio Savarese](https://cvgl.stanford.edu/silvio/)², [Caiming Xiong](http://cmxiong.com/)², [Victor Zhong](https://www.victorzhong.com/)⁴, [Tao Yu](https://taoyds.github.io/)¹

¹The University of Hong Kong, ²Salesforce Research, ³Carnegie Mellon University, ⁴University of Waterloo
## **2025-07-28: Major Upgrade! OSWorld has been enhanced and is now [OSWorld-Verified](https://xlang.ai/blog/osworld-verified) with comprehensive improvements: fixed [community-reported examples](https://docs.google.com/spreadsheets/d/19GSOVCtYM7j3V84Zl5QiaeiEtgK_NIvqtDVXEoVb4U0/edit?gid=0#gid=0), AWS support reducing evaluation time to within 1 hour, and updated benchmark results.** See the verified benchmark results in the [Benchmark section](https://os-world.github.io/#benchmark) below. Please compare your OSWorld results with the new benchmark results when running the latest version.
[Paper](https://arxiv.org/abs/2404.07972) | [Code](https://github.com/xlang-ai/OSWorld) | [Doc](https://timothyxxx.github.io/OSWorld/) | [Data](https://github.com/xlang-ai/OSWorld/tree/main/evaluation_examples) | [Data Viewer](https://os-world.github.io/explorer.html) | [Slides](https://docs.google.com/presentation/d/1-r889Nb9n7SeZqrj-ryNqJLoMzp7aGNU2ihO8nUdEcE/edit?usp=sharing) | [Twitter](https://twitter.com/TianbaoX/status/1778781521253667267) | [Discord](https://discord.gg/4Gnw7eTEZR)

**OSWorld** is a first-of-its-kind scalable, real computer environment for multimodal agents,
supporting task setup, execution-based evaluation, and interactive learning across operating systems.
It can serve as a unified environment for evaluating open-ended computer tasks that involve arbitrary
apps (e.g., task examples in the above Fig). We also create a benchmark of 369 real-world computer
tasks in **OSWorld** with reliable, reproducible setup and evaluation scripts. _Note: 8 Google Drive tasks may require manual configuration or can be excluded (361 tasks) due to network dependencies._
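
The note about the 8 Google Drive tasks means a reported success rate depends on which denominator is used (369 or 361). The following is a minimal sketch of that arithmetic, not OSWorld's own scoring code: the `TaskResult` type, the `success_rate` helper, the task IDs, and the pass/fail outcomes are all hypothetical and exist only to illustrate the calculation.

```python
from dataclasses import dataclass

TOTAL_TASKS = 369   # benchmark size stated on the page
DRIVE_TASKS = 8     # Google Drive tasks that may be excluded


@dataclass
class TaskResult:
    task_id: str              # hypothetical identifier
    needs_google_drive: bool  # True for the network-dependent Drive tasks
    passed: bool              # outcome of the execution-based check


def success_rate(results, exclude_drive=False):
    """Fraction of scored tasks that passed, optionally skipping Drive tasks."""
    kept = [r for r in results if not (exclude_drive and r.needs_google_drive)]
    if not kept:
        raise ValueError("no tasks left to score")
    return sum(r.passed for r in kept) / len(kept)


# Made-up outcomes: 45 of the 361 non-Drive tasks pass, all Drive tasks fail.
results = [
    TaskResult(f"task-{i}", False, i < 45)
    for i in range(TOTAL_TASKS - DRIVE_TASKS)
] + [TaskResult(f"drive-{i}", True, False) for i in range(DRIVE_TASKS)]

full = success_rate(results)                         # denominator 369
subset = success_rate(results, exclude_drive=True)   # denominator 361

print(f"all tasks:      {full:.2%}")
print(f"Drive excluded: {subset:.2%}")
```

With the same number of passing tasks, the 361-task rate is slightly higher than the 369-task rate, so comparisons between runs should state which task set was scored.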
## Abstract
Autonomous agents that accomplish complex computer tasks with minimal human
interventions have the potential to transform human-computer interaction, sig
... (truncated, 68 KB total)
Resource ID: c819ef71cbf34802 | Stable ID: ZjgzNGEyMG