OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

web

os-world.github.io·os-world.github.io/

OSWorld is a prominent benchmark for autonomous computer-use agents; relevant to AI safety discussions around agentic systems, capability evaluation, and the gap between current AI performance and human-level task completion in real-world environments.

Metadata

Importance: 72/100 · tool page · dataset

Summary

OSWorld is a benchmark for evaluating multimodal AI agents performing open-ended tasks in real computer environments across multiple operating systems. It tests agents' ability to use GUIs, execute code, and interact with real applications like browsers, file systems, and productivity tools. The benchmark reveals that current state-of-the-art models achieve very low success rates compared to humans, highlighting a significant capability gap.

Key Points

• Provides a scalable, real-computer environment supporting Windows, macOS, and Linux for evaluating multimodal agents on complex tasks
• Includes 369 real-world computer tasks spanning web browsing, file management, coding, and multi-app workflows
• In the original paper's results, the best model achieved only ~12% success versus ~72% for humans, revealing a large performance gap
• Evaluates agents using both GUI-based (screenshot) and accessibility-tree observation modes
• Serves as a key benchmark for measuring progress toward autonomous computer-use AI agents
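
The success-rate comparison in the key points can be made concrete with a small sketch. It assumes per-task outcomes are stored as a mapping from a task id to a pass/fail boolean; the function names and the toy data are hypothetical and are not taken from the OSWorld codebase.

```python
def success_rate(results):
    """Fraction of tasks marked successful; `results` maps task id -> bool."""
    if not results:
        raise ValueError("no task results")
    return sum(results.values()) / len(results)


def gap_in_points(agent_rate, human_rate):
    """Human-minus-agent difference, expressed in percentage points."""
    return (human_rate - agent_rate) * 100


# Toy outcomes (hypothetical, for illustration only).
agent = {"t1": True, "t2": False, "t3": False, "t4": False}
human = {"t1": True, "t2": True, "t3": True, "t4": False}

toy_gap = gap_in_points(success_rate(agent), success_rate(human))
print(f"toy gap: {toy_gap:.0f} points")

# Applied to the rounded figures quoted above (~12% vs ~72%).
print(f"reported gap: {gap_in_points(0.12, 0.72):.0f} points")
```

With the rounded figures from the key points, the gap comes to roughly 60 percentage points.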

Cited by 2 pages

Page	Type	Quality
Tool Use and Computer Use	Capability	67.0
Eval Saturation & The Evals Gap	Approach	65.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 17 KB

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
 
Tianbao Xie¹, Danyang Zhang¹, Jixuan Chen¹, Xiaochuan Li¹, Siheng Zhao¹, Ruisheng Cao¹, Toh Jing Hua¹, Zhoujun Cheng¹, Dongchan Shin¹, Fangyu Lei¹, Yitao Liu¹, Yiheng Xu¹, Shuyan Zhou³, Silvio Savarese², Caiming Xiong², Victor Zhong⁴, Tao Yu¹

¹ The University of Hong Kong, ² Salesforce Research, ³ Carnegie Mellon University, ⁴ University of Waterloo
 

 

2025-07-28: Major Upgrade! OSWorld has been enhanced and is now OSWorld-Verified with comprehensive improvements: fixed community-reported examples, AWS support reducing evaluation time to within 1 hour, and updated benchmark results. See the verified benchmark results in the Benchmark section below. Please compare your OSWorld results with the new benchmark results when running the latest version.
 

 
**OSWorld** is a first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across operating systems. It can serve as a unified environment for evaluating open-ended computer tasks that involve arbitrary apps (e.g., the task examples in the figure above). We also create a benchmark of 369 real-world computer tasks in **OSWorld** with reliable, reproducible setup and evaluation scripts. *Note: 8 Google Drive tasks may require manual configuration or can be excluded (361 tasks) due to network dependencies.*
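
As a rough illustration of what "task setup" plus "execution-based evaluation" means, the sketch below pairs each task with a setup function and a checker that inspects the final state of the environment rather than the agent's action sequence. It is a minimal toy under stated assumptions, not OSWorld's actual API: the `Task` class, `run_task`, `evaluate`, the `needs_manual_config` flag, and the temporary directory standing in for a real computer are all hypothetical.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record: an instruction plus setup and check scripts."""
    task_id: str
    instruction: str
    setup: Callable[[Path], None]       # prepares the initial state
    check: Callable[[Path], bool]       # inspects the final state only
    needs_manual_config: bool = False   # e.g. depends on an external account


def run_task(task, agent):
    """Set up a fresh workspace, let the agent act, then check the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return task.check(workdir)


def evaluate(tasks, agent, skip_manual=True):
    """Run the selected tasks, optionally skipping those needing manual setup."""
    selected = [t for t in tasks if not (skip_manual and t.needs_manual_config)]
    return {t.task_id: run_task(t, agent) for t in selected}


def setup_notes(workdir):
    (workdir / "notes.txt").write_text("draft")


def check_renamed(workdir):
    return (workdir / "final.txt").exists() and not (workdir / "notes.txt").exists()


def toy_agent(instruction, workdir):
    """Stand-in agent that only knows how to perform the rename task."""
    src = workdir / "notes.txt"
    if src.exists():
        src.rename(workdir / "final.txt")


tasks = [
    Task("rename", "Rename notes.txt to final.txt", setup_notes, check_renamed),
    Task("drive", "Placeholder for a task needing manual configuration",
         lambda w: None, lambda w: False, needs_manual_config=True),
]

results = evaluate(tasks, toy_agent)
print(results)
```

Skipping flagged tasks mirrors the note above: excluding the 8 Google Drive tasks from the full set of 369 leaves 361, so any reported success rate should state which denominator it uses.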
 
 Abstract

 
 
 Autonomous agents that accomplish complex computer tasks with minimal human
 interventions have the potential to transform human-computer interaction, significantly enhancing
 accessibility and productivity. However, existing benchmarks
 either lack an interactive environment or are limited to environments specific to
 certain applications or domains, failing to reflect the diverse and complex nature of real-world computer
 use, thereby limiting the scope of tasks and agent
 scalability. To address this issue, we introduce **OSWorld**, the first-of-its-kind
 scalable, real computer environment for multimodal agents, supporting task setup,
 execution-based evaluation, and interactive learning across various operating systems such as Ubuntu,
 Windows, and macOS. **OSWorld** can serve as a unified,
 integrated computer environment for as

... (truncated, 17 KB total)

Resource ID: c819ef71cbf34802 | Stable ID: sid_45L3hYpyno