Tool Use and Computer Use
Tool Use and Computer Use
Tool use capabilities achieved superhuman computer control in late 2025 (OSAgent: 76.26% vs 72% human baseline) and near-human coding (Claude Opus 4.5: 80.9% SWE-bench Verified), but prompt injection remains the #1 AI vulnerability affecting 73% of deployments with OpenAI admitting it 'may never be fully solved.' Only 34.7% of organizations have deployed defenses while 97M+ monthly MCP SDK downloads indicate rapid proliferation.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Capability Maturity | Superhuman on benchmarks | OSAgent: 76.26% vs 72% human baseline on OSWorld (Oct 2025) |
| Coding Performance | Near-human | Claude Opus 4.5: 80.9% on SWE-bench Verified; 45.89% on SWE-bench Pro |
| Security Posture | Critical vulnerabilities | OWASP 2025: Prompt injection #1 vulnerability in 73% of deployments |
| Defense Readiness | Inadequate | Only 34.7% of organizations have deployed prompt injection defenses (VentureBeat) |
| Improvement Rate | Exponential | 18x improvement on SWE-bench since 2023; OSWorld 5x in 12 months |
| Industry Adoption | Accelerating | MCP: 97M+ monthly SDK downloads; 28% of Fortune 500 implemented |
| Fundamental Fix | Unlikely | OpenAI admits prompt injection "may never be fully solved" |
Key Links
| Source | Link |
|---|---|
| Wikipedia | en.wikipedia.org |
Overview
Tool use capabilities represent one of the most significant developments in AI systems, transforming language models from passive text generators into active agents capable of interacting with the external world. These capabilities span from simple API calls to sophisticated computer control, enabling AI systems to execute code, browse the web, manipulate files, and even operate desktop applications through mouse and keyboard control. The progression from Claude's computer use beta in October 2024 to increasingly sophisticated implementations across major AI labs demonstrates the rapid advancement of this critical capability area.
This evolution matters because it fundamentally changes the nature of AI systems from advisory tools that can only provide text-based recommendations to autonomous agents capable of taking concrete actions in digital environments. The implications extend far beyond enhanced functionality—tool use capabilities create new attack surfaces, complicate safety monitoring, and enable both beneficial applications like automated research assistance and concerning uses like autonomous cyber operations. As these systems become more sophisticated, understanding their capabilities, limitations, and safety implications becomes crucial for responsible deployment and governance.
The trajectory toward more capable tool-using agents appears inevitable, with major AI labs investing heavily in this area. However, the dual-use nature of these capabilities—where the same functionality that enables beneficial automation also enables potential harm—presents unique challenges for safety research and policy development that distinguish tool use from other AI capability advances.
Capability Assessment
| Dimension | Current State (Late 2025) | Trend | Safety Relevance |
|---|---|---|---|
| Function Calling | Mature — BFCL benchmark shows Claude Opus 4.1 at 70.36%, standardized via MCP | Stable | Moderate — Well-defined interfaces enable monitoring |
| Web Browsing | Advanced — ChatGPT agent/Operator integrated into main products | Continued improvement | High — OpenAI admits prompt injection "may never be fully solved" |
| Code Execution | Strong — SWE-bench Verified reaches 80.9% (Claude Opus 4.5) | Rapid improvement | High — Potential for malware, system manipulation |
| Computer Use | Superhuman — OSAgent 76.26% vs. 72% human baseline on OSWorld | Milestone achieved | Very High — Universal interface bypasses API restrictions |
| Multi-Agent Orchestration | Advancing — MCP standardization enables cross-platform coordination | Rapid development | Critical — 100% of multi-agent systems vulnerable to inter-agent exploits |
Technical Foundations and Current Implementations
Modern tool use systems typically employ a structured approach where AI models receive descriptions of available tools, generate properly formatted function calls, execute these calls in controlled environments, and process the results to continue task completion. This architecture has been implemented with varying degrees of sophistication across major AI systems. OpenAI's function calling, introduced in June 2023, established early patterns for structured API invocation with JSON schema validation and support for parallel tool execution. Google's Gemini Extensions focused on deep integration with Google's ecosystem, enabling cross-service workflows between Gmail, Calendar, and Drive.
Anthropic's Computer Use capability, launched in public beta in October 2024, represents a significant advancement by enabling direct desktop interaction. The system can take screenshots, interpret visual interfaces, move the mouse cursor, and provide keyboard input to control any application a human could operate. This universal interface approach eliminates the need for custom API integrations, though it currently operates more slowly than human users and struggles with complex visual interfaces or applications requiring rapid real-time interaction.
The underlying technical implementation relies heavily on vision-language models that can interpret screenshots and translate high-level instructions into specific UI interactions. Training these systems involves a combination of supervised fine-tuning on human demonstrations, reinforcement learning from successful task completion, and synthetic data generation. The challenge lies in teaching models both the mechanical aspects of tool operation (correct function call formatting, proper argument passing) and the strategic aspects (when to use which tools, how to recover from errors, how to chain tools effectively).
Current limitations include tool selection errors (occurring in 15-30% of complex multi-step tasks), brittle error recovery mechanisms, and difficulty with novel tools not seen during training. Research on OSWorld-Human shows that even high-performing agents take 1.4-2.7x more steps than necessary, with 75-94% of execution time spent on planning and reflection calls rather than actual task execution. Most implementations require careful prompt engineering and work best with familiar, well-documented tools rather than adapting flexibly to new interfaces or APIs.
Benchmark Performance Comparison
Performance on tool use benchmarks reveals both rapid progress and, in some cases, superhuman achievement:
| Benchmark | Task Type | Human Baseline | Best AI (2024) | Best AI (Late 2025) | Key Insight |
|---|---|---|---|---|---|
| OSWorld↗🔗 webOSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer EnvironmentsOSWorld is a prominent benchmark for autonomous computer-use agents; relevant to AI safety discussions around agentic systems, capability evaluation, and the gap between current AI performance and human-level task completion in real-world environments.OSWorld is a benchmark for evaluating multimodal AI agents performing open-ended tasks in real computer environments across multiple operating systems. It tests agents' ability ...capabilitiesevaluationcomputer-usemultimodal+3Source ↗ | Computer control | 72.4% | 14.9% (Claude 3.5) | 76.26% (OSAgent) | Superhuman achieved Oct 2025 |
| SWE-bench Verified↗🔗 web★★★★☆OpenAISWE-bench Verified - OpenAISWE-bench Verified is a curated subset of the SWE-bench coding benchmark, important for those evaluating the real-world software engineering capabilities of AI agents, especially as agentic systems become more prominent in safety-relevant deployment contexts.OpenAI collaborated with human software developers to audit and filter the original SWE-bench benchmark, removing problematic or ambiguous test samples to create SWE-bench Verif...capabilitiesevaluationagentictool-use+3Source ↗ | Code issue resolution | ≈92% | 49.0% (Claude 3.5) | 80.9% (Claude Opus 4.5) | Near-human; 18x improvement since 2023 |
| SWE-bench Pro | Private codebases | — | — | 22.7% (Claude Opus 4.1) | More realistic; drops to 17.8% on commercial code |
| GAIA↗📄 paper★★★☆☆arXiv[2311.12983] GAIA: a benchmark for General AI AssistantsGAIA is widely cited as evidence that current AI systems lack robust general-purpose capabilities despite excelling on narrow professional benchmarks, making it relevant to both capabilities assessment and realistic AI safety evaluations.Grégoire Mialon, Clémentine Fourrier, Craig Swift et al. (2023)609 citationsGAIA introduces a benchmark of 466 real-world questions requiring reasoning, multimodal handling, web browsing, and tool use, revealing a stark performance gap: humans achieve 9...evaluationcapabilitiesagillm+4Source ↗ | General assistant tasks | 92% | 15% (GPT-4) | 75% (h2oGPTe) | First "C grade" achieved in 2025 |
| BFCL | Function calling | — | — | 70.36% (Claude Opus 4.1) | GPT-5 at 59.22%; Chinese models competitive |
| WebArena↗🔗 webWebArena: A Realistic Web Environment for Agentic AI EvaluationWebArena is a widely-used benchmark for testing LLM-based web agents; relevant to AI safety researchers studying agentic behavior, goal stability, and the risks of autonomous AI systems operating in real digital environments.WebArena is a benchmark environment for evaluating autonomous web-browsing AI agents on realistic, long-horizon tasks across functional websites (e-commerce, forums, code repos,...evaluationcapabilitiesagenticcomputer-use+4Source ↗ | Web navigation | — | ≈35% | ≈55% | Realistic web task completion |
The trajectory of improvement has been extraordinary. On OSWorld, the best AI agent went from 14.9% in October 2024 to 76.26% in October 2025—a 5.1x improvement that crossed the 72% human baseline. Claude's OSWorld performance improved 45% in just four months (from 42.2% to 61.4% with Sonnet 4.5). The dramatic improvement in SWE-bench↗🔗 webSWE-bench Official LeaderboardsSWE-bench is a key industry benchmark for tracking AI coding agent capabilities; useful for understanding the pace of progress in autonomous software engineering, which has implications for AI-assisted research and recursive self-improvement risks.SWE-bench is a benchmark and leaderboard platform for evaluating AI models on real-world software engineering tasks, particularly resolving GitHub issues in open-source Python r...capabilitiesevaluationagentictool-use+3Source ↗ scores—from 4.4% in 2023 to 80.9% by late 2025 (18x improvement)—illustrates how rapidly agentic coding capabilities are advancing. On SWE-bench Pro, Claude Opus 4.5 leads at 45.89%, with Claude Sonnet 4.5 at 43.60%.
However, performance drops significantly on private codebases that models haven't seen during training. On SWE-bench Pro's commercial code subset, Claude Opus 4.1 drops from 22.7% to 17.8%, and GPT-5 falls from 23.1% to 14.9%. This suggests current high scores may partially reflect training data contamination rather than genuine generalization.
Safety Implications and Risk Landscape
Tool use capabilities introduce qualitatively different safety challenges compared to text-only AI systems. The fundamental shift from advisory outputs to autonomous action creates persistent consequences that extend beyond the AI system itself. When a language model generates harmful text, the damage remains contained to that output; when a tool-using agent executes malicious code or manipulates external systems, the effects can propagate across networks and persist indefinitely.
The expanded attack surface represents a critical concern. Each tool integration introduces potential vulnerabilities, from SQL injection through database APIs to privilege escalation through system command execution. Research by anthropic and other labs has demonstrated that current jailbreak techniques can be adapted to tool use contexts, where seemingly benign tool calls can be chained together to achieve harmful objectives. For example, a model might use legitimate web browsing tools to gather information for social engineering attacks, or combine file system access with network tools to exfiltrate sensitive data.
Monitoring and oversight become significantly more complex with tool-using agents. Traditional safety measures designed for text outputs—such as content filtering or human review of responses—prove inadequate when models can take rapid sequences of actions through external interfaces. The combinatorial explosion of possible tool interactions makes it difficult to anticipate all potential misuse patterns, and the speed of automated tool execution can outpace human oversight capabilities.
The challenge of maintaining meaningful human control becomes acute when agents can operate autonomously across multiple tools and time horizons. Current approaches like requiring human approval for specific actions face the fundamental tension between preserving utility (which requires minimizing friction) and maintaining safety (which requires meaningful oversight). As tool use becomes more sophisticated, this tension will likely intensify.
Security Vulnerability Landscape
Research on AI agent security has revealed alarming vulnerability rates. According to a comprehensive study on agent security↗📄 paper★★★☆☆arXivcomprehensive study on agent securityA 2024 UC Davis paper systematically cataloging security vulnerabilities in LLM-based AI agents, relevant to researchers and practitioners deploying agentic systems in real-world environments.Yifeng He, Ethan Wang, Yuyang Rong et al. (2024)25 citationsThis paper provides a comprehensive security analysis of LLM-based AI agents, identifying critical vulnerabilities arising from their tool-use and command-execution capabilities...ai-safetytechnical-safetyred-teamingdeployment+4Source ↗, the attack surface for tool-using agents is significantly larger than for text-only systems. According to OWASP's 2025 Top 10 for LLM Applications, prompt injection ranks as the #1 critical vulnerability, appearing in over 73% of production AI deployments assessed during security audits.
Attack Volume and Defense Gap
| Metric | Value | Source |
|---|---|---|
| Attack sessions targeting AI infrastructure (Q4 2025) | 91,000+ | eSecurity Planet |
| Organizations with prompt injection defenses | 34.7% | VentureBeat survey |
| Attack success rate against state-of-the-art defenses | 85%+ | MDPI meta-analysis (78 studies, 2021-2026) |
| RAG poisoning success with 5 documents | 90% | Lakera research |
| Organizations using AI tools in cloud | 84% | Orca Security |
| Organizations with vulnerable AI packages | 62% | Orca Security |
| Cloud breaches involving AI workloads | 33% | Cloud Security Alliance |
| Vulnerability Type | Prevalence | Severity | Example Attack |
|---|---|---|---|
| Prompt Injection | 94.4% of agents vulnerable; OWASP #1 threat | Critical | Malicious instructions hidden in web content |
| Retrieval-Based Backdoors | 83.3% vulnerable | High | Poisoned documents trigger unintended behavior |
| Inter-Agent Trust Exploits | 100% vulnerable | Critical | Compromised agent manipulates others in multi-agent systems |
| Memory Poisoning | Common | High | Gradual alteration of agent behavior through corrupted context |
| Excessive Agency | Common | High | Over-permissioned agents cause unintended damage |
In December 2025, OpenAI made a significant admission: prompt injection "may never be fully solved." In their technical blog on hardening ChatGPT Atlas, they stated that "prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully 'solved.'" The UK's National Cyber Security Centre issued a similar warning that prompt injection attacks against generative AI applications "may never be totally mitigated." A meta-analysis of 78 studies (2021-2026) found that attack success rates against state-of-the-art defenses exceed 85% when adaptive attack strategies are employed.
Real-world incidents have demonstrated these risks. The EchoLeak exploit (CVE-2025-32711)↗🔗 webEchoLeak exploit (CVE-2025-32711)A Unit 42 security research disclosure detailing a concrete agentic AI exploit; highly relevant for practitioners building or auditing AI agent systems that interact with external tools and data sources.Unit 42 (Palo Alto Networks) analyzes EchoLeak (CVE-2025-32711), a vulnerability in agentic AI systems that allows adversarial prompt injection via tool/function calls and API i...red-teamingtechnical-safetydeploymentai-safety+5Source ↗ against Microsoft Copilot showed how infected email messages containing engineered prompts could trigger automatic data exfiltration without user interaction. Experiments with OpenAI's Operator↗🔗 web★★★☆☆McKinsey & CompanyDeploying Agentic AI with Safety and Security: A Playbook for Technology LeadersThis McKinsey piece targets enterprise technology leaders deploying agentic AI, offering a business-risk perspective on AI safety and security; content was inaccessible at time of analysis, so metadata is inferred from URL and title.A McKinsey practitioner-oriented guide for technology leaders on safely deploying agentic AI systems in enterprise contexts. The resource likely covers risk frameworks, security...agenticdeploymentgovernancetechnical-safety+3Source ↗ demonstrated how agents could harvest personal data and automate credential stuffing attacks. Brave's security research on the Perplexity Comet vulnerability confirmed that indirect prompt injection is "not an isolated issue, but a systemic challenge facing the entire category of AI-powered browsers."
OpenAI developed an "LLM-based automated attacker"—a bot trained using reinforcement learning to discover prompt injection vulnerabilities. Unlike traditional red-teaming, this system can "steer an agent into executing sophisticated, long-horizon harmful workflows that unfold over tens (or even hundreds) of steps." In one demonstration, the automated attacker inserted a malicious email into a test inbox; when the AI agent scanned emails, it followed hidden instructions and sent a resignation message instead of the intended out-of-office reply.
Diagram (loading…)
flowchart TD
INPUT[User Input] --> AGENT[AI Agent]
TOOLS[Tool Access] --> AGENT
EXTERNAL[External Data] --> AGENT
AGENT --> ACTIONS[Autonomous Actions]
subgraph ATTACKS[Attack Vectors]
PI[Prompt Injection]
MP[Memory Poisoning]
TE[Tool Exploitation]
IC[Inter-Agent Compromise]
end
INPUT -.->|Malicious prompts| PI
EXTERNAL -.->|Poisoned content| PI
TOOLS -.->|Privilege escalation| TE
AGENT -.->|Corrupted context| MP
PI --> ACTIONS
MP --> ACTIONS
TE --> ACTIONS
IC --> ACTIONS
ACTIONS --> HARM[Persistent Harm]
style ATTACKS fill:#fee
style HARM fill:#fcc
style AGENT fill:#e6f3ffMcKinsey's agentic AI security playbook↗🔗 web★★★☆☆McKinsey & CompanyDeploying Agentic AI with Safety and Security: A Playbook for Technology LeadersThis McKinsey piece targets enterprise technology leaders deploying agentic AI, offering a business-risk perspective on AI safety and security; content was inaccessible at time of analysis, so metadata is inferred from URL and title.A McKinsey practitioner-oriented guide for technology leaders on safely deploying agentic AI systems in enterprise contexts. The resource likely covers risk frameworks, security...agenticdeploymentgovernancetechnical-safety+3Source ↗ emphasizes that organizations should enforce strong sandboxing with network restrictions, implement tamper-resistant logging of all agent actions, and maintain traceability mechanisms from the outset.
Computer Use as a Universal Interface
Computer use capabilities deserve special attention because they represent a universal interface that can potentially access any digital functionality available to human users. Unlike API-specific tool integrations that require custom development for each service, computer control enables AI agents to operate any software through the same visual interface humans use. This universality creates both tremendous potential and significant risks.
In October 2025, AI agents crossed the human baseline on computer control for the first time. AGI Inc.'s OSAgent achieved 76.26% on OSWorld, exceeding the approximately 72% human baseline. The agent uses continuous self-checking (the "verification-generation gap"): it verifies outcomes in real time and corrects on the next turn when a step fails. Training combines a general-reasoning base model with hundreds of thousands of synthetic tasks and real browser environments.
| Model/System | OSWorld Score | Date | Key Technique |
|---|---|---|---|
| Claude 3.5 (baseline) | 14.9% | Oct 2024 | Vision-language + screenshot analysis |
| Claude 3.7 | 28.0% | Feb 2025 | Improved planning and error recovery |
| Agent S2 + Claude 3.7 | 34.5% | Mar 2025 | Specialized agentic scaffolding |
| Claude Sonnet 4 | 42.2% | July 2025 | Enhanced tool use training |
| Claude Sonnet 4.5 | 61.4% | Nov 2025 | 45% improvement in 4 months |
| Claude Opus 4.5 | 66.3% | Nov 2025 | Extended autonomous operation |
| Agent S3 (Best-of-N) | 69.9% | Oct 2025 | Behavior Best-of-N techniques |
| OSAgent | 76.26% | Oct 2025 | Self-verification, synthetic data |
| Human baseline | ≈72% | — | — |
However, efficiency remains a significant limitation. Even high-performing agents take 1.4-2.7x more steps than necessary to complete tasks. What humans can accomplish in 30 seconds might take an agent 12 minutes—primarily because 75-94% of the time is spent on planning and reflection calls to large AI models rather than actual task execution.
Claude's updated computer use tool (January 2025) added new capabilities including hold_key, left_mouse_down, left_mouse_up, scroll, triple_click, and wait commands, plus a zoom feature for viewing specific screen regions at full resolution. These granular controls enable more precise UI interactions that were previously unreliable.
The implications of reliable computer use extend across virtually every domain of human digital activity. Positive applications include accessibility tools for users with disabilities, automated testing and quality assurance, and research assistance that can navigate complex information systems. Concerning applications include automated social engineering attacks, mass surveillance through social media manipulation, and autonomous malware that can adapt to novel security measures.
Tool Integration Standards: Model Context Protocol
The Model Context Protocol (MCP)↗✏️ blog★★★★☆AnthropicIntroducing the Model Context ProtocolMCP is Anthropic's proposed open standard for AI-tool connectivity; relevant to AI safety discussions around agentic systems, tool use, and how models interact with external environments in deployment contexts.Anthropic introduces the Model Context Protocol (MCP), an open standard that enables AI assistants to securely connect to external data sources, tools, and services. MCP provide...capabilitiesdeploymenttechnical-safetytools+2Source ↗, announced by Anthropic in November 2024, represents a significant step toward standardizing AI-tool integration. MCP addresses what engineers called the "M×N problem"—the combinatorial explosion of connecting M different AI models with N different tools or data sources. By providing a universal protocol, developers implement MCP once and unlock an entire ecosystem of integrations.
| Aspect | Details |
|---|---|
| Architecture | JSON-RPC 2.0 transport, similar to Language Server Protocol (LSP) |
| Primitives | Servers: Prompts, Resources, Tools; Clients: Roots, Sampling |
| SDKs | Python, TypeScript, C#, Java (97M+ monthly SDK downloads) |
| Pre-built Servers | 10,000+ published servers (Google Drive, Slack, GitHub, Git, Postgres, Puppeteer) |
| Adoption | Claude, ChatGPT, Gemini, Cursor, VS Code, Microsoft Copilot |
| Governance | Donated to Linux Foundation's Agentic AI Foundation (AAIF) Dec 2025 |
| Co-founders | Anthropic, Block, OpenAI (with support from Google, Microsoft, AWS, Cloudflare) |
MCP Adoption Trajectory
| Metric | Value | Date | Source |
|---|---|---|---|
| Monthly SDK downloads | 97M+ | Jan 2026 | MCP Manager |
| Registered MCP servers | 10,000+ | Jan 2026 | Pento |
| Fortune 500 adoption | 28% | Q1 2025 | Deepak Gupta |
| Fintech sector adoption | 45% | Q1 2025 | Deepak Gupta |
| Remote MCP server growth | 4x | May-Dec 2025 | Zuplo Report |
| Enterprise AI agent adoption projected | 85% | End of 2025 | CData |
| Market size projection | $4.5B | 2025 | Deepak Gupta |
In December 2025, MCP became a founding project of the newly created Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation. The donation ensures MCP "stays open, neutral, and community-driven as it becomes critical infrastructure for AI." OpenAI officially adopted MCP in March 2025, integrating the standard across its products including the ChatGPT desktop app.
The rapid uptake of MCP—with over 10,000 published servers and 97M+ monthly SDK downloads—suggests growing consensus around standardized tool integration. Fortune 500 adoption reached 28% by Q1 2025, with fintech leading at 45%. This standardization has dual implications for safety: it enables more consistent monitoring and security practices, but also accelerates the proliferation of tool-using capabilities across the AI ecosystem. Bloomberg noted that MCP provides "the essential connective layer required" for agentic AI systems that "do far more than simple question-answering."
Current State and Near-Term Trajectory
As of late 2025, tool use capabilities have reached several significant milestones. On OSWorld, AI agents now achieve superhuman performance (76.26% vs. 72% human baseline). Claude Opus 4.5 achieved 80.9% on SWE-bench Verified and demonstrated the ability to work autonomously for 30+ hours while maintaining focus on complex multi-step tasks. In one demonstration, Claude Sonnet 4.5 autonomously rebuilt Claude.ai's web application over approximately 5.5 hours with 3,000+ tool uses.
Despite accuracy improvements on benchmarks, efficiency remains a significant limitation. Research on OSWorld-Human reveals that even high-performing agents take 1.4-2.7x more steps than humans to complete tasks. What humans can accomplish in 30 seconds might take an agent 12 minutes—primarily because 75-94% of the time is spent on planning and reflection calls to large AI models.
Safety research has not kept pace with capability development. OpenAI's December 2025 admission that prompt injection "may never be fully solved" represents a significant acknowledgment. According to a VentureBeat survey, only 34.7% of organizations have deployed dedicated prompt injection defenses—leaving 65.3% unprotected. While defensive approaches are advancing—including Google DeepMind's CaMel framework (which treats LLMs as untrusted elements) and Microsoft's FIDES (using information-flow control)—no production-ready solution exists for the fundamental vulnerability.
The economic incentives for tool use development remain exceptionally strong. OpenAI's GPT-5 leads MCPMark performance at approximately $127.46 per benchmark run, compared to Claude Sonnet 4 at $152.41. Organizations recognize the potential for significant productivity gains through automated digital workflows, creating pressure for rapid deployment even before safety questions are fully resolved.
Key Uncertainties and Research Frontiers
Several critical uncertainties will shape the development of tool-using AI systems over the coming years. The scalability of current training approaches remains unclear—while supervised fine-tuning and reinforcement learning have produced impressive demonstrations, it's uncertain whether these methods can reliably teach agents to use arbitrary new tools or adapt to changing interfaces without extensive retraining.
The fundamental question of AI control in tool use contexts presents perhaps the most significant uncertainty. Current approaches to AI safety were developed primarily for language models that could only provide advice; extending these techniques to autonomous agents presents novel challenges that may require entirely new safety paradigms. The effectiveness of proposed solutions like constitutional AI, interpretability research, and formal verification methods for tool-using agents remains largely untested.
The interaction between tool use capabilities and other AI advances creates additional uncertainty. As models become more capable of long-term planning, steganography, and deception, the risks associated with tool use may increase non-linearly. Conversely, advances in AI safety research may provide new tools for monitoring and controlling autonomous agents.
Economic and regulatory responses will significantly influence the development trajectory. Industry self-regulation, government oversight, and international coordination efforts could substantially alter the pace and direction of tool use development. However, the dual-use nature of these capabilities makes targeted regulation challenging without hampering beneficial applications.
The technical question of whether safe, beneficial tool use is possible at scale remains open. While current systems demonstrate both impressive capabilities and significant safety challenges, it's unclear whether fundamental barriers exist to creating reliable, beneficial tool-using agents or whether current problems represent engineering challenges that will be resolved through continued research and development.
Timeline
| Date | Event | Significance |
|---|---|---|
| June 2023 | OpenAI introduces function calling↗🔗 web★★★★☆OpenAIOpenAI introduces function callingThis announcement is relevant to AI safety as function calling is a core capability enabling agentic LLM systems that autonomously interact with external tools, raising questions about oversight, reliability, and safe deployment of AI agents.OpenAI announced function calling capabilities for GPT-4 and GPT-3.5-turbo, enabling models to intelligently output structured JSON to call external tools and APIs. The update a...capabilitiesdeploymentapi-integrationfunction-calling+3Source ↗ | Establishes structured API invocation pattern for LLMs |
| Nov 2023 | GAIA benchmark released | First comprehensive test for general AI assistants with tool use |
| Apr 2024 | OSWorld benchmark published (NeurIPS 2024) | Standardized evaluation for computer control agents |
| Aug 2024 | SWE-bench Verified released↗🔗 web★★★★☆OpenAISWE-bench Verified - OpenAISWE-bench Verified is a curated subset of the SWE-bench coding benchmark, important for those evaluating the real-world software engineering capabilities of AI agents, especially as agentic systems become more prominent in safety-relevant deployment contexts.OpenAI collaborated with human software developers to audit and filter the original SWE-bench benchmark, removing problematic or ambiguous test samples to create SWE-bench Verif...capabilitiesevaluationagentictool-use+3Source ↗ | Human-validated coding benchmark; collaboration with OpenAI |
| Oct 2024 | Anthropic launches Computer Use beta↗✏️ blog★★★★☆AnthropicIntroducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 HaikuMarks a significant capability milestone for frontier AI: agentic computer control raises new safety and oversight challenges as AI systems can now autonomously interact with real software environments, relevant to discussions of AI action-taking and human oversight.Anthropic announces a major capability expansion: Claude 3.5 Sonnet gains 'computer use' ability (controlling mouse, keyboard, and screen), an upgraded Claude 3.5 Sonnet with im...capabilitiesdeploymenttechnical-safetyanthropic+3Source ↗ | First frontier model with direct desktop control |
| Nov 2024 | Model Context Protocol announced↗✏️ blog★★★★☆AnthropicIntroducing the Model Context ProtocolMCP is Anthropic's proposed open standard for AI-tool connectivity; relevant to AI safety discussions around agentic systems, tool use, and how models interact with external environments in deployment contexts.Anthropic introduces the Model Context Protocol (MCP), an open standard that enables AI assistants to securely connect to external data sources, tools, and services. MCP provide...capabilitiesdeploymenttechnical-safetytools+2Source ↗ | Open standard for AI-tool integration |
| Dec 2024 | Claude 3.5 Sonnet achieves 49% on SWE-bench | Significant jump in agentic coding capability |
| Jan 2025 | OpenAI launches Operator | Browser-based agentic AI with Computer-Using Agent (CUA) model |
| Feb 2025 | Claude 3.7 reaches 28% on OSWorld | Top leaderboard position at release |
| Mar 2025 | OpenAI officially adopts MCP | Integration across ChatGPT desktop app |
| Apr 2025 | Google DeepMind introduces CaMel framework | Treats LLMs as untrusted elements for security |
| July 2025 | ChatGPT agent mode launched | Operator integrated into main ChatGPT product |
| July 2025 | OSWorld-Verified released | Major benchmark updates, AWS parallelization support |
| Oct 2025 | OSAgent achieves 76.26% on OSWorld | First superhuman performance on computer control benchmark |
| Nov 2025 | Claude Opus 4.5 released | 80.9% on SWE-bench, 66.3% on OSWorld; 30+ hour autonomous operation |
| Dec 2025 | Scale AI releases SWE-bench Pro | Harder benchmark with private/commercial codebases |
| Dec 2025 | MCP donated to Linux Foundation AAIF | Industry standardization; co-founded by Anthropic, Block, OpenAI |
| Dec 2025 | OpenAI admits prompt injection "may never be fully solved" | Critical security acknowledgment for agentic AI |
Sources and Resources
Benchmarks and Evaluations
- GAIA Benchmark↗📄 paper★★★☆☆arXiv[2311.12983] GAIA: a benchmark for General AI AssistantsGAIA is widely cited as evidence that current AI systems lack robust general-purpose capabilities despite excelling on narrow professional benchmarks, making it relevant to both capabilities assessment and realistic AI safety evaluations.Grégoire Mialon, Clémentine Fourrier, Craig Swift et al. (2023)609 citationsGAIA introduces a benchmark of 466 real-world questions requiring reasoning, multimodal handling, web browsing, and tool use, revealing a stark performance gap: humans achieve 9...evaluationcapabilitiesagillm+4Source ↗ — General AI Assistants evaluation
- OSWorld↗🔗 webOSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer EnvironmentsOSWorld is a prominent benchmark for autonomous computer-use agents; relevant to AI safety discussions around agentic systems, capability evaluation, and the gap between current AI performance and human-level task completion in real-world environments.OSWorld is a benchmark for evaluating multimodal AI agents performing open-ended tasks in real computer environments across multiple operating systems. It tests agents' ability ...capabilitiesevaluationcomputer-usemultimodal+3Source ↗ — Computer control benchmark (NeurIPS 2024)
- SWE-bench↗🔗 webSWE-bench Official LeaderboardsSWE-bench is a key industry benchmark for tracking AI coding agent capabilities; useful for understanding the pace of progress in autonomous software engineering, which has implications for AI-assisted research and recursive self-improvement risks.SWE-bench is a benchmark and leaderboard platform for evaluating AI models on real-world software engineering tasks, particularly resolving GitHub issues in open-source Python r...capabilitiesevaluationagentictool-use+3Source ↗ — Real-world coding issue resolution
- WebArena↗🔗 webWebArena: A Realistic Web Environment for Agentic AI EvaluationWebArena is a widely-used benchmark for testing LLM-based web agents; relevant to AI safety researchers studying agentic behavior, goal stability, and the risks of autonomous AI systems operating in real digital environments.WebArena is a benchmark environment for evaluating autonomous web-browsing AI agents on realistic, long-horizon tasks across functional websites (e-commerce, forums, code repos,...evaluationcapabilitiesagenticcomputer-use+4Source ↗ — Web navigation benchmark
- ToolEmu↗📄 paper★★★☆☆arXiv[2309.15817] Identifying the Risks of LM Agents with an LM-Emulated SandboxA key empirical paper on agentic LLM safety evaluation; introduces ToolEmu as a scalable testing sandbox relevant to researchers studying risks of tool-using AI agents prior to real-world deployment.Yangjun Ruan, Honghua Dong, Andrew Wang et al. (2023)250 citationsToolEmu is a framework that uses a language model to emulate tool execution for testing LM agents against safety risks, eliminating the need for manual tool implementation. It i...ai-safetyevaluationred-teamingcapabilities+3Source ↗ — Safety evaluation for LLM tool use
- SWE-bench Pro — Harder benchmark with private codebases (Scale AI, Dec 2025)
- OSWorld-Human — Efficiency benchmarking for computer-use agents
Industry Resources
- Anthropic Computer Use Announcement↗✏️ blog★★★★☆AnthropicIntroducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 HaikuMarks a significant capability milestone for frontier AI: agentic computer control raises new safety and oversight challenges as AI systems can now autonomously interact with real software environments, relevant to discussions of AI action-taking and human oversight.Anthropic announces a major capability expansion: Claude 3.5 Sonnet gains 'computer use' ability (controlling mouse, keyboard, and screen), an upgraded Claude 3.5 Sonnet with im...capabilitiesdeploymenttechnical-safetyanthropic+3Source ↗
- Model Context Protocol Documentation↗✏️ blog★★★★☆AnthropicIntroducing the Model Context ProtocolMCP is Anthropic's proposed open standard for AI-tool connectivity; relevant to AI safety discussions around agentic systems, tool use, and how models interact with external environments in deployment contexts.Anthropic introduces the Model Context Protocol (MCP), an open standard that enables AI assistants to securely connect to external data sources, tools, and services. MCP provide...capabilitiesdeploymenttechnical-safetytools+2Source ↗
- OpenAI Function Calling Guide↗🔗 web★★★★☆OpenAIOpenAI Function Calling GuideUseful reference for understanding the technical mechanics of tool-augmented LLMs and AI agents; relevant to safety discussions about agentic systems, capability elicitation, and controlling model actions in real-world environments.Official OpenAI documentation explaining how to enable language models to call external functions and APIs, allowing structured outputs and integration with real-world tools. It...capabilitiesdeploymentapi-integrationfunction-calling+4Source ↗
- Linux Foundation AAIF Announcement — MCP donation and foundation formation (Dec 2025)
- OpenAI Operator/ChatGPT Agent — Browser-based agentic AI
- Claude Opus 4.5 Announcement — 30+ hour autonomous operation
Security Research
- Security of AI Agents (arXiv)↗📄 paper★★★☆☆arXivcomprehensive study on agent securityA 2024 UC Davis paper systematically cataloging security vulnerabilities in LLM-based AI agents, relevant to researchers and practitioners deploying agentic systems in real-world environments.Yifeng He, Ethan Wang, Yuyang Rong et al. (2024)25 citationsThis paper provides a comprehensive security analysis of LLM-based AI agents, identifying critical vulnerabilities arising from their tool-use and command-execution capabilities...ai-safetytechnical-safetyred-teamingdeployment+4Source ↗ — Comprehensive vulnerability analysis
- McKinsey Agentic AI Security Playbook↗🔗 web★★★☆☆McKinsey & CompanyDeploying Agentic AI with Safety and Security: A Playbook for Technology LeadersThis McKinsey piece targets enterprise technology leaders deploying agentic AI, offering a business-risk perspective on AI safety and security; content was inaccessible at time of analysis, so metadata is inferred from URL and title.A McKinsey practitioner-oriented guide for technology leaders on safely deploying agentic AI systems in enterprise contexts. The resource likely covers risk frameworks, security...agenticdeploymentgovernancetechnical-safety+3Source ↗
- Google SAIF 2.0↗🔗 web★★★★☆Google AIGoogle Secure AI Framework (SAIF) 2.0: AI Security Strategy and ToolsRelevant to practitioners implementing secure AI systems; SAIF 2.0 bridges technical AI safety controls with organizational security practices, making it useful for teams navigating deployment-time risks and compliance requirements.Google's Secure AI Framework (SAIF) 2.0 outlines Google's evolving strategy for securing AI systems against emerging threats, including adversarial attacks, model theft, and dat...ai-safetytechnical-safetydeploymentred-teaming+3Source ↗ — Secure AI Framework for agents
- Palo Alto Unit 42: Agentic AI Threats↗🔗 webEchoLeak exploit (CVE-2025-32711)A Unit 42 security research disclosure detailing a concrete agentic AI exploit; highly relevant for practitioners building or auditing AI agent systems that interact with external tools and data sources.Unit 42 (Palo Alto Networks) analyzes EchoLeak (CVE-2025-32711), a vulnerability in agentic AI systems that allows adversarial prompt injection via tool/function calls and API i...red-teamingtechnical-safetydeploymentai-safety+5Source ↗
- OpenAI: Hardening Atlas Against Prompt Injection — Dec 2025 security admission
- OWASP Top 10 for LLM Applications 2025 — Prompt injection as #1 vulnerability
- Brave: Unseeable Prompt Injections — AI browser vulnerability research
AI Index and Analysis
- Stanford HAI AI Index 2025↗🔗 web★★★★☆Stanford HAITechnical Performance - 2025 AI Index ReportThis annual Stanford HAI report is widely cited by researchers and policymakers tracking AI capability trends; relevant to AI safety discussions about the pace of progress and the adequacy of current evaluation frameworks.The Stanford HAI 2025 AI Index Report documents rapid advances in AI technical performance, including accelerating benchmark saturation, convergence across frontier model capabi...capabilitiesevaluationbenchmarkstechnical-safety+3Source ↗ — Technical performance trends
- Evidently AI Agent Benchmarks Guide↗🔗 webAI Agent Benchmarks 2025Useful for AI safety researchers tracking capability advances in agentic LLMs; understanding current benchmarks helps identify where evaluation may lag behind real-world risk-relevant behaviors.A comprehensive overview of state-of-the-art benchmarks for evaluating AI agent capabilities, including multi-turn interactions, tool use, web navigation, and collaborative task...capabilitiesevaluationllmdeployment+2Source ↗
- AGI Inc. OSAgent — First superhuman computer control (Oct 2025)
References
The Stanford HAI 2025 AI Index Report documents rapid advances in AI technical performance, including accelerating benchmark saturation, convergence across frontier model capabilities, and the emergence of new reasoning paradigms. It provides a comprehensive empirical overview of where AI systems stand relative to human-level performance across diverse tasks. The report serves as a key annual reference for tracking the pace and direction of AI capability progress.
GAIA introduces a benchmark of 466 real-world questions requiring reasoning, multimodal handling, web browsing, and tool use, revealing a stark performance gap: humans achieve 92% accuracy while GPT-4 with plugins achieves only 15%. The paper argues that AGI development should prioritize human-level robustness on practical everyday tasks rather than superhuman performance on narrow professional domains.
This paper provides a comprehensive security analysis of LLM-based AI agents, identifying critical vulnerabilities arising from their tool-use and command-execution capabilities. The authors examine these vulnerabilities from a system security perspective and propose corresponding defense mechanisms, evaluating their effectiveness experimentally.
SWE-bench is a benchmark and leaderboard platform for evaluating AI models on real-world software engineering tasks, particularly resolving GitHub issues in open-source Python repositories. It offers multiple dataset variants (Lite, Verified, Multimodal) and standardized metrics to compare coding agents. It has become a widely-used standard for assessing the practical software engineering capabilities of LLM-based agents.
OpenAI announced function calling capabilities for GPT-4 and GPT-3.5-turbo, enabling models to intelligently output structured JSON to call external tools and APIs. The update also included longer context windows, cost reductions, and more steerable model versions, marking a significant expansion of LLM integration into agentic workflows.
A McKinsey practitioner-oriented guide for technology leaders on safely deploying agentic AI systems in enterprise contexts. The resource likely covers risk frameworks, security considerations, and governance practices for AI agents that can take autonomous actions. Content is inaccessible due to access restrictions.
7[2309.15817] Identifying the Risks of LM Agents with an LM-Emulated SandboxarXiv·Yangjun Ruan et al.·2023·Paper▸
ToolEmu is a framework that uses a language model to emulate tool execution for testing LM agents against safety risks, eliminating the need for manual tool implementation. It includes an automated LM-based safety evaluator and a benchmark of 36 high-stakes tools and 144 test cases. Results show even the safest LM agents fail nearly 24% of the time, revealing critical gaps before real-world deployment.
Google's Secure AI Framework (SAIF) 2.0 outlines Google's evolving strategy for securing AI systems against emerging threats, including adversarial attacks, model theft, and data poisoning. It provides a structured framework for organizations to assess and improve their AI security posture. The post introduces updated tools, guidelines, and best practices aligned with the latest AI deployment risks.
WebArena is a benchmark environment for evaluating autonomous web-browsing AI agents on realistic, long-horizon tasks across functional websites (e-commerce, forums, code repos, etc.). It tests agents' ability to complete complex multi-step goals requiring planning, navigation, and tool use in a self-hosted web ecosystem. The benchmark helps measure progress and identify limitations in agentic AI systems operating in realistic digital environments.
OSWorld is a benchmark for evaluating multimodal AI agents performing open-ended tasks in real computer environments across multiple operating systems. It tests agents' ability to use GUIs, execute code, and interact with real applications like browsers, file systems, and productivity tools. The benchmark reveals that current state-of-the-art models achieve very low success rates compared to humans, highlighting a significant capability gap.
Unit 42 (Palo Alto Networks) analyzes EchoLeak (CVE-2025-32711), a vulnerability in agentic AI systems that allows adversarial prompt injection via tool/function calls and API integrations, enabling data exfiltration and unauthorized actions. The research demonstrates how multi-step AI agents can be compromised through malicious content in external data sources, highlighting systemic risks in agentic architectures. It serves as a concrete case study in real-world AI security vulnerabilities.
OpenAI collaborated with human software developers to audit and filter the original SWE-bench benchmark, removing problematic or ambiguous test samples to create SWE-bench Verified. This improved benchmark provides more reliable and fair evaluations of AI models' ability to solve real-world software engineering tasks. It addresses concerns that inflated or misleading scores on the original benchmark obscured true model capabilities.
Official OpenAI documentation explaining how to enable language models to call external functions and APIs, allowing structured outputs and integration with real-world tools. It covers how to define functions, pass them to models, and handle model-generated function call responses. This capability is central to building AI agents and tool-augmented systems.
A comprehensive overview of state-of-the-art benchmarks for evaluating AI agent capabilities, including multi-turn interactions, tool use, web navigation, and collaborative tasks. The resource surveys how these benchmarks stress-test LLMs in realistic, complex scenarios to better measure practical performance. It serves as a reference guide for researchers and practitioners assessing agent progress.
SWE-bench Pro is a rigorous benchmark by Scale AI that evaluates AI agents on real-world software engineering tasks drawn from both public and private repositories. It addresses limitations of existing benchmarks by emphasizing realistic, challenging problem-solving scenarios. The leaderboard tracks and compares performance of leading AI coding agents.
Anthropic announces Claude Opus 4.5, their most capable model optimized for coding, agentic tasks, and computer use, with significantly reduced pricing ($5/$25 per million tokens). The model demonstrates state-of-the-art performance on software engineering benchmarks, long-horizon autonomous tasks, and multi-step reasoning while being notably more token-efficient than predecessors.