
Measuring AI Ability to Complete Long Tasks - METR


Credibility Rating

4/5 (High)

High quality: an established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

Published by METR (Model Evaluation and Threat Research) in March 2025, this research is directly relevant to AI safety evaluations, informing thresholds for capability-based deployment decisions and governance frameworks.

Metadata

Importance: 78/100 · blog post · primary source

Summary

METR presents empirical research showing that AI models' ability to complete increasingly long autonomous tasks is growing exponentially, with the maximum task length that models can successfully complete roughly doubling every 7 months. This 'task length' metric serves as a practical proxy for measuring real-world AI capability progression and agentic autonomy.
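
To make the growth rate concrete: a fixed 7-month doubling time means the horizon multiplies by 2^(m/7) after m months. The Python sketch below works this out; the 60-minute starting horizon and the dates are illustrative assumptions, not figures from the post.

```python
from datetime import date

DOUBLING_MONTHS = 7  # doubling time reported by METR


def extrapolate_horizon(h0_minutes: float, start: date, target: date) -> float:
    """Project a task-length horizon forward under a fixed doubling time."""
    months = (target.year - start.year) * 12 + (target.month - start.month)
    return h0_minutes * 2 ** (months / DOUBLING_MONTHS)


# Illustrative: an assumed 60-minute horizon in March 2025, projected 5 years out.
projected = extrapolate_horizon(60, date(2025, 3, 1), date(2030, 3, 1))
print(f"{projected:,.0f} minutes (~{projected / 60:.0f} hours)")
# -> roughly 22,800 minutes (~380 hours): days-to-weeks of human effort
```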

Key Points

  • AI task completion horizon (the longest tasks models can reliably complete) has been doubling approximately every 7 months across recent frontier models.
  • The metric focuses on autonomous, multi-step task completion rather than narrow benchmarks, better reflecting real-world agentic capability.
  • Exponential growth in task length has significant implications for estimating when AI could perform complex, extended work autonomously, including dangerous tasks.
  • This trajectory suggests AI agents capable of weeks-long autonomous work may arrive sooner than expected, raising urgent safety and governance concerns.
  • METR's approach provides a more practically meaningful capability metric than traditional benchmarks for tracking progress toward transformative AI.

Review

METR's research introduces an innovative approach to measuring AI capabilities: tracking the length of tasks that generalist models can complete autonomously. By recording how long human experts take to complete various software and reasoning tasks, the researchers developed a method to characterize AI models' performance across different task durations. Their key finding is a remarkably consistent exponential trend, with the length of tasks models can complete doubling roughly every 7 months over the past six years.

The study's significance lies in bridging the gap between benchmark performance and real-world utility: current AI models excel at short tasks but struggle with complex, extended projects. Extrapolating the trend, the researchers predict that within a decade, AI agents might independently complete substantial software tasks that currently require days or weeks of human effort. While they acknowledge methodological limitations and potential measurement error, their sensitivity analyses suggest the trend is robust, with implications for AI development, forecasting, and risk management.
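
METR's headline metric is the task length at which a model's success rate reaches 50%, obtained by fitting a logistic curve of model success against (log) human completion time. The Python sketch below shows that kind of fit on made-up data; the task times, outcomes, and the use of scikit-learn are illustrative assumptions rather than METR's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up per-task data: how long each task takes a human expert (minutes),
# and whether the model completed it autonomously.
human_minutes = np.array([2, 5, 8, 15, 30, 60, 120, 240, 480, 960], dtype=float)
model_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Success probability is modeled as logistic in log2(human time), so it
# falls off smoothly as tasks get longer.
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_success)

# The 50%-success horizon is where the fitted logit crosses zero:
#   intercept + coef * log2(t) = 0  =>  t = 2^(-intercept / coef)
horizon = 2 ** (-clf.intercept_[0] / clf.coef_[0, 0])
print(f"estimated 50% time horizon: {horizon:.0f} minutes")
```

Repeating this kind of fit for models released over time, then plotting the resulting horizons on a log scale, is what yields the roughly 7-month doubling trend.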

Cited by 8 pages

Page                             | Type         | Quality
Long-Horizon Autonomous Tasks    | Capability   | 65.0
Epoch AI                         | Organization | 51.0
METR                             | Organization | 66.0
Capability Elicitation           | Approach     | 91.0
Dangerous Capability Evaluations | Approach     | 64.0
Scalable Eval Approaches         | Approach     | 65.0
Tool-Use Restrictions            | Approach     | 91.0
Emergent Capabilities            | Risk         | 61.0
Resource ID: 271fc5f73a8304b2 | Stable ID: NzE4Y2Q4ZT