AI Capability Threshold Model
A comprehensive framework mapping AI capabilities across five dimensions to specific risk thresholds. It finds authentication-collapse and mass-persuasion risks at 70-85% likelihood by 2027 and bioweapons-development risk at 40% by 2029, with critical thresholds estimated to arrive when models reach 50% on complex reasoning benchmarks and cross expert-level domain knowledge. The model provides concrete capability requirements, timeline projections, and early warning indicators across seven major risk categories, with extensive benchmark tracking.
Overview
Different AI risks require different capability levels to become dangerous. A system that can write convincing phishing emails poses different risks than one that can autonomously discover zero-day vulnerabilities. This model maps specific capability requirements to specific risks, helping predict when risks activate as capabilities improve.
The capability threshold model provides a structured framework for understanding how AI systems transition from relatively benign to potentially dangerous across multiple risk domains. Rather than treating AI capability as a single dimension or risks as uniformly dependent on general intelligence, this model recognizes that specific risks emerge when systems cross particular capability thresholds in relevant dimensions. According to the International AI Safety Report (October 2025), governance choices in 2025-2026 must internalize that capability scaling has decoupled from parameter count, meaning risk thresholds can be crossed between annual cycles.
Key findings include 15-25% benchmark performance indicating early risk emergence, 50% marking a qualitative shift to complex autonomous execution, and most critical thresholds estimated to cross between 2025 and 2029 across misuse, control, and structural risk categories. The Future of Life Institute's 2025 AI Safety Index reveals an industry struggling to keep pace with its own rapid capability advances: companies claim AGI is achievable within the decade, yet none scores above D in existential safety planning.
Risk Impact Assessment
| Risk Category | Severity | Likelihood (2025-2027) | Threshold Crossing Timeline | Trend |
|---|---|---|---|---|
| Authentication Collapse | Critical | 85% | 2025-2027 | ↗ Accelerating |
| Mass Persuasion | High | 70% | 2025-2026 | ↗ Accelerating |
| Cyberweapon Development | High | 65% | 2025-2027 | ↗ Steady |
| Bioweapons Development | Critical | 40% | 2026-2029 | → Uncertain |
| Situational Awareness | Critical | 60% | 2025-2027 | ↗ Accelerating |
| Economic Displacement | High | 80% | 2026-2030 | ↗ Steady |
| Strategic Deception | Extreme | 15% | 2027-2035+ | → Uncertain |
Capability Dimensions Framework
AI capabilities decompose into five distinct dimensions that progress at different rates. Understanding them separately is crucial because different risks require different combinations. According to Epoch AI's tracking, the training compute of frontier AI models has grown by roughly 4-5x per year since 2020, and the Epoch Capabilities Index shows the rate of frontier model improvement nearly doubled in 2024, from ~8 points/year to ~15 points/year.
```mermaid
flowchart TD
  subgraph DIMS["Capability Dimensions"]
    DK[Domain Knowledge] --> RISK
    RD[Reasoning Depth] --> RISK
    PH[Planning Horizon] --> RISK
    SM[Strategic Modeling] --> RISK
    AE[Autonomous Execution] --> RISK
  end
  subgraph RISK["Risk Activation Thresholds"]
    AUTH[Authentication Collapse<br/>Threshold: 2025-2027]
    BIO[Bioweapons Uplift<br/>Threshold: 2026-2029]
    CYBER[Cyberweapons<br/>Threshold: 2025-2027]
    SCHEME[Strategic Deception<br/>Threshold: 2027-2035+]
  end
  style AUTH fill:#ffcccc
  style BIO fill:#ffcccc
  style CYBER fill:#ffddcc
  style SCHEME fill:#ffe6cc
```

| Dimension | Level 1 | Level 2 | Level 3 | Level 4 | Current Frontier | Gap to Level 3 |
|---|---|---|---|---|---|---|
| Domain Knowledge | Undergraduate | Graduate | Expert | Superhuman | Expert- (some domains) | 0.5 levels |
| Reasoning Depth | Simple (2-3 steps) | Moderate (5-10) | Complex (20+) | Superhuman | Moderate+ | 0.5-1 level |
| Planning Horizon | Immediate | Short-term (hrs) | Medium (wks) | Long-term (months) | Short-term+ | 1 level |
| Strategic Modeling | None | Basic | Sophisticated | Superhuman | Basic+ | 1-1.5 levels |
| Autonomous Execution | None | Simple tasks | Complex tasks | Full autonomy | Simple-Complex | 0.5-1 level |
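To make the framework's gap arithmetic concrete, here is a minimal sketch assuming each dimension is scored on the ordinal 1-4 scale above (half-levels for the "+"/"-" entries). The level values and per-risk requirements are illustrative readings of the tables in this section, not measured data.

```python
# Illustrative sketch of the five-dimension capability framework.
# Level values (1-4, with half-level increments) mirror the ordinal scale above;
# the numbers below are illustrative readings of this section's tables.

FRONTIER_2025 = {
    "domain_knowledge": 2.5,      # "Expert-" in some domains
    "reasoning_depth": 2.5,       # "Moderate+"
    "planning_horizon": 2.0,      # short-term (hours)
    "strategic_modeling": 2.0,    # "Basic+"
    "autonomous_execution": 2.5,  # simple-to-complex tasks
}

# Hypothetical per-risk requirements, loosely following the risk-capability
# mapping tables later in this section.
RISK_REQUIREMENTS = {
    "authentication_collapse": {"domain_knowledge": 3.0, "autonomous_execution": 2.0},
    "bioweapons_uplift": {"domain_knowledge": 3.0, "reasoning_depth": 3.0,
                          "planning_horizon": 3.0, "autonomous_execution": 3.0},
}

def remaining_gap(current: dict, required: dict) -> dict:
    """Per-dimension shortfall (in levels) before a risk's threshold is met."""
    return {dim: max(0.0, need - current.get(dim, 0.0)) for dim, need in required.items()}

def threshold_crossed(current: dict, required: dict) -> bool:
    """A risk 'activates' only when every required dimension reaches its level."""
    return all(gap == 0.0 for gap in remaining_gap(current, required).values())

for risk, req in RISK_REQUIREMENTS.items():
    print(risk, remaining_gap(FRONTIER_2025, req), threshold_crossed(FRONTIER_2025, req))
```

The point of the sketch is the conjunction: a risk does not activate on overall capability but only when every required dimension clears its level, which is why gaps of 0.5-1 level in a single dimension can dominate the timeline.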
Domain Knowledge Benchmarks
Current measurement approaches show significant gaps in assessing practical domain expertise:
| Domain | Best Benchmark | Current Frontier Score | Expert Human Level | Assessment Quality |
|---|---|---|---|---|
| Biology | MMLU-Biology | 85-90% | ≈95% | Medium |
| Chemistry | ChemBench | 70-80% | ≈90% | Low |
| Computer Security | SecBench | 65-75% | ≈85% | Low |
| Psychology | MMLU-Psychology | 80-85% | ≈90% | Very Low |
| Medicine | MedQA | 85-90% | ≈95% | Medium |
Assessment quality reflects how well benchmarks capture practical expertise versus academic knowledge.
Reasoning Depth Progression
The ARC Prize 2024-2025 results demonstrate the critical threshold zone for complex reasoning. On ARC-AGI-1, OpenAI's o3-preview achieved 75.7% accuracy (near the human level of 98%), while on the harder ARC-AGI-2 benchmark, even advanced models score only single-digit percentages, yet humans can solve every task.
| Reasoning Level | Benchmark Examples | Current Performance | Risk Relevance |
|---|---|---|---|
| Simple (2-3 steps) | Basic math word problems | 95%+ | Low-risk applications |
| Moderate (5-10 steps) | GSM8K, multi-hop QA | 85-95% | Most current capabilities |
| Complex (20+ steps) | ARC-AGI, extended proofs | 30-75% (ARC-AGI-1), 5-55% (ARC-AGI-2) | Critical threshold zone |
| Superhuman | Novel mathematical proofs | <10% | Advanced risks |
Recent breakthrough (December 2025): Poetiq with GPT-5.2 X-High achieved 75% on ARC-AGI-2, surpassing the average human test-taker score of 60% for the first time and demonstrating rapid progress on complex reasoning tasks.
Risk-Capability Mapping
Near-Term Risks (2025-2027)
Authentication Collapse
The volume of deepfakes has grown explosively: Deloitte's 2024 analysis estimates growth from roughly 500,000 online deepfakes in 2023 to about 8 million in 2025, with annual growth nearing 900%. Voice cloning has crossed what experts call the "indistinguishable threshold": a few seconds of audio now suffice to generate a convincing clone.
| Capability | Required Level | Current Level | Gap | Evidence |
|---|---|---|---|---|
| Domain Knowledge (Media) | Expert | Expert- | 0.5 level | Sora quality approaching photorealism |
| Reasoning Depth | Moderate | Moderate | 0 levels | Current models handle multi-step generation |
| Strategic Modeling | Basic+ | Basic | 0.5 level | Limited theory of mind in current systems |
| Autonomous Execution | Simple | Simple | 0 levels | Already achieved for content generation |
Key Threshold Capabilities:
- Generate synthetic content indistinguishable from authentic across all modalities
- Real-time interactive video generation (NVIDIA Omniverse)
- Defeat detection systems designed to identify AI content
- Mimic individual styles from minimal samples
Detection Challenges: OpenAI's deepfake detection tool identifies DALL-E 3 images with 98.8% accuracy but flags only 5-10% of images from other AI tools. Multi-modal attacks combining deepfaked video, synthetic voices, and fabricated documents are increasing.
Current Status: OpenAI's Sora and Meta's Make-A-Video demonstrate near-threshold video generation. ElevenLabs achieves voice cloning from <30 seconds of audio.
Mass Persuasion Capabilities
| Capability | Required Level | Current Level | Gap | Evidence |
|---|---|---|---|---|
| Domain Knowledge (Psychology) | Graduate+ | Graduate | 0.5 level | Strong performance on psychology benchmarks |
| Strategic Modeling | Sophisticated | Basic+ | 1 level | Limited multi-agent reasoning |
| Planning Horizon | Medium-term | Short-term | 1 level | Cannot maintain campaigns over weeks |
| Autonomous Execution | Simple | Simple | 0 levels | Can generate content at scale |
Research Evidence:
- Anthropic (2024) reports that Claude 3 achieves 84% on psychology benchmarks
- A Stanford HAI study finds AI-generated content scores 82% higher on believability measures
- An MIT persuasion study demonstrates that automated A/B testing improves persuasion by 35%
Medium-Term Risks (2026-2029)
Bioweapons Development
| Capability | Required Level | Current Level | Gap | Assessment Source |
|---|---|---|---|---|
| Domain Knowledge (Biology) | Expert | Graduate+ | 1 level | RAND biosecurity assessment |
| Domain Knowledge (Chemistry) | Expert | Graduate | 1-2 levels | Limited synthesis knowledge |
| Reasoning Depth | Complex | Moderate+ | 1 level | Cannot handle 20+ step procedures |
| Planning Horizon | Medium-term | Short-term | 1 level | No multi-week experimental planning |
| Autonomous Execution | Complex | Simple+ | 1 level | Cannot troubleshoot failed experiments |
Critical Bottlenecks:
- Specialized synthesis knowledge for dangerous compounds
- Autonomous troubleshooting of complex laboratory procedures
- Multi-week experimental planning and adaptation
- Integration of theoretical knowledge with practical constraints
Expert Assessment: RAND Corporation (2024) estimates a 60% probability of crossing this threshold by 2028.
Economic Displacement Thresholds
McKinsey's research indicates that current technologies could, in theory, automate about 57% of U.S. work hours. By 2030, approximately 27% of current work hours in Europe and 30% in the United States could be automated. Workers in lower-wage jobs are up to 14 times more likely to need to change occupations than those in the highest-wage positions.
| Job Category | Automation Threshold | Current AI Capability | Estimated Timeline | Source |
|---|---|---|---|---|
| Content Writing | 70% task automation | 85% | Crossed 2024 | McKinsey AI Index |
| Code Generation | 60% task automation | 60-70% (SWE-bench Verified) | Crossed 2025 | SWE-bench leaderboard |
| Data Analysis | 75% task automation | 55% | 2026-2027 | Industry surveys |
| Customer Service | 80% task automation | 70% | 2025-2026 | Salesforce AI reports |
| Legal Research | 65% task automation | 40% | 2027-2028 | Legal industry analysis |
Coding Benchmark Update: The International AI Safety Report (October 2025) notes that coding capabilities have advanced particularly quickly. Top models now solve over 60% of problems in SWE-bench Verified, up from 40% in late 2024 and almost 0% at the beginning of 2024. However, Scale AI's SWE-Bench Pro shows a significant performance drop: even the best models (GPT-5, Claude Opus 4.1) score only 23% on harder, more realistic tasks.
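As a rough illustration of how the timeline column in the table above could be derived, the sketch below linearly extrapolates current task-automation capability toward each category's threshold. The 5-10 percentage-points-per-year improvement rate is an assumption for illustration, not a figure from McKinsey or the SWE-bench leaderboard.

```python
# Rough extrapolation of when task-automation capability might reach each
# category's threshold, assuming a constant improvement rate. The 5-10
# points-per-year range is an illustrative assumption, not a cited figure.

CATEGORIES = {
    # name: (automation threshold %, current capability %)
    "content_writing": (70, 85),
    "code_generation": (60, 65),
    "data_analysis": (75, 55),
    "customer_service": (80, 70),
    "legal_research": (65, 40),
}

def crossing_year(threshold: float, current: float,
                  rate_per_year: float, base_year: int = 2025) -> int:
    """Estimate the year capability reaches the threshold at a linear rate."""
    if current >= threshold:
        return base_year  # threshold already crossed
    years_needed = (threshold - current) / rate_per_year
    return base_year + int(round(years_needed))

for name, (threshold, current) in CATEGORIES.items():
    optimistic = crossing_year(threshold, current, rate_per_year=10)
    conservative = crossing_year(threshold, current, rate_per_year=5)
    print(f"{name}: {optimistic}-{conservative}")
```

The spread between the optimistic and conservative rates reproduces the kind of two-to-three-year uncertainty bands shown in the table; the real estimates also fold in adoption and cost factors that a linear capability extrapolation ignores.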
Long-Term Control Risks (2027-2035+)
Strategic Deception (Scheming)
| Capability | Required Level | Current Level | Gap | Uncertainty |
|---|---|---|---|---|
| Strategic Modeling | Superhuman | Basic+ | 2+ levels | Very High |
| Reasoning Depth | Complex | Moderate+ | 1 level | High |
| Planning Horizon | Long-term | Short-term | 2 levels | Very High |
| Situational Awareness | Expert | Basic | 2 levels | High |
Key Uncertainties:
- Whether sophisticated strategic modeling can emerge from current training approaches
- Detectability of strategic deception capabilities during evaluation
- Minimum capability level required for effective scheming
Research Evidence:
- Anthropic Constitutional AI research shows limited success in detecting deceptive behavior
- Redwood Research adversarial training reveals capabilities often hidden during evaluation
Current State & Trajectory
Capability Progress Rates
According to Epoch AI's analysis, training compute for frontier models grows 4-5x yearly, and their Epoch Capabilities Index shows frontier model improvement nearly doubled in 2024. METR's research shows that AI performance on task length has increased exponentially over the past six years, with a doubling time of around 7 months.
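To make the METR doubling-time figure concrete, here is a small worked example; the starting task-horizon value of one hour is a hypothetical placeholder for illustration, not a METR measurement.

```python
# Worked example of the ~7-month doubling time METR reports for the length of
# tasks frontier models can complete. The starting horizon (1 hour) is a
# hypothetical placeholder, not a METR measurement.

DOUBLING_TIME_MONTHS = 7.0

def horizon_after(months: float, start_horizon_hours: float = 1.0) -> float:
    """Task-length horizon after `months`, assuming exponential growth."""
    return start_horizon_hours * 2 ** (months / DOUBLING_TIME_MONTHS)

# If a model today handles ~1-hour tasks, the same trend implies:
for years in (1, 2, 3):
    print(f"+{years}y: ~{horizon_after(12 * years):.0f}-hour tasks")
# roughly 3-hour tasks after one year, 11-hour after two, 35-hour after three
```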
| Dimension | 2023-2024 Progress | Projected 2024-2025 | Key Drivers |
|---|---|---|---|
| Domain Knowledge | +0.5 levels | +0.3-0.7 levels | Larger training datasets, specialized fine-tuning |
| Reasoning Depth | +0.3 levels | +0.2-0.5 levels | Chain-of-thought improvements, tree search |
| Planning Horizon | +0.2 levels | +0.2-0.4 levels | Tool integration, memory systems |
| Strategic Modeling | +0.1 levels | +0.1-0.3 levels | Multi-agent training, RL improvements |
| Autonomous Execution | +0.4 levels | +0.3-0.6 levels | Tool use, real-world deployment |
Data Sources: Epoch AI capability tracking, industry benchmark results, expert elicitation.
Compute Scaling Projections
| Metric | Current (2025) | Projected 2027 | Projected 2030 | Source |
|---|---|---|---|---|
| Models above 10^26 FLOP | ≈5-10 | ≈30 | ≈200+ | Epoch AI model counts |
| Largest training run power | 1-2 GW | 2-4 GW | 4-16 GW | Epoch AI power analysis |
| Frontier model training cost | $100M-500M | $100M-1B+ | $1-5B | Epoch AI cost projections |
| Open-weight capability lag | 6-12 months | 6-12 months | 6-12 months | Epoch AI consumer GPU analysis |
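The projections in the table follow from a simple compound-growth relation; a minimal sketch, assuming the 4-5x annual training-compute growth simply carries forward unchanged. The 2025 baseline of 5e26 FLOP is an illustrative placeholder, not an Epoch AI figure.

```python
# Compound-growth sketch behind the compute projections above, assuming the
# 4-5x/year training-compute growth Epoch AI reports continues unchanged.
# The 2025 baseline of 5e26 FLOP is an illustrative placeholder.

BASELINE_FLOP_2025 = 5e26

def projected_frontier_compute(year: int, growth_per_year: float) -> float:
    """Frontier training compute in FLOP under constant multiplicative growth."""
    return BASELINE_FLOP_2025 * growth_per_year ** (year - 2025)

for year in (2027, 2030):
    low = projected_frontier_compute(year, 4.0)
    high = projected_frontier_compute(year, 5.0)
    print(f"{year}: {low:.1e} - {high:.1e} FLOP")
# roughly 8e27-1.3e28 FLOP by 2027 and 5e29-1.6e30 FLOP by 2030
```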
Leading Organizations
| Organization | Strongest Capabilities | Estimated Timeline to Next Threshold | Focus Area |
|---|---|---|---|
| OpenAI | Domain knowledge, autonomous execution | 12-18 months | General capabilities |
| Anthropic | Reasoning depth, strategic modeling | 18-24 months | Safety-focused development |
| DeepMind | Strategic modeling, planning | 18-30 months | Scientific applications |
| Meta | Multimodal generation | 6-12 months | Social/media applications |
Key Uncertainties & Research Cruxes
Measurement Validity
The Berkeley CLTC Working Paper on Intolerable Risk Thresholds notes that models effectively more capable than the latest tested model (4x or more in Effective Compute, or 6 months' worth of fine-tuning) require comprehensive assessment, including threat model mapping, empirical capability tests, elicitation testing without safety mechanisms, and likelihood forecasting.
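A minimal sketch of how that reassessment trigger could be operationalized follows. The record fields and the reduction of "effective compute" to a single ratio are simplifications for illustration, not the paper's own formalism.

```python
# Hypothetical encoding of the Berkeley CLTC reassessment trigger: a model that
# is >=4x the last-assessed model in effective compute, or that has received
# >=6 months of additional fine-tuning, warrants comprehensive assessment.
# Field names and the single-ratio notion of "effective compute" are simplifications.

from dataclasses import dataclass

@dataclass
class ModelRecord:
    effective_compute: float        # relative effective-compute score
    finetune_months_since_eval: float

def requires_comprehensive_assessment(candidate: ModelRecord,
                                      last_assessed: ModelRecord) -> bool:
    compute_jump = candidate.effective_compute / last_assessed.effective_compute >= 4.0
    finetune_drift = candidate.finetune_months_since_eval >= 6.0
    return compute_jump or finetune_drift

baseline = ModelRecord(effective_compute=1.0, finetune_months_since_eval=0.0)
new_model = ModelRecord(effective_compute=4.5, finetune_months_since_eval=2.0)
print(requires_comprehensive_assessment(new_model, baseline))  # True
```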
An interdisciplinary review of AI evaluation highlights the "benchmark lottery" problem: researchers at Google's Brain Team found that many factors other than fundamental algorithmic superiority may lead to a method being perceived as superior. Ironically, a majority of influential benchmarks have been released without rigorous peer review.
| Uncertainty | Impact if True | Impact if False | Current Evidence |
|---|---|---|---|
| Current benchmarks accurately measure risk-relevant capabilities | Can trust threshold predictions | Need fundamentally new evaluations | Mixed - good for some domains, poor for others |
| Practical capabilities match benchmark performance | Smooth transition from lab to deployment | Significant capability overhangs | Substantial gaps observed in real-world deployment |
| Capability improvements follow predictable scaling laws | Reliable timeline forecasting possible | Threshold crossings may surprise | Scaling laws hold for some capabilities, not others |
Threshold Sharpness
Sharp Threshold Evidence:
- Authentication systems: Detection accuracy drops from 95% to 15% once generation quality crosses the threshold
- Economic viability: McKinsey automation analysis shows 10-20% capability improvements create a 50-80% cost advantage in many tasks
- Security vulnerabilities: Most exploits require complete capability to work at all
Gradual Scaling Evidence:
- Job displacement: Different tasks within roles automate at different rates
- Persuasion effectiveness: Incremental improvements in messaging quality yield incremental persuasion gains
- Domain expertise: Knowledge accumulation appears continuous rather than threshold-based
Strategic Deception Detection
Critical unsolved problems in capability assessment:
| Challenge | Current Approach | Limitation | Research Need |
|---|---|---|---|
| Detecting hidden capabilities | Red-teaming, adversarial prompts | Systems may hide abilities during evaluation | Novel evaluation paradigms |
| Measuring situational awareness | Direct questioning, inference tests | Pattern matching vs. genuine understanding | Causal intervention studies |
| Long-term planning assessment | Multi-step benchmarks | Maximum tested horizon ~hours | Extended evaluation infrastructure |
| Strategic modeling depth | Theory of mind tasks | Unclear if results generalize | Ecological validity research |
Timeline Projections & Warning Indicators
2025 Critical Thresholds
High Probability (70%+):
- Authentication Collapse: Real-time deepfakes become commercially viable
- Content Generation Saturation: Human-level quality across all text/image modalities
Medium Probability (40-70%):
- Code Automation: 50%+ of software engineering tasks automated
- Basic Situational Awareness: Systems understand evaluation vs. deployment contexts
2026-2027 Medium Probability Events
| Risk | Probability | Key Indicators to Monitor |
|---|---|---|
| Advanced Cyberweapons | 65% | METR evaluations crossing 40% threshold |
| Economic Disruption | 80% | GitHub Copilot achieving >60% task completion |
| Bioweapons Uplift | 40% | Expert biology benchmarks >90%, synthesis knowledge tests |
| Oversight Evasion | 60% | Detection of training/deployment context awareness |
Early Warning System
Red Flag Indicators:
- Sudden benchmark improvements >20 percentage points (see the monitoring sketch after this list)
- Systems developing capabilities not explicitly trained for
- Gap between capability and safety evaluation results widening
- Evidence of strategic behavior during evaluation
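A minimal sketch of how the first red-flag check might be automated against a benchmark history; the threshold mirrors the list above, and the history data format and example scores are hypothetical.

```python
# Sketch of an automated check for the first red flag above: a jump of more
# than 20 percentage points between successive benchmark results.
# The benchmark-history format and scores are hypothetical example data.

JUMP_THRESHOLD_POINTS = 20.0

def sudden_jumps(history: dict[str, list[float]]) -> dict[str, float]:
    """Return benchmarks whose latest score jumped by more than the threshold."""
    flagged = {}
    for benchmark, scores in history.items():
        if len(scores) >= 2 and scores[-1] - scores[-2] > JUMP_THRESHOLD_POINTS:
            flagged[benchmark] = scores[-1] - scores[-2]
    return flagged

example_history = {
    "swe_bench_verified": [40.0, 48.0, 62.0],   # +14 points: below threshold
    "arc_agi_2": [4.0, 9.0, 54.0],              # +45 points: red flag
}
print(sudden_jumps(example_history))  # {'arc_agi_2': 45.0}
```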
Monitoring Infrastructure:
- METR dangerous capability evaluations
- MIRI alignment evaluation protocols
- Industry responsible scaling policies (OpenAI Preparedness, Anthropic RSP)
- Academic capability forecasting (Epoch AI)
The METR Common Elements Report (December 2025) describes how each major AI developer's policy uses capability thresholds for biological weapons development, cyberattacks, autonomous replication, and automated AI R&D, with commitments to conduct model evaluations assessing whether models are approaching thresholds that could enable severe harm.
Expert Survey Findings
An OECD-affiliated survey on AI thresholds found that experts agreed that, if training compute thresholds are exceeded, AI companies should:
- Conduct additional risk assessments (e.g., via model evaluations)
- Notify an independent public body (e.g., EU AI Office, FTC, or AI Safety Institute)
- Notify the government
Participants noted that risk assessment frameworks from safety-critical industries (nuclear, maritime, aviation, healthcare, finance, space) provide valuable precedent for AI governance.
Sources & Resources
Primary Research
| Source | Type | Key Findings | Relevance |
|---|---|---|---|
| Anthropic Responsible Scaling Policy | Industry Policy | Defines capability thresholds for safety measures | Framework implementation |
| OpenAI Preparedness Framework | Industry Policy | Risk assessment methodology | Threshold identification |
| METR Dangerous Capability Evaluations | Research | Systematic capability testing | Current capability baselines |
| Epoch AI Capability Forecasts | Research | Timeline predictions for AI milestones | Forecasting methodology |
Government & Policy
| Organization | Resource | Focus |
|---|---|---|
| NIST AI Risk Management Framework | US Government | Risk assessment standards |
| UK AISI Research | UK Government | Model evaluation protocols |
| EU AI Office | EU Government | Regulatory frameworks |
| RAND Corporation AI Studies | Think Tank | National security implications |
Technical Benchmarks & Evaluation
| Benchmark | Domain | Current Frontier Score (Dec 2025) | Threshold Relevance |
|---|---|---|---|
| MMLU | General Knowledge | 85-90% | Domain expertise baseline |
| ARC-AGI-1 | Abstract Reasoning | 75-87% (o3-preview) | Complex reasoning threshold |
| ARC-AGI-2 | Abstract Reasoning | 54-75% (GPT-5.2) | Next-gen reasoning threshold |
| SWE-bench Verified | Software Engineering | 60-70% | Autonomous code execution |
| SWE-bench Pro | Real-world Coding | 17-23% | Generalization to novel code |
| MATH | Mathematical Reasoning | 60-80% | Multi-step reasoning |
Risk Assessment Research
| Research Area | Key Papers | Organizations |
|---|---|---|
| Bioweapons Risk | RAND Biosecurity Assessment | RAND, Johns Hopkins, CNAS |
| Economic Displacement | McKinsey AI Impact | McKinsey, Brookings Institution |
| Authentication Collapse | Deepfake Detection Challenges | UC Berkeley, MIT |
| Strategic Deception | Constitutional AI Research | Anthropic, Redwood Research |
Additional Sources
| Source | Type | Key Finding |
|---|---|---|
| International AI Safety Report (Oct 2025)↗ | Government | Risk thresholds can be crossed between annual cycles due to post-training/inference advances |
| Future of Life Institute AI Safety Index 2025↗ | NGO | Industry fundamentally unprepared; Anthropic leads (C+) but none score above D in existential safety |
| Berkeley CLTC Intolerable Risk Thresholds↗ | Academic | Models 4x+ more capable require comprehensive risk assessment |
| METR Common Elements Report (Dec 2025)↗ | Research | All major labs use capability thresholds for bio, cyber, replication, AI R&D |
| ARC Prize 2025 Results↗ | Academic | First AI system (Poetiq/GPT-5.2) exceeds human average on ARC-AGI-2 reasoning |
| Epoch AI Compute Trends↗ | Research | Training compute grows 4-5x yearly; capability improvement doubled in 2024 |
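As a rough back-of-the-envelope illustration of how two of these findings interact: if training compute grows 4-5x per year (the Epoch AI trend) and a comprehensive risk assessment is triggered once a model is 4x or more capable than the last assessed one (the Berkeley CLTC threshold), then under the simplifying, and contestable, assumption that capability scales one-for-one with compute, the trigger would recur roughly every 10-12 months. The short Python sketch below just performs that arithmetic; the function name and the one-for-one scaling assumption are introduced here for illustration and do not come from either source.

```python
import math

def years_to_reach(multiple: float, annual_growth_factor: float) -> float:
    """Years for a quantity growing by `annual_growth_factor` per year to increase by `multiple`."""
    return math.log(multiple) / math.log(annual_growth_factor)

# Assumption (illustration only): capability tracks training compute one-for-one,
# which overstates how directly compute translates into capability.
for growth in (4.0, 5.0):
    years = years_to_reach(4.0, growth)
    print(f"At {growth:.0f}x/year compute growth, a 4x capability multiple takes "
          f"~{years:.2f} years (~{years * 12:.0f} months)")
```

In practice the interval would likely differ, since capability gains per unit of compute are uneven across domains and do not scale one-for-one.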
References
OpenAI is a leading AI research and deployment company focused on building advanced AI systems, including GPT and o-series models, with a stated mission of ensuring artificial general intelligence (AGI) benefits all of humanity. The homepage serves as a gateway to their research, products, and policy work spanning capabilities and safety.
2. [2009.03300] Measuring Massive Multitask Language Understanding. Dan Hendrycks et al., arXiv, 2020.
Introduces the MMLU benchmark, a comprehensive evaluation suite covering 57 subjects across STEM, humanities, social sciences, and more, designed to measure breadth and depth of language model knowledge. The benchmark tests models from elementary to professional level and reveals significant gaps between human expert performance and state-of-the-art models at the time of publication. It became a standard benchmark for tracking LLM capability progress.
3. [2310.09049] SAI: Solving AI Tasks with Systematic Artificial Intelligence in Communication Network. Lei Yao, Yong Zhang, Zilong Yan & Jialu Tian, arXiv, 2023.
This paper proposes SAI, a systematic AI framework for solving diverse AI tasks in communication networks by integrating large language models with structured reasoning approaches. It addresses how LLMs can be applied to network management and optimization problems through systematic decomposition of complex communication tasks. The work explores capability thresholds and risk assessment for AI deployment in critical network infrastructure.
Epoch AI analyzes how many AI models would fall above various compute thresholds (measured in FLOPs), providing empirical projections relevant to governance frameworks that use compute as a regulatory trigger. The analysis helps policymakers and researchers understand the practical scope and selectivity of compute-based oversight mechanisms.
Google DeepMind is a leading AI research laboratory combining the former DeepMind and Google Brain teams, focused on developing advanced AI systems and conducting research across capabilities, safety, and applications. The organization is one of the most influential labs in AI development, working on frontier models including Gemini and publishing widely-cited safety and capabilities research.
This RAND Corporation research report examines the risk of AI systems providing meaningful uplift to actors seeking to develop biological weapons, focusing on how to assess capability thresholds and decompose the problem for evaluation purposes. It likely provides a framework for analyzing when AI crosses dangerous capability boundaries in the bioweapons domain and how to structure risk assessments accordingly.
This page outlines the European Commission's comprehensive policy framework for AI, centered on promoting trustworthy, human-centric AI through the AI Act, AI Continent Action Plan, and Apply AI Strategy. It aims to balance Europe's global AI competitiveness with safety, fundamental rights, and democratic values. Key initiatives include AI Factories, the InvestAI Facility, GenAI4EU, and the Apply AI Alliance.
Epoch AI is a research organization focused on investigating and forecasting trends in artificial intelligence, particularly around compute, training data, and algorithmic progress. They produce empirical analyses and datasets to inform understanding of AI development trajectories and support better decision-making in AI governance and safety.
This OECD-affiliated survey examines how thresholds and capability benchmarks should be defined and applied to advanced AI systems for governance and risk management purposes. It likely synthesizes expert views on identifying dangerous capability levels and triggering regulatory or safety interventions. The work is relevant to policymakers and AI developers designing evaluation frameworks for frontier models.
Deloitte's 2024 analysis frames deepfakes as a cybersecurity-scale threat to online trust, projecting the deepfake detection market will grow 42% annually from $5.5B in 2023 to $15.7B by 2026. The report draws parallels to cybersecurity spending trajectories and highlights that costs of maintaining content authenticity will likely be distributed across consumers, creators, and advertisers. Consumer surveys reveal widespread skepticism and demand for standardized AI content labeling.
Meta's official AI homepage showcases their broad research and product portfolio including Llama 4 (large language models), Segment Anything Model 3 (computer vision), V-JEPA 2 (world models), and AI glasses hardware. The company organizes its AI work around four research pillars: Communication & Language, Embodiment & Actions, Alignment, and Core Learning & Reasoning. Meta emphasizes open-source development and practical deployment at scale.
ARC-AGI-2 is a 2025 benchmark designed to stress-test AI reasoning systems, where pure LLMs score 0% and frontier reasoning systems achieve only single-digit percentages despite humans solving all tasks. It targets three core capability gaps—symbolic interpretation, compositional reasoning, and contextual rule application—demonstrating that scaling alone is insufficient and new architectures or test-time adaptation methods are required.
Sora is OpenAI's text-to-video generation model and app that converts text prompts or images into videos with high realism, including automatic sound. It supports features like character casting, remixing, and multiple visual styles including cinematic and photorealistic.
Anthropic introduces its Responsible Scaling Policy (RSP), a framework of technical and organizational protocols for managing catastrophic risks as AI systems become more capable. The policy defines AI Safety Levels (ASL-1 through ASL-5+), modeled after biosafety level standards, requiring increasingly strict safety, security, and operational measures tied to a model's potential for catastrophic risk. Current Claude models are classified ASL-2, with ASL-3 and beyond triggering stricter deployment and security requirements.
A McKinsey Global Institute report examining how AI agents and robotics are reshaping labor markets and workforce skills. The report reportedly finds that 57% of workers may need to develop new skill partnerships with AI systems, analyzing how human-AI collaboration will transform job roles and economic productivity.
Redwood Research is a nonprofit AI safety organization that pioneered the 'AI control' research agenda, focusing on preventing intentional subversion by misaligned AI systems. Their key contributions include the ICML paper on AI Control protocols, the Alignment Faking demonstration (with Anthropic), and consulting work with governments and AI labs on misalignment risk mitigation.
SWE-bench is a benchmark and leaderboard platform for evaluating AI models on real-world software engineering tasks, particularly resolving GitHub issues in open-source Python repositories. It offers multiple dataset variants (Lite, Verified, Multimodal) and standardized metrics to compare coding agents. It has become a widely-used standard for assessing the practical software engineering capabilities of LLM-based agents.
METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. They have developed the 'Time Horizon' metric measuring how long AI agents can autonomously complete software tasks, showing exponential growth over recent years. They work with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.
The NIST AI RMF is a voluntary, consensus-driven framework released in January 2023 to help organizations identify, assess, and manage risks associated with AI systems while promoting trustworthiness across design, development, deployment, and evaluation. It provides structured guidance organized around core functions and is accompanied by a Playbook, Roadmap, and a Generative AI Profile (2024) addressing risks specific to generative AI systems.
GitHub Copilot is an AI-powered coding assistant that integrates into IDEs, terminals, and GitHub workflows to assist developers with code completion, autonomous agent-based coding tasks, and project management. It supports multiple LLMs and allows assignment of coding tasks to AI agents that can autonomously write code and create pull requests.
Epoch AI analyzes how consumer GPUs like the RTX 5090 can run open-weight models that match frontier LLM performance from 6-12 months prior. The analysis tracks this gap across multiple benchmarks (GPQA Diamond, MMLU-Pro, LMArena) and finds the democratization trend is driven by open-weight scaling, model distillation, and GPU progress.
ElevenLabs is a leading AI voice technology platform offering text-to-speech, voice cloning, speech-to-text, and AI agent capabilities across 70+ languages. It serves enterprises, creators, and developers with tools for synthetic voice generation and audio content creation. The platform represents a prominent example of advanced synthetic media technology with significant implications for deepfakes, identity fraud, and information integrity.
This paper proposes a curriculum learning approach for post-disaster analytics using multimodal deep learning models that jointly process images and text. The authors introduce Dynamic Task and Weight Prioritization (DATWEP), a novel gradient-based curriculum learning method that automatically determines task difficulty during training without manual specification. The approach combines U-Net for semantic segmentation, image encoding, and a custom text classifier for visual question answering, evaluated on the FloodNet dataset for flood damage assessment.
McKinsey's annual survey-based report tracking enterprise AI adoption, investment trends, and emerging risks across industries. The report provides quantitative benchmarks on how organizations are deploying AI, including generative AI, and what governance and risk management practices they are implementing.
A focused interim update to the International AI Safety Report, chaired by Yoshua Bengio, covering significant developments in AI capabilities and their risk implications between full annual editions. The report is produced by an international panel of experts from over 30 countries and aims to keep policymakers and researchers current on fast-moving AI developments. It serves as an authoritative, consensus-oriented reference for AI safety governance.
This resource surveys leading AI-powered deepfake detection tools available in 2025, including OpenAI's detection tool, evaluating their capabilities for identifying synthetically generated media. It serves as a practical reference for organizations and researchers assessing defenses against AI-generated disinformation and identity fraud. The piece highlights the growing ecosystem of countermeasures to synthetic media threats.
Epoch AI analyzes historical trends in the training compute used for frontier AI models, finding that compute has grown approximately 4-5x per year. This rapid scaling has significant implications for AI capabilities trajectories, resource requirements, and safety planning horizons.
The UK AI Safety Institute (recently rebranded as the AI Security Institute) is a government body under the Department for Science, Innovation and Technology focused on minimizing risks from rapid and unexpected AI advances. It conducts and publishes safety research, international coordination reports, and policy guidance, while managing grants for systemic AI safety research.
Anthropic's official model card for the Claude 3 family (Haiku, Sonnet, Opus), documenting capability evaluations, safety assessments, and alignment properties. It covers frontier model benchmarks, red-teaming results, and responsible scaling policy (RSP) threshold evaluations for biological, chemical, and other catastrophic risks. The document represents Anthropic's public transparency effort around deploying a state-of-the-art AI system.
MIRI is a nonprofit research organization focused on ensuring that advanced AI systems are safe and beneficial. It conducts technical research on the mathematical foundations of AI alignment, aiming to solve core theoretical problems before transformative AI is developed. MIRI is one of the pioneering organizations in the AI safety field.
METR presents updated capability evaluations for Claude Sonnet and OpenAI o1 models, assessing whether these frontier AI systems approach autonomy thresholds relevant to safety and deployment decisions. The evaluations focus on task autonomy and the potential for models to pose novel risks as their capabilities scale.
Meta's Make-A-Video is an AI system that generates short video clips from text descriptions, images, or existing videos. It extends text-to-image generation techniques into the temporal domain, enabling creation of realistic and imaginative video content from natural language prompts. The system represents a significant capability milestone in generative AI for multimedia content.
OpenAI's Preparedness initiative outlines a framework for tracking, evaluating, and mitigating catastrophic risks from frontier AI models. It establishes risk thresholds across categories like cybersecurity, CBRN threats, and persuasion, and defines safety standards that must be met before model deployment.
Epoch AI analyzes the rapidly growing electricity demands of training frontier AI models, examining trends in power consumption, infrastructure constraints, and implications for AI development trajectories. The analysis quantifies how compute scaling translates into energy requirements and identifies key bottlenecks in power availability that may shape the pace of AI progress.
35. [2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset. Dan Hendrycks et al., arXiv, 2021.
This paper introduces MATH, a benchmark of 12,500 competition mathematics problems with step-by-step solutions, revealing that large Transformer models achieve surprisingly low accuracy and that scaling alone is insufficient for mathematical reasoning. The authors also release an auxiliary pretraining dataset to aid mathematical learning. The work highlights a fundamental gap between current scaling trends and genuine mathematical reasoning ability.
36. MIT persuasion study. G. Spitale, N. Biller-Andorno & Federico Germani, Science (peer-reviewed), 2023.
This MIT study examined whether humans can distinguish between accurate and false information in tweets, and whether they can identify AI-generated content from GPT-3 versus human-written tweets. With 697 participants, researchers found that GPT-3 presents a dual challenge: it can produce accurate, easily understandable information but also generates more compelling disinformation. Critically, humans cannot reliably distinguish between GPT-3-generated and human-written tweets, raising significant concerns about AI's potential to spread disinformation during an infodemic.
A Stanford HAI study examines how people respond to messages they believe are generated by AI versus humans, finding that individuals tend to place higher credibility or trust in AI-generated content. This has significant implications for misinformation, persuasion, and the societal risks of AI-generated communication at scale.
Scale AI introduces SWE-Bench Pro, an enhanced version of the SWE-Bench coding benchmark designed to address limitations in existing evaluations of AI software engineering capabilities. The benchmark aims to provide more reliable and contamination-resistant assessments of AI systems' ability to solve real-world software engineering tasks. This work is relevant to tracking AI capability thresholds in code generation and autonomous software development.
Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its family of AI assistants, with a stated mission of responsible development and maintenance of advanced AI for long-term human benefit.
Epoch AI's trends page provides data-driven tracking of key metrics in AI development, including compute scaling, model capabilities, and training trends. It serves as a quantitative reference for understanding the trajectory of AI progress across multiple dimensions. The resource aggregates empirical data to help researchers and policymakers assess the pace and direction of AI advancement.
Salesforce reports on AI adoption trends in customer service, highlighting how businesses are deploying AI tools to automate interactions, improve efficiency, and manage customer relationships. The report provides industry data on AI usage patterns and emerging capabilities in enterprise customer service contexts.
This interdisciplinary meta-review of ~100 studies examines critical shortcomings in quantitative AI benchmarking practices over the past decade. The paper identifies fine-grained technical issues (dataset biases, data contamination, inadequate documentation) alongside broader sociotechnical problems (overemphasis on text-based single-test evaluation, failure to account for multimodal and interactive AI systems). The authors highlight systemic flaws including misaligned incentives, construct validity issues, and gaming of results, arguing that benchmarking practices are shaped by commercial and competitive dynamics that often prioritize performance metrics over societal concerns. The review challenges the disproportionate trust placed in benchmarks and advocates for improved accountability and real-world relevance in AI evaluation.
METR analyzes the safety policies of 12 frontier AI companies to identify common elements, commitments, and gaps in how organizations approach responsible deployment of advanced AI systems. The analysis synthesizes patterns across responsible scaling policies, model cards, and safety frameworks to provide a comparative overview of industry norms. It serves as a reference for understanding where consensus exists and where significant variation or absence of commitments remains.
RAND Corporation's AI research hub covers policy, national security, and governance implications of artificial intelligence. It aggregates reports, analyses, and commentary on AI risks, military applications, and regulatory frameworks from one of the leading U.S. defense and policy think tanks.
This Berkeley Center for Long-Term Cybersecurity working paper examines how to define and operationalize 'intolerable risk' thresholds for AI systems, providing a framework for identifying which AI capabilities or behaviors should be categorically prohibited or constrained. It contributes to the growing policy and technical discourse around AI red lines and safety limits.
46. [2009.13081] What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. Di Jin et al., arXiv, 2020.
MedQA is the first free-form multiple-choice open domain question answering dataset for medical problems, sourced from professional medical board exams across three languages: English, simplified Chinese, and traditional Chinese, containing 12,723, 34,251, and 14,123 questions respectively. The authors implement both rule-based and neural methods combining document retrieval and machine comprehension, finding that current best approaches achieve only 36.7%, 42.0%, and 70.1% test accuracy on English, traditional Chinese, and simplified Chinese questions respectively, demonstrating significant challenges for existing OpenQA systems.
The Future of Life Institute's AI Safety Index Summer 2025 systematically evaluates leading AI companies on safety practices, finding widespread deficiencies across risk management, transparency, and existential safety planning. Anthropic receives the highest grade of C+, indicating that even the best-performing company falls significantly short of adequate safety standards. The report serves as a comparative benchmark for industry accountability.
This paper argues that current AI benchmarking practices, which measure skill at specific tasks like games, fail to capture true intelligence because skill can be artificially inflated through prior knowledge and training data. The authors propose a formal definition of intelligence based on Algorithmic Information Theory, conceptualizing it as skill-acquisition efficiency across diverse tasks. They introduce the Abstraction and Reasoning Corpus (ARC), a benchmark designed with human-like priors to enable fair comparisons of general fluid intelligence between AI systems and humans, addressing the need for appropriate feedback signals in developing more intelligent and human-like artificial systems.
GSM8K is a benchmark dataset of 8.5K high-quality grade school math word problems designed to evaluate multi-step mathematical reasoning in language models. The paper demonstrates that state-of-the-art transformer models struggle with this conceptually simple task. To address this limitation, the authors propose a verification approach where multiple candidate solutions are generated and ranked by a trained verifier, showing that verification significantly improves performance and scales more effectively than finetuning baselines.
Comprehensive analysis of the ARC Prize competition results for 2024-2025, evaluating AI systems' performance on the Abstraction and Reasoning Corpus (ARC) benchmark designed to test general fluid intelligence. The results provide insight into the current state of AI reasoning capabilities and how close frontier models are to human-level performance on novel problem-solving tasks.
SecBench appears to be a GitHub organization focused on security benchmarking, likely providing standardized evaluation frameworks for assessing AI or software security capabilities. It aims to establish measurable thresholds and risk assessment criteria for security-related tasks. The project likely offers tools or datasets for evaluating security-relevant AI capabilities.
NVIDIA Omniverse is a platform for building and operating metaverse applications, enabling real-time 3D simulation, collaboration, and digital twin creation. It provides tools for connecting and simulating physically accurate virtual worlds used in robotics, autonomous vehicles, and industrial applications. The platform is increasingly relevant to AI development as a simulation environment for training and testing AI systems.