Responsible Scaling Policies
Comprehensive analysis of Responsible Scaling Policies showing 20 companies with published frameworks as of Dec 2025, with SaferAI grading major policies 1.9-2.2/5 for specificity. Evidence suggests moderate effectiveness hindered by voluntary nature, competitive pressure among 3+ labs, and ~7-month capability doubling potentially outpacing evaluation science, though third-party verification (METR evaluated 5+ models) and Seoul Summit commitments (16 signatories) represent meaningful coordination progress.
Overview
Responsible Scaling Policies (RSPs) are self-imposed commitments by AI labs to tie AI development to safety progress. The core idea is simple: before scaling to more capable systems, labs commit to demonstrating that their safety measures are adequate for the risks those systems would pose. If evaluations reveal dangerous capabilities without adequate safeguards, development should pause until safety catches up.
Anthropic introduced the first RSP in September 2023, establishing "AI Safety Levels" (ASL-1 through ASL-4+) analogous to biosafety levels. OpenAI followed with its Preparedness Framework in December 2023, and Google DeepMind published its Frontier Safety Framework in May 2024. By late 2024, twelve major AI companies had published some form of frontier AI safety policy, and the Seoul Summit secured voluntary commitments from sixteen companies.
RSPs represent a significant governance innovation because they create a mechanism for safety-capability coupling without requiring external regulation. As of December 2025, 20 companies have published frontier AI safety policies, up from 12 at the May 2024 Seoul Summit. Third-party evaluators like METR have conducted pre-deployment assessments of 5+ major models. However, RSPs face fundamental challenges: they are entirely voluntary with no legal enforcement; labs set their own thresholds (earning SaferAI grades of only 1.9-2.2 out of 5); competitive pressure among three or more frontier labs creates incentives to interpret policies permissively; and capability doubling times of approximately seven months may outpace evaluation science.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Adoption Rate | High | 20 companies with published policies as of Dec 2025; 16 original Seoul signatories |
| Third-Party Verification | Growing | METR evaluated GPT-4.5, Claude 3.5, o3/o4-mini; UK/US AISIs conducting evaluations |
| Threshold Specificity | Medium-Low | SaferAI grade: dropped from 2.2 to 1.9 after Oct 2024 RSP update |
| Compliance Track Record | Mixed | Anthropic self-reported evaluations 3 days late; no major policy violations yet documented |
| Enforcement Mechanism | None | Entirely voluntary; no legal penalties for non-compliance |
| Competitive Pressure Risk | High | Racing dynamics incentivize permissive interpretation; 3+ major labs competing |
| Evaluation Coverage | Partial | 12 of 20 companies with published policies have external eval arrangements |
Risk Assessment & Impact
| Dimension | Rating | Assessment |
|---|---|---|
| Safety Uplift | Medium | Creates tripwires; effectiveness depends on follow-through |
| Capability Uplift | Neutral | Not capability-focused |
| Net World Safety | Helpful | Better than nothing; implementation uncertain |
| Lab Incentive | Moderate | PR value; may become required; some genuine commitment |
| Scalability | Unknown | Depends on whether commitments are honored |
| Deception Robustness | Partial | External policy; but evals could be fooled |
| SI Readiness | Unlikely | Pre-SI intervention; can't constrain SI itself |
Research Investment
| Dimension | Estimate | Source |
|---|---|---|
| Lab Policy Team Size | 5-20 FTEs per major lab | Industry estimates |
| External Policy Orgs | $5-15M/yr combined | METR, Apollo, policy institutes |
| Government Evaluation | $20-50M/yr | UK AISI (≈$100M budget), US AISI |
| Total Ecosystem | $50-100M/yr | Cross-sector estimate |
- Recommendation: Increase 3-5x (needs enforcement mechanisms and external verification capacity)
- Differential Progress: Safety-dominant (pure governance; no capability benefit)
Comparison of Major Scaling Policies
The three leading frontier AI labs have published distinct but conceptually similar frameworks. All share the core structure of capability thresholds triggering escalating safeguards, but differ in specificity, governance, and scope.
Policy Framework Comparison
| Aspect | Anthropic RSP | OpenAI Preparedness | DeepMind FSF |
|---|---|---|---|
| First Published | September 2023 | December 2023 | May 2024 |
| Current Version | v2.2 (May 2025) | v2.0 (April 2025) | v3.0 (October 2025) |
| Level Structure | ASL-1 through ASL-4+ | High / Critical | CCL-1 through CCL-4+ |
| Risk Domains | CBRN, AI R&D, Autonomy | Bio/Chem, Cyber, Self-improvement | Autonomy, Bio, Cyber, ML R&D, Manipulation |
| Governance Body | Responsible Scaling Officer | Safety Advisory Group (SAG) | Frontier Safety Team |
| Third-Party Evals | METR, UK AISI | METR, UK AISI | Internal primarily |
| Pause Commitment | Explicit if safeguards insufficient | Implicit (must have safeguards) | Explicit for CCL thresholds |
| Board Override | Board can override RSO | SAG advises; leadership decides | Not specified |
Capability Threshold Definitions
| Lab | CBRN Threshold | Cyber Threshold | Autonomy/AI R&D Threshold |
|---|---|---|---|
| Anthropic ASL-3 | "Significantly enhances capabilities of non-state actors" beyond publicly available info | Autonomous cyberattacks on hardened targets | "Substantially accelerates" AI R&D timeline |
| OpenAI High | "Meaningful counterfactual assistance to novice actors" creating known threats | "New risks of scaled cyberattacks" | Self-improvement creating "new challenges for human control" |
| OpenAI Critical | "Unprecedented new pathways to severe harm" | Novel attack vectors at scale | Recursive self-improvement; 5x speed improvement |
| DeepMind CCL | "Heightened risk of severe harm" from bio capabilities | "Sophisticated cyber capabilities" | "Exceptional agency" and ML research capabilities |
Sources: Anthropic RSP, OpenAI Preparedness Framework v2, DeepMind FSF v3
Safeguard Requirements by Level
```mermaid
flowchart TD
subgraph Anthropic["Anthropic ASL Standards"]
A1[ASL-1: No meaningful risk] --> A2[ASL-2: Current standard security]
A2 --> A3[ASL-3: Enhanced security + deployment controls]
A3 --> A4[ASL-4: Nation-state level security]
end
subgraph OpenAI["OpenAI Preparedness Levels"]
O1[Below High: Standard deployment] --> O2[High: Safeguards before deployment]
O2 --> O3[Critical: Safeguards during development]
end
subgraph DeepMind["DeepMind CCL Levels"]
D1[Below CCL: Standard practices] --> D2[CCL reached: Deployment mitigations]
D2 --> D3[CCL exceeded: Enhanced security + alignment]
end
style A3 fill:#fff3cd
style A4 fill:#ffddcc
style O3 fill:#ffddcc
style D3 fill:#ffddcc
```
How RSPs Work
RSPs create a framework linking capability levels to safety requirements. The core mechanism involves three interconnected processes: capability evaluation, safeguard assessment, and escalation decisions.
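In pseudocode terms, this is a simple if-then gate. The sketch below is a minimal illustration of that loop, using hypothetical domain names, scores, and thresholds rather than any lab's actual values or tooling:

```python
# Minimal sketch of the RSP gating loop described above.
# Domain names, scores, and thresholds are illustrative, not any lab's real values.
from dataclasses import dataclass

@dataclass
class EvalResult:
    domain: str          # e.g. "cbrn", "cyber", "ai_rnd"
    score: float         # capability score from internal + third-party evals
    threshold: float     # level that triggers enhanced safeguard requirements

def escalation_decision(results: list[EvalResult],
                        safeguards_adequate: bool) -> str:
    """Mirror the three-step flow: evaluate, assess safeguards, escalate."""
    crossed = [r for r in results if r.score >= r.threshold]
    if not crossed:
        return "continue"               # no threshold crossed: keep developing
    if safeguards_adequate:
        return "deploy_with_safeguards"
    return "pause"                      # core RSP commitment: pause until safety catches up

decision = escalation_decision(
    [EvalResult("cbrn", score=0.42, threshold=0.60),
     EvalResult("cyber", score=0.71, threshold=0.60)],
    safeguards_adequate=False,
)
print(decision)  # -> "pause": a threshold was crossed without adequate safeguards
```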
```mermaid
flowchart TD
subgraph Evaluation["1. Capability Evaluation"]
A[Model Checkpoint] --> B[Internal Evals]
B --> C[Third-Party Evals]
C --> D{Threshold Crossed?}
end
subgraph Assessment["2. Safeguard Assessment"]
D -->|Yes| E[Identify Required Safeguards]
E --> F[Current Safeguards Audit]
F --> G{Gap Analysis}
end
subgraph Decision["3. Escalation Decision"]
G -->|Adequate| H[Deploy with Safeguards]
G -->|Insufficient| I[Pause Training/Deployment]
I --> J[Develop New Safeguards]
J --> F
D -->|No| K[Continue Development]
end
H --> L[Monitor Post-Deployment]
K --> M[Next Training Run]
M --> A
style I fill:#ffcccc
style H fill:#ccffcc
style D fill:#fff3cd
style G fill:#fff3cd
```
RSP Ecosystem
The effectiveness of RSPs depends on a network of actors providing oversight, verification, and accountability:
```mermaid
flowchart TD
subgraph Labs["AI Developers (20 companies)"]
ANT[Anthropic<br/>ASL System]
OAI[OpenAI<br/>Preparedness]
GDM[Google DeepMind<br/>FSF]
OTHER[xAI, Meta, etc.]
end
subgraph Evaluators["Third-Party Evaluators"]
METR[METR<br/>Capability Evals]
APOLLO[Apollo Research<br/>Alignment Evals]
end
subgraph Governments["Government Bodies"]
UKAISI[UK AI Safety Institute]
USAISI[US AI Safety Institute]
INTL[Seoul/France Summits]
end
subgraph Public["Public Accountability"]
CIVIL[Civil Society<br/>SaferAI, FLI]
MEDIA[Media Coverage]
end
Labs -->|Pre-deployment access| Evaluators
Labs -->|Report results| Governments
Evaluators -->|Independent assessment| Governments
Governments -->|Commitments| Labs
CIVIL -->|Scorecard ratings| Labs
MEDIA -->|Public pressure| Labs
style ANT fill:#e8f4ea
style OAI fill:#e8f4ea
style GDM fill:#e8f4ea
style METR fill:#fff3cd
style UKAISI fill:#cce5ff
style USAISI fill:#cce5ff
```
Key Components
| Component | Description | Purpose |
|---|---|---|
| Capability Thresholds | Defined capability levels that trigger requirements | Create clear tripwires |
| Safety Levels | Required safeguards for each capability tier | Ensure safety scales with capability |
| Evaluations | Tests to determine capability and safety level | Provide evidence for decisions |
| Pause Commitments | Agreement to halt if safety is insufficient | Core accountability mechanism |
| Public Commitment | Published policy creates external accountability | Enable monitoring |
Anthropic's AI Safety Levels (ASL)
Anthropic's ASL system is modeled after the Biosafety Levels (BSL-1 through BSL-4) used for handling dangerous pathogens. Each level specifies both capability thresholds and required safeguards.
| Level | Capability Definition | Deployment Safeguards | Security Standard |
|---|---|---|---|
| ASL-1 | No meaningful catastrophic risk | Standard terms of service | Basic security hygiene |
| ASL-2 | Meaningful uplift but not beyond publicly available info | Content filtering, usage policies | Current security measures |
| ASL-3 | Significantly enhances non-state actor capabilities beyond public sources | Enhanced refusals, red-teaming, monitoring | Hardened infrastructure, insider threat protections |
| ASL-4 | Could substantially accelerate CBRN development or enable autonomous harm | Nation-state level protections (details TBD) | Air-gapped systems, extensive vetting |
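As a toy illustration, the ladder can be encoded as a lookup from assessed capability level to required safeguards. The wording below paraphrases the table above; this is not Anthropic's actual policy machinery:

```python
# Toy encoding of the ASL ladder from the table above (paraphrased wording,
# not Anthropic's actual tooling or policy text).
ASL_REQUIREMENTS = {
    1: {"deployment": "standard terms of service",
        "security": "basic security hygiene"},
    2: {"deployment": "content filtering, usage policies",
        "security": "current security measures"},
    3: {"deployment": "enhanced refusals, red-teaming, monitoring",
        "security": "hardened infrastructure, insider-threat protections"},
    4: {"deployment": "nation-state level protections (details TBD)",
        "security": "air-gapped systems, extensive vetting"},
}

def required_safeguards(capability_level: int) -> dict:
    # Safeguards must match or exceed the model's assessed capability level.
    return ASL_REQUIREMENTS[min(capability_level, 4)]

print(required_safeguards(3)["security"])
# -> "hardened infrastructure, insider-threat protections"
```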
Current Status (January 2026): Anthropic activated ASL-3 safeguards in May 2025 following evaluations of Claude Opus 4; Claude Opus 4, Opus 4.1, and Sonnet 4.5 operate under ASL-3 protections, while earlier Claude models remain at ASL-2.
RSP v2.0 Changes: The October 2024 update separated "ASL" to refer to safeguard standards rather than model categories, introducing distinct "Capability Thresholds" and "Required Safeguards." Critics argue this reduced specificity compared to v1.
OpenAI's Preparedness Framework
OpenAI's Preparedness Framework underwent a major revision in April 2025 (v2.0), simplifying from four risk levels to two actionable thresholds.
| Risk Domain | High Threshold | Critical Threshold |
|---|---|---|
| Bio/Chemical | Meaningful assistance to novices creating known threats | Unprecedented pathways to severe harm |
| Cybersecurity | New risks of scaled attacks and exploitation | Novel attack vectors threatening critical infrastructure |
| AI Self-improvement | Challenges for human control | Recursive improvement; 5x development speed |
Framework v2.0 Key Changes:
- Simplified from Low/Medium/High/Critical to just High and Critical
- Removed "Persuasion" as tracked category (now handled through standard safety)
- Added explicit threshold for recursive self-improvement: achieving generational improvement (e.g., o1 to o3) in 1/5th the 2024 development time
- Safety Advisory Group (SAG) now oversees all threshold determinations
Recent Evaluations: OpenAI's o3/o4-mini system card reported that neither model reached the High threshold in any tracked category, though biological and cyber capabilities continue trending upward.
Current Implementations
Lab Policy Publication Timeline
| Lab | Policy Name | Initial | Latest Version | Key Features |
|---|---|---|---|---|
| Anthropic | Responsible Scaling Policy | Sep 2023 | v2.2 (May 2025) | ASL levels, deployment/security standards, external evals |
| OpenAI | Preparedness Framework | Dec 2023 | v2.0 (Apr 2025) | High/Critical thresholds, SAG governance, tracked categories |
| Google DeepMind | Frontier Safety Framework | May 2024 | v3.0 (Oct 2025) | CCL levels, manipulation risk domain added |
| xAI | Safety Framework | 2024 | v1.0 | Evaluation and deployment procedures |
| Meta | Frontier Model Safety | 2024 | v1.0 | Purple-team evaluations, staged deployment |
Policy Adoption Timeline
| Date | Milestone | Companies/Details |
|---|---|---|
| Sep 2023 | First RSP published | Anthropic RSP v1.0 |
| Dec 2023 | Second framework | OpenAI Preparedness Framework |
| May 2024 | Seoul Summit | 16 companies sign commitments |
| May 2024 | Third framework | Google DeepMind FSF |
| Oct 2024 | Major revision | Anthropic RSP v2.0 (criticized for reduced specificity) |
| Apr 2025 | Framework update | OpenAI Preparedness v2.0 (simplified to High/Critical) |
| May 2025 | First ASL-3 | Anthropic activates elevated safeguards for Claude Opus 4 |
| Oct 2025 | Policy count | 20 companies with published policies |
| Dec 2025 | Third-party coverage | 12 of 20 companies with external evaluation arrangements |
Seoul Summit Commitments (May 2024)
The Seoul AI Safety Summit achieved a historic first: 16 frontier AI companies from the US, Europe, the Middle East, and Asia signed voluntary safety commitments. Signatories included Amazon, Anthropic, Cohere, G42, Google, IBM, Inflection AI, Meta, Microsoft, Mistral AI, Naver, OpenAI, Samsung, Technology Innovation Institute, xAI, and Zhipu.ai.
| Commitment | Description | Compliance Verification |
|---|---|---|
| Safety Framework Publication | Publish framework by France Summit 2025 | Public disclosure |
| Pre-deployment Evaluations | Test models for severe risks before deployment | Self-reported system cards |
| Dangerous Capability Reporting | Report discoveries to governments and other labs | Voluntary disclosure |
| Non-deployment Commitment | Do not deploy if risks cannot be mitigated | Self-assessed |
| Red-teaming | Internal and external adversarial testing | Third-party verification emerging |
| Cybersecurity | Protect model weights from theft | Industry standards |
Follow-up: An additional four companies have joined since May 2024. The France AI Action Summit (February 2025) reviewed compliance and expanded commitments.
Third-Party Evaluation Ecosystem
METR (Model Evaluation and Threat Research) has emerged as the leading independent evaluator, having conducted pre-deployment assessments for both Anthropic and OpenAI. Founded by Beth Barnes (a former OpenAI alignment researcher) in December 2023, METR does not accept compensation for evaluations, in order to maintain independence.
| Organization | Role | Labs Evaluated | Key Focus Areas |
|---|---|---|---|
| METR | Third-party capability evals | Anthropic, OpenAI | Dangerous capability evaluations, autonomous agent tasks |
| Apollo Research | Alignment and scheming evals | Anthropic, Google | In-context scheming, deceptive alignment detection |
| UK AI Safety Institute | Government evaluation body | Multiple labs | Independent testing, joint evaluation protocols |
| US AI Safety Institute (NIST) | US government coordination | Multiple labs | AISIC consortium, standards development |
METR's Role: METR's GPT-4.5 pre-deployment evaluation piloted a new form of third-party oversight: verifying developers' internal evaluation results rather than conducting fully independent assessments. This approach may scale better while maintaining accountability.
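A minimal sketch of what verification-style oversight could look like, assuming the evaluator spot-checks a sample of reported results by re-running tasks independently. The function names, sample size, and tolerance here are hypothetical; METR has not published its protocol in this form:

```python
# Hypothetical sketch of verification-style oversight: spot-check a sample of
# the developer's reported eval scores by re-running tasks independently.
import random

def verify_reported_results(reported: dict[str, float],
                            rerun_task,
                            sample_size: int = 5,
                            tolerance: float = 0.05) -> bool:
    """Return True if independently re-run scores match reports within tolerance."""
    sampled = random.sample(sorted(reported), min(sample_size, len(reported)))
    for task_id in sampled:
        independent_score = rerun_task(task_id)     # evaluator's own run
        if abs(independent_score - reported[task_id]) > tolerance:
            return False                            # discrepancy: escalate for review
    return True
```

The design trade-off is the one the text names: sampling a developer's own results is far cheaper than a fully independent evaluation, but it inherits the developer's task design and elicitation choices.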
Coverage Gap: As of late 2025, while 20 companies had published frontier safety policies, METR's analysis found that third-party evaluation coverage remains inconsistent, with most pre-deployment evaluations occurring only for the largest US labs.
Limitations and Challenges
Structural Issues
| Issue | Description | Severity |
|---|---|---|
| Voluntary | No legal enforcement mechanism | High |
| Self-defined thresholds | Labs set their own standards | High |
| Competitive pressure | Incentive to interpret permissively | High |
| Evaluation limitations | Evals may miss important risks | High |
| Public commitment only | Limited verification of compliance | Medium |
| Evolving policies | Policies can be changed by labs | Medium |
The Evaluation Problem
RSPs are only as good as the evaluations that trigger them:
| Challenge | Explanation |
|---|---|
| Unknown risks | Can't test for capabilities we haven't imagined |
| Sandbagging | Models might hide capabilities during evaluation |
| Elicitation difficulty | True capabilities may not be revealed |
| Threshold calibration | Hard to know where thresholds should be |
| Deceptive alignment | Sophisticated models may game evaluations |
Competitive Dynamics
| Scenario | Lab Behavior | Safety Outcome |
|---|---|---|
| Mutual commitment | All labs follow RSPs | Good |
| One defector | Others follow, one cuts corners | Bad (defector gains advantage) |
| Many defectors | Race to bottom | Very Bad |
| External pressure | Regulation enforces standards | Potentially Good |
Key Cruxes
Summary of Disagreements
| Crux | Optimistic View | Pessimistic View | Key Evidence |
|---|---|---|---|
| Lab Commitment | Reputational stake, genuine safety motivation | No enforcement, commercial pressure dominates | 0 documented major violations; 3 procedural issues self-reported |
| Threshold Appropriateness | Expert judgment, iterative improvement | Conflict of interest, designed non-binding | SaferAI grades 1.9-2.2/5 for specificity |
| Evaluation Effectiveness | 5+ pre-deployment evals conducted; science improving | Can't detect unknown unknowns; sandbagging possible | METR found o3 "prone to reward hacking" |
| Competitive Dynamics | Mutual commitment creates equilibrium | Race to bottom under pressure | 3+ frontier labs; ≈7-month capability doubling |
| Timeline | Governance can keep pace | Capabilities outrun safeguards | 20 policies published in 26 months |
Crux 1: Will Labs Honor Their Commitments?
| Position: Yes | Position: No |
|---|---|
| Reputational stake in commitment | Competitive pressure to continue |
| Some genuine safety motivation | No enforcement mechanism |
| Third-party verification helps | History of moving goalposts |
| Public accountability creates pressure | Commercial interests dominate |
Crux 2: Are RSP Thresholds Set Appropriately?
| Position: Appropriate | Position: Too Permissive |
|---|---|
| Based on expert judgment | Labs set their own standards |
| Updated as understanding improves | Conflict of interest |
| Better than no thresholds | May be designed to be non-binding |
| Include safety margins | Racing pressure to minimize |
Crux 3: Can Evaluations Trigger RSPs Effectively?
| Position: Yes | Position: No |
|---|---|
| Eval science is improving | Can't detect what we don't test for |
| Third-party evals add accountability | Deceptive models could sandbag |
| Explicit triggers create clarity | Thresholds may be wrong |
| Better than pure judgment calls | Gaming evaluations is incentivized |
Analysis of RSP Effectiveness
Quantitative Evidence
| Metric | Value | Source | Trend |
|---|---|---|---|
| Companies with published policies | 20 (Dec 2025) | METR Common Elements | ↑ from 12 in May 2024 |
| Seoul Summit signatories | 16 (May 2024) | UK Gov | +4 since summit |
| Third-party pre-deployment evals | 5+ models (2024-25) | METR | GPT-4.5, Claude 3.5, o3, o4-mini |
| SaferAI Policy Grades | 1.9-2.2/5 | SaferAI | All major labs in "weak" category |
| Capability doubling time | ≈7 months | METR | Task length agents can complete |
| Lab-reported compliance issues | 3+ procedural | Anthropic RSP | Self-reported in 2024 review |
| Models at elevated safety levels | 3 (Claude Opus 4, 4.1, Sonnet 4.5) | Anthropic | ASL-3 activated May 2025 |
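To make the ≈7-month doubling figure concrete, here is a back-of-envelope projection, assuming the trend simply continues (which is itself contested) and an illustrative starting point:

```python
# Back-of-envelope projection of METR's time-horizon metric, assuming a
# constant ~7-month doubling time. Trend continuation is contested, and the
# starting horizon below is illustrative rather than a measured value.
DOUBLING_MONTHS = 7
current_horizon_hours = 1.0

for months_ahead in (12, 24, 36):
    horizon = current_horizon_hours * 2 ** (months_ahead / DOUBLING_MONTHS)
    print(f"+{months_ahead} mo: ~{horizon:.0f}-hour tasks")
# 12 months -> ~3x, 24 months -> ~11x, 36 months -> ~35x the current horizon
```

The point of the arithmetic is the mismatch it implies: capability roughly triples each year under this trend, while evaluation methods and threshold definitions are revised on multi-month to yearly cycles.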
Strengths
| Strength | Explanation |
|---|---|
| Explicit commitments | Creates accountability through specificity |
| Public pressure | Visible commitments enable monitoring |
| Third-party verification | External evaluation adds credibility |
| Adaptive framework | Can update as understanding improves |
| Industry coordination | Creates shared standards |
Weaknesses
| Weakness | Explanation |
|---|---|
| Voluntary nature | No legal consequences for violations |
| Self-defined thresholds | Conflict of interest in setting standards |
| Competitive pressure | Racing incentives undermine commitment |
| Evaluation limitations | Evals may not catch real dangers |
| Policy evolution | Labs can change policies over time |
What Would Improve RSPs?
Near-Term Improvements
| Improvement | Mechanism | Feasibility |
|---|---|---|
| Third-party verification | Independent audit of compliance | High |
| Standardized thresholds | Industry-wide capability definitions | Medium |
| Mandatory reporting | Legal requirements for disclosure | Medium |
| Binding commitments | Legal liability for violations | Low-Medium |
| International coordination | Cross-border standards | Low |
Longer-Term Vision
| Improvement | Description |
|---|---|
| Regulatory backstop | Government enforcement if voluntary fails |
| Standardized evals | Shared evaluation suites across labs |
| International treaty | Binding international commitments |
| Continuous verification | Ongoing monitoring rather than point-in-time |
Who Should Work on This?
Good fit if you believe:
- Industry self-governance can work with proper incentives
- Creating accountability structures is valuable
- Incremental governance improvements help
- RSPs can evolve into stronger mechanisms
Less relevant if you believe:
- Voluntary commitments are inherently unreliable
- Labs will never meaningfully constrain themselves
- Focus should be on mandatory regulation
- Evaluations can't capture real risks
Sources & Resources
Primary Policy Documents
| Document | Organization | Latest Version | URL |
|---|---|---|---|
| Responsible Scaling Policy | Anthropic | v2.2 (May 2025) | anthropic.com/responsible-scaling-policy |
| RSP Announcement & Updates | Anthropic | Ongoing | anthropic.com/news/rsp-updates |
| Preparedness Framework | OpenAI | v2.0 (Apr 2025) | cdn.openai.com/preparedness-framework-v2.pdf |
| Frontier Safety Framework | Google DeepMind | v3.0 (Oct 2025) | deepmind.google/frontier-safety-framework |
| Seoul Summit Commitments | UK Government | May 2024 | gov.uk/frontier-ai-safety-commitments |
Analysis & Commentary
| Source | Focus | Key Finding |
|---|---|---|
| METR: Common Elements Analysis | Cross-lab comparison | 12 companies published policies; significant variation in specificity |
| SaferAI: RSP Update Critique | Anthropic v2.0 | Reduced specificity from quantitative to qualitative thresholds |
| FAS: Can Preparedness Frameworks Pull Their Weight? | Framework effectiveness | Questions whether voluntary commitments can constrain behavior |
| METR: RSP Analysis (2023) | Original RSP assessment | Early evaluation of the RSP concept and implementation |
Third-Party Evaluation Resources
- METR: Primary third-party evaluator for frontier models
- METR Dangerous Capability Evaluations: Methodology for capability assessment
- METR GPT-4.5 Pre-deployment Evals: Example of third-party verification process
Key Critiques
| Critique | Explanation | Counterargument |
|---|---|---|
| Voluntary and unenforceable | No legal mechanism to ensure compliance | Reputational costs and potential regulatory backstop |
| Labs set their own thresholds | Inherent conflict of interest | Third-party input and public accountability |
| Competitive pressure | Racing dynamics undermine commitment | Mutual commitment creates coordination equilibrium |
| Evaluation limitations | Can't test for unknown capabilities | Improving eval science; multiple redundant assessments |
| Policy evolution | Labs can weaken policies over time | Public tracking; external pressure for strengthening |
Evaluation Methodologies
RSP effectiveness depends on the quality of evaluations that trigger safeguard requirements. Current approaches include:
Capability Evaluation Approaches
| Evaluation Type | Description | Strengths | Weaknesses |
|---|---|---|---|
| Benchmark suites | Standardized tests (MMLU, HumanEval, etc.) | Reproducible, comparable | May not capture dangerous capabilities |
| Red-teaming | Adversarial testing by experts | Finds real-world attack vectors | Expensive, not comprehensive |
| Uplift studies | Compare AI-assisted vs. unassisted task completion (see the sketch after this table) | Directly measures counterfactual risk | Hard to simulate real adversaries |
| Autonomous agent tasks | Long-horizon task completion | Tests agentic capabilities | Scaffolding matters; hard to standardize |
| Expert knowledge tests | Domain-specific Q&A (e.g., virology) | Measures depth in dangerous domains | Experts may not know all dangerous knowledge |
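A minimal sketch of the uplift-study arithmetic referenced in the table above, with purely illustrative numbers rather than results from any published study:

```python
# Minimal uplift-study arithmetic: compare task success with and without model
# assistance. All numbers are illustrative, not from any published study.
assisted_successes, assisted_n = 18, 40      # participants with model access
control_successes, control_n = 6, 40         # participants with internet only

p_assisted = assisted_successes / assisted_n
p_control = control_successes / control_n
uplift_ratio = p_assisted / p_control        # relative "uplift" from the model

print(f"assisted: {p_assisted:.0%}, control: {p_control:.0%}, "
      f"uplift: {uplift_ratio:.1f}x")        # -> assisted: 45%, control: 15%, uplift: 3.0x
```

Real uplift studies add complications this sketch omits, such as participant skill matching, statistical significance testing, and the difficulty of recruiting proxies for genuinely motivated adversaries.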
Key Metrics and Thresholds
| Metric | Current Benchmark | ASL-3 Trigger (Anthropic) | High Trigger (OpenAI) |
|---|---|---|---|
| Bio knowledge | Expert-level Q&A | Exceeds 95th percentile virologist | Meaningful uplift for novices |
| Cyber capability | CTF performance | Autonomous exploitation of hardened targets | Scaled attack assistance |
| AI R&D automation | RE-Bench performance | Substantially accelerates timeline | 5x speedup threshold |
| Autonomous task length | 1-hour tasks | Multi-day autonomous operation | Extended resource acquisition |
Evaluation Limitations
Current evaluations face fundamental challenges that limit RSP effectiveness:
- Unknown unknowns: Cannot test for capabilities not yet imagined
- Sandbagging risk: Models may underperform intentionally during evaluation
- Elicitation gap: True capabilities may exceed measured capabilities
- Threshold calibration: Optimal threshold placement is uncertain
- Combinatorial risks: Safe capabilities may combine dangerously
References
- METR, Common Elements of Frontier AI Safety Policies: analyzes the common structural elements across frontier AI safety policies published by major AI companies, identifying shared frameworks around capability thresholds, model evaluations, weight security, deployment mitigations, and accountability mechanisms. The December 2025 version covers twelve companies including Anthropic, OpenAI, Google DeepMind, and Meta, and incorporates references to the EU AI Act's General-Purpose AI Code of Practice and California's Senate Bill 53.
- METR (organization): conducts research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. METR developed the "time horizon" metric measuring how long AI agents can autonomously complete software tasks, which has shown exponential growth over recent years, and works with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.
- SaferAI, critique of Anthropic's RSP update: argues that the revisions weaken safety commitments rather than strengthening them, relaxing key thresholds and evaluation requirements and reducing accountability for frontier AI deployment. A critical external perspective on how voluntary safety frameworks can erode over time.
- UK Government, Frontier AI Safety Commitments (Seoul Summit, May 2024): a collection of voluntary safety commitments made by leading AI companies, building on the Bletchley Declaration. Companies pledge to publish safety frameworks, conduct pre-deployment evaluations, share safety information, and establish responsible scaling thresholds before deploying frontier AI models.
- Anthropic, Responsible Scaling Policy: documents the framework tying AI development and deployment decisions to demonstrated capability thresholds and corresponding safety measures. It outlines commitments to pause or restrict scaling if AI systems reach certain dangerous capability levels without adequate safeguards, and tracks updates to the policy over time.