Longterm Wiki

Anthropic Responsible Scaling Policy (October 15, 2024)

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

This is the October 2024 revision of Anthropic's RSP, a foundational industry governance document that directly shapes how Anthropic decides when and whether to deploy frontier models, making it essential reading for understanding frontier lab safety commitments.

Metadata

Importance: 88/100 · policy brief · primary source

Summary

Anthropic's Responsible Scaling Policy (RSP) establishes a framework of AI Safety Levels (ASLs) that tie model deployment and development decisions to demonstrated safety and security standards. It commits Anthropic to evaluating frontier models for dangerous capabilities thresholds and mandating corresponding protective measures before scaling further. The policy represents a concrete industry attempt to operationalize safety commitments through binding internal governance.

Key Points

  • Defines AI Safety Levels (ASL-1 through ASL-4+) analogous to biosafety levels, with each tier requiring stricter safety and security measures
  • Mandates capability evaluations before and after training runs to determine if models cross dangerous thresholds (e.g., CBRN uplift, autonomous replication)
  • Includes an 'if in doubt, don't deploy' principle: models must be shown to be safe enough before release rather than assumed safe by default
  • Specifies concrete required safeguards for ASL-3 systems including enhanced security against model theft and stronger deployment restrictions
  • Represents a self-imposed, externally communicable commitment intended to set industry norms and invite third-party accountability

Cited by 2 pages

| Page | Type | Quality |
|---|---|---|
| Anthropic Core Views | Safety Agenda | 62.0 |
| Capability Elicitation | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 56 KB
ANTHROP\C

# Responsible Scaling Policy

Effective October 15, 2024

# Executive Summary

In September 2023, we released our Responsible Scaling Policy (RSP), a public commitment not to train or deploy models capable of causing catastrophic harm unless we have implemented safety and security measures that will keep risks below acceptable levels. We are now updating our RSP to account for the lessons we’ve learned over the last year. This updated policy reflects our view that risk governance in this rapidly evolving domain should be proportional, iterative, and exportable.

Background. AI Safety Level Standards (ASL Standards) are a set of technical and operational measures for safely training and deploying frontier AI models. These currently fall into two categories: Deployment Standards and Security Standards. As model capabilities increase, so will the need for stronger safeguards, which are captured in successively higher ASL Standards. At present, all of our models must meet the ASL-2 Deployment and Security Standards. To determine when a model has become sufficiently advanced such that its deployment and security measures should be strengthened, we use the concepts of Capability Thresholds and Required Safeguards. A Capability Threshold tells us when we need to upgrade our protections, and the corresponding Required Safeguards tell us what standard should apply.
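The relationship described above — each Capability Threshold implies a corresponding Required Safeguards standard — can be sketched as a small lookup. This is a purely illustrative model of the policy's logic, not Anthropic's tooling; the enum, the mapping structure, and the function names are assumptions for exposition:

```python
from enum import IntEnum


class ASL(IntEnum):
    """AI Safety Levels: higher values require stricter deployment
    and security safeguards (illustrative encoding)."""
    ASL_2 = 2
    ASL_3 = 3
    ASL_4 = 4


# Illustrative mapping from Capability Thresholds to Required Safeguards.
# The threshold names follow the policy text; the structure is an assumption.
REQUIRED_SAFEGUARDS: dict[str, ASL] = {
    "CBRN weapons uplift": ASL.ASL_3,
    "Autonomous AI R&D": ASL.ASL_3,
}


def required_level(crossed_thresholds: set[str]) -> ASL:
    """Return the strictest ASL implied by the thresholds a model
    has crossed; ASL-2 is the baseline for all current models."""
    levels = [REQUIRED_SAFEGUARDS[t] for t in crossed_thresholds]
    return max(levels, default=ASL.ASL_2)
```

Under this sketch, a model that crosses no threshold stays at the ASL-2 baseline, and crossing any listed threshold triggers the corresponding higher standard.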

Capability Thresholds and Required Safeguards. The Required Safeguards for each Capability Threshold are intended to mitigate risk to acceptable levels. This update to our RSP provides specifications for Capability Thresholds related to Chemical, Biological, Radiological, and Nuclear (CBRN) weapons and Autonomous AI Research and Development (AI R&D) and identifies the corresponding Required Safeguards.

Capability assessment. We will routinely test models to determine whether their capabilities fall sufficiently far below the Capability Thresholds such that the ASL-2 Standard remains appropriate. We will first conduct preliminary assessments to determine whether a more comprehensive evaluation is needed. For models requiring comprehensive testing, we will assess whether the model is unlikely to reach any relevant Capability Thresholds absent surprising advances in widely accessible post-training enhancements. If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. If, however, we are unable to make the required showing, we will act as though the model has surpassed the Capability Threshold. This means that we will both upgrade to the ASL-3 Required Safeguards and conduct a follow-up capability assessment to confirm that the ASL-4 Standard is not necessary.
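The assessment flow in this paragraph — a preliminary screen, comprehensive testing if needed, then either retaining ASL-2 or upgrading and re-assessing — can be sketched as illustrative decision logic. The dataclass fields and function are hypothetical names for exposition, not Anthropic's actual process:

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Illustrative capability-assessment result (field names assumed)."""
    needs_comprehensive_eval: bool   # outcome of the preliminary screen
    shown_below_thresholds: bool     # outcome of comprehensive testing, if run


def decide_standard(a: Assessment) -> str:
    """Mirror the policy's decision flow: if the required showing cannot
    be made, act as though the Capability Threshold has been surpassed."""
    if not a.needs_comprehensive_eval:
        # Preliminary assessment: clearly far below thresholds.
        return "ASL-2"
    if a.shown_below_thresholds:
        # Comprehensive testing made the required showing.
        return "ASL-2"
    # Unable to rule out the threshold: upgrade safeguards and conduct a
    # follow-up assessment to confirm ASL-4 is not necessary.
    return "ASL-3 (with follow-up assessment for ASL-4)"
```

Note the asymmetry the policy commits to: the burden of proof sits with the evaluation, so an inconclusive result leads to stronger safeguards rather than the status quo.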

Safeguards assessment. To determine whether the measures we have adopted satisfy the ASL-3 Required Safeguards, we will conduct a safeguards assessment. For the ASL-3 Deployment Standard, we will evaluate wheth

... (truncated, 56 KB total)
Resource ID: 135450f83343d9ae | Stable ID: Y2YyNjcxNT