Anthropic Responsible Scaling Policy (October 15, 2024)
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
This is the October 2024 revision of Anthropic's RSP, a foundational industry governance document that directly shapes how Anthropic decides when and whether to deploy frontier models. It is essential reading for understanding frontier-lab safety commitments.
Metadata
Summary
Anthropic's Responsible Scaling Policy (RSP) establishes a framework of AI Safety Levels (ASLs) that tie model deployment and development decisions to demonstrated safety and security standards. It commits Anthropic to evaluating frontier models for dangerous capabilities thresholds and mandating corresponding protective measures before scaling further. The policy represents a concrete industry attempt to operationalize safety commitments through binding internal governance.
Key Points
- Defines AI Safety Levels (ASL-1 through ASL-4+) analogous to biosafety levels, with each tier requiring stricter safety and security measures
- Mandates capability evaluations before and after training runs to determine whether models cross dangerous thresholds (e.g., CBRN uplift, autonomous replication)
- Includes an "if in doubt, don't deploy" principle: models must be shown to be safe enough before release rather than assumed safe by default
- Specifies concrete required safeguards for ASL-3 systems, including enhanced security against model theft and stronger deployment restrictions
- Represents a self-imposed, externally communicable commitment intended to set industry norms and invite third-party accountability
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Anthropic Core Views | Safety Agenda | 62.0 |
| Capability Elicitation | Approach | 91.0 |
Cached Content Preview
# Responsible Scaling Policy
Effective October 15, 2024
# Executive Summary
In September 2023, we released our Responsible Scaling Policy (RSP), a public commitment not to train or deploy models capable of causing catastrophic harm unless we have implemented safety and security measures that will keep risks below acceptable levels. We are now updating our RSP to account for the lessons we’ve learned over the last year. This updated policy reflects our view that risk governance in this rapidly evolving domain should be proportional, iterative, and exportable.
Background. AI Safety Level Standards (ASL Standards) are a set of technical and operational measures for safely training and deploying frontier AI models. These currently fall into two categories: Deployment Standards and Security Standards. As model capabilities increase, so will the need for stronger safeguards, which are captured in successively higher ASL Standards. At present, all of our models must meet the ASL-2 Deployment and Security Standards. To determine when a model has become sufficiently advanced such that its deployment and security measures should be strengthened, we use the concepts of Capability Thresholds and Required Safeguards. A Capability Threshold tells us when we need to upgrade our protections, and the corresponding Required Safeguards tell us what standard should apply.
Capability Thresholds and Required Safeguards. The Required Safeguards for each Capability Threshold are intended to mitigate risk to acceptable levels. This update to our RSP provides specifications for Capability Thresholds related to Chemical, Biological, Radiological, and Nuclear (CBRN) weapons and Autonomous AI Research and Development (AI R&D) and identifies the corresponding Required Safeguards.
Capability assessment. We will routinely test models to determine whether their capabilities fall sufficiently far below the Capability Thresholds such that the ASL-2 Standard remains appropriate. We will first conduct preliminary assessments to determine whether a more comprehensive evaluation is needed. For models requiring comprehensive testing, we will assess whether the model is unlikely to reach any relevant Capability Thresholds absent surprising advances in widely accessible post-training enhancements. If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. If, however, we are unable to make the required showing, we will act as though the model has surpassed the Capability Threshold. This means that we will both upgrade to the ASL-3 Required Safeguards and conduct a follow-up capability assessment to confirm that the ASL-4 Standard is not necessary.
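The assessment flow above is a decision procedure: a preliminary assessment gates a comprehensive evaluation, and failure to show the model is sufficiently below the thresholds defaults to the stricter standard. A minimal illustrative sketch of that logic (not Anthropic's actual tooling; the function and its parameters are hypothetical):

```python
# Hypothetical sketch of the capability-assessment decision flow described
# in the RSP excerpt above. Names and structure are illustrative only.
from enum import Enum


class Standard(Enum):
    ASL_2 = 2
    ASL_3 = 3


def required_standard(needs_comprehensive_eval: bool,
                      shown_below_thresholds: bool) -> Standard:
    """Return the safeguard standard implied by an assessment outcome.

    needs_comprehensive_eval: the preliminary assessment flagged the model
        for more comprehensive testing.
    shown_below_thresholds: comprehensive testing showed the model is
        unlikely to reach any relevant Capability Threshold absent
        surprising advances in widely accessible post-training enhancements.
    """
    if not needs_comprehensive_eval:
        # Preliminary assessment suffices; ASL-2 remains appropriate.
        return Standard.ASL_2
    if shown_below_thresholds:
        # The required showing was made; continue applying ASL-2.
        return Standard.ASL_2
    # Unable to make the required showing: act as though the threshold was
    # crossed, i.e. upgrade to ASL-3 Required Safeguards (and, per the
    # policy, conduct a follow-up assessment regarding ASL-4).
    return Standard.ASL_3
```

Note the asymmetry the policy commits to: inconclusive evidence is treated as a threshold crossing, matching the "if in doubt, don't deploy" principle.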
Safeguards assessment. To determine whether the measures we have adopted satisfy the ASL-3 Required Safeguards, we will conduct a safeguards assessment. For the ASL-3 Deployment Standard, we will evaluate wheth
... (truncated, 56 KB total)