Anthropic Responsible Scaling Policy (October 15, 2024)
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
This is the October 2024 revision of Anthropic's RSP, a foundational industry governance document that directly shapes how Anthropic decides when and whether to deploy frontier models. It is essential reading for understanding frontier-lab safety commitments.
Metadata
Summary
Anthropic's Responsible Scaling Policy (RSP) establishes a framework of AI Safety Levels (ASLs) that tie model deployment and development decisions to demonstrated safety and security standards. It commits Anthropic to evaluating frontier models for dangerous capabilities thresholds and mandating corresponding protective measures before scaling further. The policy represents a concrete industry attempt to operationalize safety commitments through binding internal governance.
Key Points
- Defines AI Safety Levels (ASL-1 through ASL-4+) analogous to biosafety levels, with each tier requiring stricter safety and security measures
- Mandates capability evaluations before and after training runs to determine whether models cross dangerous thresholds (e.g., CBRN uplift, autonomous replication)
- Includes an "if in doubt, don't deploy" principle: models must be shown to be safe enough before release rather than assumed safe by default
- Specifies concrete required safeguards for ASL-3 systems, including enhanced security against model theft and stronger deployment restrictions
- Represents a self-imposed, externally communicable commitment intended to set industry norms and invite third-party accountability
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Anthropic Core Views | Safety Agenda | 62.0 |
| Capability Elicitation | Approach | 91.0 |
Cached Content Preview
# Responsible Scaling Policy
Effective October 15, 2024
# Executive Summary
In September 2023, we released our Responsible Scaling Policy (RSP), a public commitment not to train or deploy models capable of causing catastrophic harm unless we have implemented safety and security measures that will keep risks below acceptable levels. We are now updating our RSP to account for the lessons we’ve learned over the last year. This updated policy reflects our view that risk governance in this rapidly evolving domain should be proportional, iterative, and exportable.
Background. AI Safety Level Standards (ASL Standards) are a set of technical and operational measures for safely training and deploying frontier AI models. These currently fall into two categories: Deployment Standards and Security Standards. As model capabilities increase, so will the need for stronger safeguards, which are captured in successively higher ASL Standards. At present, all of our models must meet the ASL-2 Deployment and Security Standards. To determine when a model has become sufficiently advanced such that its deployment and security measures should be strengthened, we use the concepts of Capability Thresholds and Required Safeguards. A Capability Threshold tells us when we need to upgrade our protections, and the corresponding Required Safeguards tell us what standard should apply.
Capability Thresholds and Required Safeguards. The Required Safeguards for each Capability Threshold are intended to mitigate risk to acceptable levels. This update to our RSP provides specifications for Capability Thresholds related to Chemical, Biological, Radiological, and Nuclear (CBRN) weapons and Autonomous AI Research and Development (AI R&D) and identifies the corresponding Required Safeguards.
Capability assessment. We will routinely test models to determine whether their capabilities fall sufficiently far below the Capability Thresholds such that the ASL-2 Standard remains appropriate. We will first conduct preliminary assessments to determine whether a more comprehensive evaluation is needed. For models requiring comprehensive testing, we will assess whether the model is unlikely to reach any relevant Capability Thresholds absent surprising advances in widely accessible post-training enhancements. If, after the comprehensive testing, we determine that the model is sufficiently below the relevant Capability Thresholds, then we will continue to apply the ASL-2 Standard. If, however, we are unable to make the required showing, we will act as though the model has surpassed the Capability Threshold. This means that we will both upgrade to the ASL-3 Required Safeguards and conduct a follow-up capability assessment to confirm that the ASL-4 Standard is not necessary.
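The assessment flow above is a decision procedure: a preliminary assessment gates a comprehensive evaluation, and failure to show the model is sufficiently below the thresholds defaults to the stricter standard. A minimal illustrative sketch of that logic (not Anthropic's actual tooling; the function and its parameters are hypothetical):

```python
# Hypothetical sketch of the capability-assessment decision flow described
# in the RSP excerpt above. Names and structure are illustrative only.
from enum import Enum


class Standard(Enum):
    ASL_2 = 2
    ASL_3 = 3


def required_standard(needs_comprehensive_eval: bool,
                      shown_below_thresholds: bool) -> Standard:
    """Return the safeguard standard implied by an assessment outcome.

    needs_comprehensive_eval: the preliminary assessment flagged the model
        for more comprehensive testing.
    shown_below_thresholds: comprehensive testing showed the model is
        unlikely to reach any relevant Capability Threshold absent
        surprising advances in widely accessible post-training enhancements.
    """
    if not needs_comprehensive_eval:
        # Preliminary assessment suffices; ASL-2 remains appropriate.
        return Standard.ASL_2
    if shown_below_thresholds:
        # The required showing was made; continue applying ASL-2.
        return Standard.ASL_2
    # Unable to make the required showing: act as though the threshold was
    # crossed, i.e. upgrade to ASL-3 Required Safeguards (and, per the
    # policy, conduct a follow-up assessment regarding ASL-4).
    return Standard.ASL_3
```

Note the asymmetry the policy commits to: inconclusive evidence is treated as a threshold crossing, matching the "if in doubt, don't deploy" principle.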
Safeguards assessment. To determine whether the measures we have adopted satisfy the ASL-3 Required Safeguards, we will conduct a safeguards assessment. For the ASL-3 Deployment Standard, we will evaluate wheth
... (truncated, 56 KB total)