Anthropic ASL-3 Deployment Safeguards (https://anthropic.com/asl3-deployment-safeguards)

web

Anthropic·anthropic.com/asl3-deployment-safeguards

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

This is Anthropic's official compliance report for their Responsible Scaling Policy's ASL-3 threshold, providing a rare concrete example of how a frontier lab operationalizes safety standards for models with serious CBRN uplift potential.

Metadata

Importance: 78/100 · guidance document · primary source

Summary

Anthropic's public report detailing the safeguards implemented to meet the AI Safety Level 3 (ASL-3) Deployment Standard under their Responsible Scaling Policy, focused on preventing misuse for CBRN (chemical, biological, radiological, nuclear) weapons development. The report describes real-time classifiers, offline jailbreak detection, vetted user access, bug bounty programs, and threat intelligence contracts. It presents evidence from red-teaming that these measures substantially raise the difficulty of extracting harmful information.

Key Points

• ASL-3 Deployment Standard requires 'robust' measures against persistent misuse; operationalized as blocking AI-assisted CBRN weapons development pathways.
• Threat model focuses on actors needing sustained multi-query guidance (weeks/months) and likely resorting to 'highly effective universal jailbreaks' to circumvent safeguards.
• Safeguards include real-time classifiers, offline jailbreak detectors, manually vetted high-trust user access, bug bounty rewards, and third-party threat intelligence monitoring.
• Red-teaming evidence shows current defenses substantially increase time/skill required for misuse; a small number of effective jailbreaks were found and remediated.
• Report is a redacted public version; it omits the proprietary details, threat-enabling information, and preliminary analyses that the unredacted internal version contains.
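
As a rough illustration of how the layered safeguards listed above might fit together, here is a minimal Python sketch. It is not Anthropic's implementation: `SafeguardPipeline`, `toy_score`, `BLOCK_THRESHOLD`, and the `FLAGGED` placeholder token are all hypothetical names invented for this example, and the choice to queue every request for offline review is an assumption, not something the report states.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

BLOCKED = "[blocked by safeguard]"  # hypothetical refusal message
BLOCK_THRESHOLD = 0.5               # hypothetical cutoff, not from the report


@dataclass
class Request:
    user_id: str
    prompt: str


def toy_score(text: str) -> float:
    """Placeholder for a trained classifier: returns a risk score in [0, 1]."""
    return 1.0 if "FLAGGED" in text else 0.0


@dataclass
class SafeguardPipeline:
    score_fn: Callable[[str], float]
    vetted_users: Set[str] = field(default_factory=set)
    offline_queue: List[Request] = field(default_factory=list)

    def handle(self, request: Request, generate: Callable[[str], str]) -> str:
        # Assumption: every request is kept for later offline analysis.
        self.offline_queue.append(request)

        # Manually vetted users bypass the real-time classifier.
        if request.user_id in self.vetted_users:
            return generate(request.prompt)

        # Real-time check on the input, then on the generated output.
        if self.score_fn(request.prompt) >= BLOCK_THRESHOLD:
            return BLOCKED
        output = generate(request.prompt)
        if self.score_fn(output) >= BLOCK_THRESHOLD:
            return BLOCKED
        return output
```

The sketch only shows the control flow: a real-time gate for ordinary users, a bypass for vetted users, and a record kept for slower offline analysis. The bug bounty and threat intelligence measures are organizational processes and are not modeled here.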

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 48 KB

AI Safety Level 3 Deployment Safeguards Report
May 2025
anthropic.com

This is a redacted version of the AI Safety Level 3 (ASL-3) Safeguards Report prepared per our
Responsible Scaling Policy's safeguards assessment obligation. The unredacted
Anthropic-internal version contains more detail related to sensitive content such as
proprietary company information, information that could enable threat actors, and
preliminary analyses subject to ongoing refinement.

Executive Summary
This report details our efforts at meeting the AI Safety Level 3 (ASL-3) Deployment
Standard laid out in our Responsible Scaling Policy, with respect to CBRN capabilities.¹ It
argues that the safeguards discussed here qualify for this standard.
Our approach to safety is iterative and ongoing. We expect new vulnerabilities and
limitations of our safeguards will emerge, and we expect to continuously improve our
safeguards based on real-world experience and ongoing safety testing.
Operationalizing the standard. The ASL-3 Deployment Standard requires us to adopt
measures that “make us robust to persistent attempts to misuse” an AI model. We
operationalize this as a requirement to (1) identify the most salient and high-likelihood
paths by which AI could help individuals to create/obtain and deploy CBRN weapons; and
(2) implement significant obstacles to these paths to harm.
Threat model. We have focused on risks of threat actors using guidance from AI models to
develop and deploy weapons of concern. We expect that these threat actors will require
substantial and persistent guidance, involving dozens of queries over extended periods of
time (weeks if not months). Given model safeguards implemented to prevent threat actors
from obtaining such guidance, the threat actors would likely need to employ “jailbreak”
techniques to circumvent these protections. The most concerning techniques are “highly
effective universal jailbreaks” that extract detailed, accurate, and comprehensive
information for the majority of questions within a domain, despite the safeguards in place.
Safeguards plan. We implement the following measures in order to meet the standard:
● We implement real-time classifier guards trained to block uses of concern.
● We use more powerful offline classifiers to identify potential jailbreaks and jailbreak
attempts.
¹ For more information on our implementation of the ASL-3 Deployment and Security Standards, see
our report, “Activating AI Safety Level 3 Protections.”
● We allow some users to access models whose output isn’t restricted by these
classifiers. We manually vet these users, examining organizational legitimacy,
compliance with our policies, and information security practices.
● We operate a bug bounty program with substantial rewards for reporting universal
jailbreaks of our defenses.
● We contract with third-party vendors for threat intelligence: monitoring public
forums, black markets, and other sources for signs of jailbreaks and/o

... (truncated, 48 KB total)

Resource ID: 75d5c32883c85b64 | Stable ID: sid_GYHYM1uw18