Constitutional Classifiers
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Published by Anthropic, this work extends Constitutional AI principles to inference-time safety classifiers, offering a practical defense mechanism against jailbreak attempts relevant to deployed AI safety research.
Metadata
Summary
Anthropic introduces Constitutional Classifiers, a system that uses constitutional principles to train input/output classifiers that defend against universal jailbreaks attempting to extract harmful information. The approach demonstrates strong robustness against automated and human red-teaming efforts while maintaining low false positive rates, representing a practical safety layer for deployed AI systems.
Key Points
- Constitutional Classifiers use a set of natural language principles (a 'constitution') to synthetically generate training data for safety classifiers
- The system defends against universal jailbreaks—attacks designed to bypass safety measures across a wide range of harmful prompts
- Red-team evaluations found no successful universal jailbreaks, while false positive rates on benign prompts remained low
- The approach extends Anthropic's Constitutional AI methodology from training to inference-time classification
- Classifiers are designed to be robust even against adversaries with knowledge of the classifier's constitution
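The first key point can be illustrated with a minimal sketch. This is not Anthropic's actual pipeline: the principle texts, templates, and the `build_training_data` helper are hypothetical placeholders standing in for the paper's LLM-driven data generation, showing only the core idea of pairing each constitutional principle with harmful and harmless synthetic examples.

```python
# Illustrative sketch: seed a harmfulness classifier's training set from a
# 'constitution' of natural-language principles. In the real system, a helper
# LLM would generate diverse prompts per principle; here simple templates keep
# the sketch self-contained.

CONSTITUTION = [
    "Do not provide instructions for synthesizing chemical weapons.",
    "Do not provide instructions for creating biological agents.",
]

# Placeholder templates standing in for LLM-generated variations.
HARMFUL_TEMPLATE = "Explain step by step how to violate this rule: {rule}"
BENIGN_TEMPLATE = "Summarize the public-health rationale behind this rule: {rule}"


def build_training_data(constitution):
    """Pair each principle with one harmful and one harmless example,
    labeled 1 (block) and 0 (allow) respectively."""
    examples = []
    for rule in constitution:
        examples.append((HARMFUL_TEMPLATE.format(rule=rule), 1))
        examples.append((BENIGN_TEMPLATE.format(rule=rule), 0))
    return examples


data = build_training_data(CONSTITUTION)
print(len(data))  # two labeled examples per principle
```

In practice the generated set would be far larger and more varied (paraphrases, languages, jailbreak styles), but the labeling structure — constitution in, labeled classifier data out — is the point being sketched.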
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI-Assisted Alignment | Approach | 63.0 |
| Constitutional AI | Approach | 70.0 |
Cached Content Preview
# Constitutional Classifiers: Defending against universal jailbreaks
Feb 3, 2025
_A [new paper](https://arxiv.org/abs/2501.18837) from the Anthropic Safeguards Research Team describes a method that defends AI models against universal jailbreaks. A prototype version of the method was robust to thousands of hours of human red teaming for universal jailbreaks, albeit with high over-refusal rates and compute overhead. An updated version achieved similar robustness on synthetic evaluations, and did so with a 0.38% increase in refusal rates and moderate additional compute costs._
Large language models have extensive safety training to prevent harmful outputs. For example, we train Claude to refuse to respond to user queries involving the production of biological or chemical weapons.
Nevertheless, models are still vulnerable to _jailbreaks_: inputs designed to bypass their safety guardrails and force them to produce harmful responses. Some jailbreaks flood the model with [very long prompts](https://www.anthropic.com/research/many-shot-jailbreaking); others modify the [style of the input](https://arxiv.org/abs/2412.03556), such as uSiNg uNuSuAl cApItALiZaTiOn. Historically, jailbreaks have proved difficult to detect and block: these kinds of attacks were [described over 10 years ago](https://arxiv.org/abs/1312.6199), yet to our knowledge there are still no fully robust deep-learning models in production.
We’re developing better jailbreak defenses so that we can safely deploy increasingly capable models in the future. Under our [Responsible Scaling Policy](https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy), we may deploy such models as long as we’re able to mitigate risks to acceptable levels through appropriate safeguards—but jailbreaking lets users bypass these safeguards. In particular, we’re hopeful that a system defended by Constitutional Classifiers could allow us to mitigate jailbreaking risks for models which have passed the CBRN capability threshold outlined in our Responsible Scaling Policy¹.
In [**our new paper**](https://arxiv.org/abs/2501.18837), we describe a system based on _Constitutional Classifiers_ that guards models against jailbreaks. These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead.
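The guarding arrangement described here can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the keyword-matching classifiers below are trivial stand-ins for the trained models, and `guarded_generate` is a hypothetical wrapper showing where the input and output classifiers sit relative to the underlying model.

```python
# Sketch of inference-time guarding with separate input and output
# classifiers. Real Constitutional Classifiers are models trained on
# constitution-derived synthetic data; the blocklist here is illustrative.

BLOCKLIST = ("nerve agent", "enriched uranium")  # placeholder terms


def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be blocked before generation."""
    return any(term in prompt.lower() for term in BLOCKLIST)


def output_classifier(completion: str) -> bool:
    """Return True if the model's output should be suppressed."""
    return any(term in completion.lower() for term in BLOCKLIST)


def guarded_generate(prompt: str, model) -> str:
    """Run the model only if the input passes, and filter its output."""
    if input_classifier(prompt):
        return "[refused: flagged by input classifier]"
    completion = model(prompt)
    if output_classifier(completion):
        return "[refused: flagged by output classifier]"
    return completion


# Usage with a stand-in model that just echoes its prompt:
echo_model = lambda p: f"Response to: {p}"
print(guarded_generate("How do I bake bread?", echo_model))
```

One design note from the paper: the real output classifier scores the completion as it streams, token by token, so generation can be halted mid-response; this sketch checks only the finished text.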
## Results from human red teaming
We ran two main categories of tests to assess the effectiveness of Constitutional Classifiers.
First, we developed a prototype version of the system to identify and block specific scientific knowledge related to chemical, biological, radiological, and nuclear harms. We then invited independent jailbreakers to a [bug-bounty program](https://www.anthropic.com/news/model-safety-bug-bounty) in which they were challenged to “red team” the system (i.e., to attempt to break it under experimental cond
... (truncated, 21 KB total)