Constitutional Classifiers
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Published by Anthropic, this work extends Constitutional AI principles to inference-time safety classifiers, offering a practical defense mechanism against jailbreak attempts relevant to deployed AI safety research.
Metadata
Summary
Anthropic introduces Constitutional Classifiers, a system that uses constitutional principles to train input/output classifiers that defend against universal jailbreaks attempting to extract harmful information. The approach demonstrates strong robustness against automated and human red-teaming efforts while maintaining low false positive rates, representing a practical safety layer for deployed AI systems.
Key Points
- Constitutional Classifiers use a set of natural language principles (a 'constitution') to synthetically generate training data for safety classifiers
- The system defends against universal jailbreaks—attacks designed to bypass safety measures across a wide range of harmful prompts
- Red-team evaluations found no successful universal jailbreaks, while false positive rates on benign prompts remained low
- The approach extends Anthropic's Constitutional AI methodology from training to inference-time classification
- Classifiers are designed to be robust even against adversaries with knowledge of the classifier's constitution
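The first key point can be illustrated with a minimal sketch. This is not Anthropic's actual pipeline: the principle texts, templates, and the `build_training_data` helper are hypothetical placeholders standing in for the paper's LLM-driven data generation, showing only the core idea of pairing each constitutional principle with harmful and harmless synthetic examples.

```python
# Illustrative sketch: seed a harmfulness classifier's training set from a
# 'constitution' of natural-language principles. In the real system, a helper
# LLM would generate diverse prompts per principle; here simple templates keep
# the sketch self-contained.

CONSTITUTION = [
    "Do not provide instructions for synthesizing chemical weapons.",
    "Do not provide instructions for creating biological agents.",
]

# Placeholder templates standing in for LLM-generated variations.
HARMFUL_TEMPLATE = "Explain step by step how to violate this rule: {rule}"
BENIGN_TEMPLATE = "Summarize the public-health rationale behind this rule: {rule}"


def build_training_data(constitution):
    """Pair each principle with one harmful and one harmless example,
    labeled 1 (block) and 0 (allow) respectively."""
    examples = []
    for rule in constitution:
        examples.append((HARMFUL_TEMPLATE.format(rule=rule), 1))
        examples.append((BENIGN_TEMPLATE.format(rule=rule), 0))
    return examples


data = build_training_data(CONSTITUTION)
print(len(data))  # two labeled examples per principle
```

In practice the generated set would be far larger and more varied (paraphrases, languages, jailbreak styles), but the labeling structure — constitution in, labeled classifier data out — is the point being sketched.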
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI-Assisted Alignment | Approach | 63.0 |
| Constitutional AI | Approach | 70.0 |
Cached Content Preview
# Constitutional Classifiers: Defending against universal jailbreaks
Feb 3, 2025
_A [new paper](https://arxiv.org/abs/2501.18837) from the Anthropic Safeguards Research Team describes a method that defends AI models against universal jailbreaks. A prototype version of the method was robust to thousands of hours of human red teaming for universal jailbreaks, albeit with high over-refusal rates and compute overhead. An updated version achieved similar robustness on synthetic evaluations, and did so with a 0.38% increase in refusal rates and moderate additional compute costs._
Large language models have extensive safety training to prevent harmful outputs. For example, we train Claude to refuse to respond to user queries involving the production of biological or chemical weapons.
Nevertheless, models are still vulnerable to _jailbreaks_: inputs designed to bypass their safety guardrails and force them to produce harmful responses. Some jailbreaks flood the model with [very long prompts](https://www.anthropic.com/research/many-shot-jailbreaking); others modify the [style of the input](https://arxiv.org/abs/2412.03556), such as uSiNg uNuSuAl cApItALiZaTiOn. Historically, jailbreaks have proved difficult to detect and block: these kinds of attacks were [described over 10 years ago](https://arxiv.org/abs/1312.6199), yet to our knowledge there are still no fully robust deep-learning models in production.
We’re developing better jailbreak defenses so that we can safely deploy increasingly capable models in the future. Under our [Responsible Scaling Policy](https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy), we may deploy such models as long as we’re able to mitigate risks to acceptable levels through appropriate safeguards—but jailbreaking lets users bypass these safeguards. In particular, we’re hopeful that a system defended by Constitutional Classifiers could allow us to mitigate jailbreaking risks for models which have passed the CBRN capability threshold outlined in our Responsible Scaling Policy¹.
In [**our new paper**](https://arxiv.org/abs/2501.18837), we describe a system based on _Constitutional Classifiers_ that guards models against jailbreaks. These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead.
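The guarding arrangement described here can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the keyword-matching classifiers below are trivial stand-ins for the trained models, and `guarded_generate` is a hypothetical wrapper showing where the input and output classifiers sit relative to the underlying model.

```python
# Sketch of inference-time guarding with separate input and output
# classifiers. Real Constitutional Classifiers are models trained on
# constitution-derived synthetic data; the blocklist here is illustrative.

BLOCKLIST = ("nerve agent", "enriched uranium")  # placeholder terms


def input_classifier(prompt: str) -> bool:
    """Return True if the prompt should be blocked before generation."""
    return any(term in prompt.lower() for term in BLOCKLIST)


def output_classifier(completion: str) -> bool:
    """Return True if the model's output should be suppressed."""
    return any(term in completion.lower() for term in BLOCKLIST)


def guarded_generate(prompt: str, model) -> str:
    """Run the model only if the input passes, and filter its output."""
    if input_classifier(prompt):
        return "[refused: flagged by input classifier]"
    completion = model(prompt)
    if output_classifier(completion):
        return "[refused: flagged by output classifier]"
    return completion


# Usage with a stand-in model that just echoes its prompt:
echo_model = lambda p: f"Response to: {p}"
print(guarded_generate("How do I bake bread?", echo_model))
```

One design note from the paper: the real output classifier scores the completion as it streams, token by token, so generation can be halted mid-response; this sketch checks only the finished text.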
## Results from human red teaming
We ran two main categories of tests to assess the effectiveness of Constitutional Classifiers.
First, we developed a prototype version of the system to identify and block specific scientific knowledge related to chemical, biological, radiological, and nuclear harms. We then invited independent jailbreakers to a [bug-bounty program](https://www.anthropic.com/news/model-safety-bug-bounty) in which they were challenged to “red team” the system (i.e., to attempt to break it under experimental cond
... (truncated, 21 KB total)