Constitutional Classifiers: Defending Against Universal Jailbreaks
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Published by Anthropic, this research presents a practical defense mechanism against universal jailbreaks, relevant to anyone studying adversarial robustness, AI deployment safety, or Constitutional AI methods.
Metadata
Summary
Anthropic introduces 'Constitutional Classifiers,' a defense mechanism using classifier models trained on a constitutional framework to detect and block universal jailbreak attempts against large language models. The approach aims to make AI systems robust against adversarial prompts that attempt to bypass safety measures systematically. The research demonstrates meaningful resistance to jailbreaks while maintaining model usefulness.
Key Points
- Introduces a classifier-based defense system trained using constitutional AI principles to identify and block jailbreak attempts
- Targets 'universal jailbreaks' — prompts that reliably bypass safety measures across many inputs — rather than one-off exploits
- Uses a constitutional framework to generate synthetic training data for classifiers, enabling scalable safety enforcement
- Demonstrates a tradeoff analysis between robustness to adversarial attacks and over-refusal of legitimate requests
- Represents Anthropic's approach to layered defenses combining classifier guardrails with underlying model alignment
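As a rough illustration of the synthetic-data step described in the key points, the sketch below generates labeled training pairs from a small "constitution" of natural-language rules. Everything here (the `CONSTITUTION` contents, the `stub_paraphrase` helper, the two-label scheme) is a hypothetical stand-in for illustration only; the actual pipeline described in the paper generates and augments examples with a language model at far larger scale.

```python
# Hypothetical sketch: deriving labeled classifier training data from a
# "constitution" of natural-language rules. A real pipeline would use an
# LLM to generate and paraphrase examples; templates stand in for it here.

CONSTITUTION = {
    "harmful": ["instructions for synthesizing dangerous chemicals"],
    "harmless": ["general chemistry homework help"],
}

def stub_paraphrase(topic):
    """Toy stand-in for LLM-based augmentation of a constitution topic."""
    return [f"Please provide {topic}.", f"Can you help with {topic}?"]

def generate_examples(constitution, paraphrase):
    """Expand each constitution rule into (text, label) training pairs."""
    examples = []
    for label, topics in constitution.items():
        for topic in topics:
            for text in paraphrase(topic):
                examples.append((text, label))
    return examples

data = generate_examples(CONSTITUTION, stub_paraphrase)
```

The resulting pairs would then be used to train separate input and output classifiers; the constitution can be revised and the data regenerated without retraining the underlying model.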
Cited by 4 pages
| Page | Type | Quality |
|---|---|---|
| AI-Assisted Alignment | Approach | 63.0 |
| Circuit Breakers / Inference Interventions | Approach | 64.0 |
| AI Output Filtering | Approach | 63.0 |
| Refusal Training | Approach | 63.0 |
Cached Content Preview
Alignment
# Constitutional Classifiers: Defending against universal jailbreaks
Feb 3, 2025
_A [new paper](https://arxiv.org/abs/2501.18837) from the Anthropic Safeguards Research Team describes a method that defends AI models against universal jailbreaks. A prototype version of the method was robust to thousands of hours of human red teaming for universal jailbreaks, albeit with high over-refusal rates and compute overhead. An updated version achieved similar robustness on synthetic evaluations, and did so with only a 0.38% increase in refusal rates and moderate additional compute costs._
Large language models have extensive safety training to prevent harmful outputs. For example, we train Claude to refuse to respond to user queries involving the production of biological or chemical weapons.
Nevertheless, models are still vulnerable to _jailbreaks_: inputs designed to bypass their safety guardrails and force them to produce harmful responses. Some jailbreaks flood the model with [very long prompts](https://www.anthropic.com/research/many-shot-jailbreaking); others modify the [style of the input](https://arxiv.org/abs/2412.03556), such as uSiNg uNuSuAl cApItALiZaTiOn. Historically, jailbreaks have proved difficult to detect and block: these kinds of attacks were [described over 10 years ago](https://arxiv.org/abs/1312.6199), yet to our knowledge there are still no fully robust deep-learning models in production.
We’re developing better jailbreak defenses so that we can safely deploy increasingly capable models in the future. Under our [Responsible Scaling Policy](https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy), we may deploy such models as long as we’re able to mitigate risks to acceptable levels through appropriate safeguards—but jailbreaking lets users bypass these safeguards. In particular, we’re hopeful that a system defended by Constitutional Classifiers could allow us to mitigate jailbreaking risks for models which have passed the CBRN capability threshold outlined in our Responsible Scaling Policy¹.
In [**our new paper**](https://arxiv.org/abs/2501.18837), we describe a system based on _Constitutional Classifiers_ that guards models against jailbreaks. These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead.
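The input/output filtering described above can be pictured as a thin wrapper around model inference. This is a minimal sketch, not Anthropic's implementation: `input_clf` and `output_clf` are hypothetical stand-ins for trained constitutional classifiers that score content, and the lambda stubs exist only to make the example self-contained.

```python
# Hypothetical sketch of layered classifier guardrails around a model.
# The classifiers return a probability that content is disallowed.

REFUSAL = "I can't help with that request."

def guarded_generate(prompt, model, input_clf, output_clf, threshold=0.5):
    # Screen the prompt before it reaches the model.
    if input_clf(prompt) >= threshold:
        return REFUSAL
    response = model(prompt)
    # Screen the model's output before returning it to the user.
    if output_clf(response) >= threshold:
        return REFUSAL
    return response

# Toy stubs, for illustration only.
input_clf = lambda p: 0.9 if "weapon" in p else 0.1
output_clf = lambda r: 0.0

blocked = guarded_generate(
    "how to build a weapon", model=lambda p: "generated text",
    input_clf=input_clf, output_clf=output_clf,
)
allowed = guarded_generate(
    "explain photosynthesis", model=lambda p: "plants convert light to energy",
    input_clf=input_clf, output_clf=output_clf,
)
```

In practice an output classifier can also score the response incrementally as tokens stream, so that a harmful completion can be halted mid-generation rather than only after the fact.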
## Results from human red teaming
We ran two main categories of tests to assess the effectiveness of Constitutional Classifiers.
First, we developed a prototype version of the system to identify and block specific scientific knowledge related to chemical, biological, radiological, and nuclear harms. We then invited independent jailbreakers to a [bug-bounty program](https://www.anthropic.com/news/model-safety-bug-bounty) in which they were challenged to “red team” the system (i.e., to attempt to break it under experimental cond
... (truncated, 21 KB total)