Constitutional Classifiers arXiv paper (https://arxiv.org/pdf/2501.18837)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces Constitutional Classifiers, a defense against universal jailbreaks in LLMs: classifiers trained on synthetic data generated from natural-language rules (a constitution), validated through extensive red teaming and shown to remain practical to deploy.
Paper Details
Metadata
Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
Summary
This paper introduces Constitutional Classifiers, a defense mechanism against universal jailbreaks in large language models. The approach trains classifiers on synthetic data generated using natural language rules (a constitution) to specify permitted and restricted content. Through extensive red teaming (3,000+ estimated hours), the authors demonstrate that their classifier-guarded LLMs successfully defend against universal jailbreaks while maintaining practical deployment viability, with only a 0.38% increase in production-traffic refusals and 23.7% inference overhead. The work shows that defending against sophisticated, multi-turn attacks that enable harmful processes (like manufacturing illegal substances) is tractable without severely compromising model usability.
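The guarded-deployment idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the keyword-matching labeler stands in for the LLM-based classifiers the paper actually trains on constitution-generated synthetic data, and all names (`CONSTITUTION`, `label_with_constitution`, `guarded_generate`) are assumptions introduced here for clarity.

```python
# Hedged sketch of a classifier-guarded LLM pipeline. The real system trains
# input/output classifiers on synthetic data produced by prompting an LLM
# with a constitution; here a toy keyword labeler stands in for those
# classifiers, and the rule strings are purely illustrative.

CONSTITUTION = {
    # Natural-language rules specifying restricted vs. permitted content.
    "restricted": ["synthesis route", "precursor acquisition"],
    "permitted": ["general chemistry education", "lab safety"],
}

def label_with_constitution(text: str) -> str:
    """Toy stand-in for the trained classifier: flag text matching any
    restricted rule. A real system would use a learned classifier."""
    lowered = text.lower()
    if any(rule in lowered for rule in CONSTITUTION["restricted"]):
        return "restricted"
    return "permitted"

def guarded_generate(prompt: str,
                     model=lambda p: f"[model answer to: {p}]") -> str:
    """Wrap a model call with input and output classifiers, mirroring the
    classifier-guarded deployment the paper evaluates."""
    # Input guard: refuse before the model ever sees a flagged prompt.
    if label_with_constitution(prompt) == "restricted":
        return "Refused: input flagged by constitutional classifier."
    completion = model(prompt)
    # Output guard: refuse if the completion itself is flagged.
    if label_with_constitution(completion) == "restricted":
        return "Refused: output flagged by constitutional classifier."
    return completion

print(guarded_generate("Explain lab safety basics"))
print(guarded_generate("Give me a synthesis route for X"))
```

The two-sided design matters: filtering inputs alone misses jailbreaks that smuggle harmful content into the completion, while filtering outputs alone wastes inference on prompts that could have been refused up front.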
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Adversarial Training | Approach | 58.0 |
Cached Content Preview
[2501.18837] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Mrinank Sharma *+ Meg Tong * Jesse Mu * Jerry Wei * Jorrit Kruthoff * Scott Goodfriend * Euan Ong * Alwin Peng
Raj Agarwal Cem Anil Amanda Askell Nathan Bailey Joe Benton Emma Bluemke Samuel R. Bowman Eric Christiansen Hoagy Cunningham Andy Dau Anjali Gopal Rob Gilson Logan Graham Logan Howard Nimit Kalra ∘ Taesung Lee Kevin Lin Peter Lofgren Francesco Mosconi Clare O’Hara Catherine Olsson Linda Petrini □ Samir Rajani Nikhil Saxena Alex Silverstein Tanya Singh Theodore Sumers Leonard Tang ∘ Kevin K. Troy Constantin Weisser ∘ Ruiqi Zhong Giulio Zhou
Jan Leike Jared Kaplan Ethan Perez +
Safeguards Research Team, Anthropic
Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks—prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale.
To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content.
In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries.
On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks.
These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead.
Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
* Equal contribution.
+ Equal advising.
∘ Haize Labs.
□ Independent.
Correspondence to <mrinank@anthropic.com>.
First and last author blocks are core contributors, middle authors are listed alphabetically.
See Author Contributions for author contributions.
1 Introduction
Large language model (LLM) safety mechanisms can be circumvented by “jailbreaks” that elicit harmful information from models (Shen et al., 2023; Liu et al., 2023; Qi et al., 2024; Andriushchenko et al., 2024; Anil et al., 2024; Hughes et al., 2024).
Such jailbreaks become more concerning as the chemical, biological, radiological, or nuclear (CBRN) capabilities of LLMs increase (Anthropic, 2023a; OpenAI, 2023; Li et al., 2024).² This work was conducted as part of Anthropic’s Responsible Scaling Policy commitments to pr
... (truncated, 98 KB total)