Constitutional Classifiers arXiv paper (https://arxiv.org/pdf/2501.18837)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces Constitutional Classifiers, a defense against universal jailbreaks in LLMs: classifiers trained on synthetic data generated from natural-language rules (a constitution), validated through extensive red teaming and shown to remain practical to deploy.
Paper Details
Metadata
Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
Summary
This paper introduces Constitutional Classifiers, a defense mechanism against universal jailbreaks in large language models. The approach trains classifiers on synthetic data generated using natural language rules (a constitution) to specify permitted and restricted content. Through extensive red teaming (3,000+ estimated hours), the authors demonstrate that their classifier-guarded LLMs successfully defend against universal jailbreaks while maintaining practical deployment viability, with only a 0.38% increase in production-traffic refusals and 23.7% inference overhead. The work shows that defending against sophisticated, multi-turn attacks that enable harmful processes (like manufacturing illegal substances) is tractable without severely compromising model usability.
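The guarded-deployment idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the keyword-matching labeler stands in for the LLM-based classifiers the paper actually trains on constitution-generated synthetic data, and all names (`CONSTITUTION`, `label_with_constitution`, `guarded_generate`) are assumptions introduced here for clarity.

```python
# Hedged sketch of a classifier-guarded LLM pipeline. The real system trains
# input/output classifiers on synthetic data produced by prompting an LLM
# with a constitution; here a toy keyword labeler stands in for those
# classifiers, and the rule strings are purely illustrative.

CONSTITUTION = {
    # Natural-language rules specifying restricted vs. permitted content.
    "restricted": ["synthesis route", "precursor acquisition"],
    "permitted": ["general chemistry education", "lab safety"],
}

def label_with_constitution(text: str) -> str:
    """Toy stand-in for the trained classifier: flag text matching any
    restricted rule. A real system would use a learned classifier."""
    lowered = text.lower()
    if any(rule in lowered for rule in CONSTITUTION["restricted"]):
        return "restricted"
    return "permitted"

def guarded_generate(prompt: str,
                     model=lambda p: f"[model answer to: {p}]") -> str:
    """Wrap a model call with input and output classifiers, mirroring the
    classifier-guarded deployment the paper evaluates."""
    # Input guard: refuse before the model ever sees a flagged prompt.
    if label_with_constitution(prompt) == "restricted":
        return "Refused: input flagged by constitutional classifier."
    completion = model(prompt)
    # Output guard: refuse if the completion itself is flagged.
    if label_with_constitution(completion) == "restricted":
        return "Refused: output flagged by constitutional classifier."
    return completion

print(guarded_generate("Explain lab safety basics"))
print(guarded_generate("Give me a synthesis route for X"))
```

The two-sided design matters: filtering inputs alone misses jailbreaks that smuggle harmful content into the completion, while filtering outputs alone wastes inference on prompts that could have been refused up front.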
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Adversarial Training | Approach | 58.0 |
Cached Content Preview
[2501.18837] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Mrinank Sharma *+ Meg Tong * Jesse Mu * Jerry Wei * Jorrit Kruthoff * Scott Goodfriend * Euan Ong * Alwin Peng
Raj Agarwal Cem Anil Amanda Askell Nathan Bailey Joe Benton Emma Bluemke Samuel R. Bowman Eric Christiansen Hoagy Cunningham Andy Dau Anjali Gopal Rob Gilson Logan Graham Logan Howard Nimit Kalra ∘ Taesung Lee Kevin Lin Peter Lofgren Francesco Mosconi Clare O’Hara Catherine Olsson Linda Petrini □ Samir Rajani Nikhil Saxena Alex Silverstein Tanya Singh Theodore Sumers Leonard Tang ∘ Kevin K. Troy Constantin Weisser ∘ Ruiqi Zhong Giulio Zhou
Jan Leike Jared Kaplan Ethan Perez +
Safeguards Research Team, Anthropic
Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks—prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale.
To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content.
In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries.
On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks.
These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead.
Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
* Equal contribution.
+ Equal advising.
∘ Haize Labs.
□ Independent.
Correspondence to <mrinank@anthropic.com>.
First and last author blocks are core contributors, middle authors are listed alphabetically.
See Author Contributions for author contributions.
1 Introduction
Large language model (LLM) safety mechanisms can be circumvented by “jailbreaks” that elicit harmful information from models (Shen et al., 2023; Liu et al., 2023; Qi et al., 2024; Andriushchenko et al., 2024; Anil et al., 2024; Hughes et al., 2024).
Such jailbreaks become more concerning as the chemical, biological, radiological, or nuclear (CBRN) capabilities of LLMs increase (Anthropic, 2023a; OpenAI, 2023; Li et al., 2024).² This work was conducted as part of Anthropic’s Responsible Scaling Policy commitments to pr
... (truncated, 98 KB total)