Next-generation Constitutional Classifiers (https://anthropic.com/research/next-generation-constitutional-classifiers)
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Anthropic's research on constitutional classifiers is directly relevant to practical AI safety deployment, showing how rule-based content filtering can be made more robust against adversarial misuse at inference time.
Metadata
Summary
Anthropic presents an updated approach to constitutional classifiers—automated systems that use a set of principles (a 'constitution') to train AI models to detect and refuse harmful content. The research details improvements in robustness, scalability, and resistance to adversarial jailbreaks compared to earlier classifier generations. It represents a key component of Anthropic's layered defense strategy against misuse of frontier AI models.
Key Points
- Constitutional classifiers use a defined set of principles to guide automated content moderation and safety filtering in AI systems.
- Next-generation improvements focus on better generalization, reduced false positives, and stronger resistance to adversarial prompting and jailbreak attempts.
- The approach scales constitutional AI principles from model training to runtime inference-time safety enforcement.
- These classifiers form part of Anthropic's multi-layered safety infrastructure designed to prevent catastrophic misuse such as CBRN weapons assistance.
- The research demonstrates that classifier robustness can be improved through iterative red-teaming and constitutional refinement.
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
| AI-Assisted Alignment | Approach | 63.0 |
Cached Content Preview
Alignment
# Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
Jan 9, 2026
[Read the paper](https://arxiv.org/abs/2601.04603)

Large language models remain vulnerable to jailbreaks—techniques that can circumvent safety guardrails and elicit harmful information. Over time, we’ve implemented a variety of protections that have made our models much less likely to assist with dangerous user queries—in particular relating to the production of chemical, biological, radiological, or nuclear weapons (CBRN). Nevertheless, no AI systems currently on the market have perfectly robust defenses.
Last year, we described a new approach to defend against jailbreaks which we called “[Constitutional Classifiers](https://www.anthropic.com/research/constitutional-classifiers)”: safeguards that monitor model inputs and outputs to detect and block potentially harmful content. The novel aspect of the approach was that the classifiers were trained on synthetic data generated from a “constitution,” which included natural language rules specifying what’s allowed and what isn’t. For example, Claude should help with college chemistry homework, but not assist in the synthesis of Schedule 1 chemicals.
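As a loose illustration of the idea (not Anthropic's actual pipeline), a constitution can be thought of as natural-language rules paired with allowed/disallowed labels, expanded into labeled synthetic training examples for the classifiers. Every name, rule string, and template below is a hypothetical sketch:

```python
# Hypothetical sketch: turning a natural-language "constitution" into
# labeled synthetic training data. Rule text, helper names, and templates
# are illustrative assumptions, not Anthropic's implementation.

CONSTITUTION = [
    {"rule": "Help with college chemistry homework.", "label": "allowed"},
    {"rule": "Assist in synthesizing Schedule 1 chemicals.", "label": "disallowed"},
]

def synthesize_examples(constitution, paraphrases_per_rule=2):
    """Expand each rule into (prompt, label) pairs.

    In practice an LLM would generate diverse paraphrases and edge cases;
    here we just fill simple templates for illustration.
    """
    templates = ["{rule}", "A user asks you to: {rule}"]
    examples = []
    for entry in constitution:
        for template in templates[:paraphrases_per_rule]:
            examples.append((template.format(rule=entry["rule"]), entry["label"]))
    return examples

dataset = synthesize_examples(CONSTITUTION)
# 2 rules x 2 templates -> 4 labeled (prompt, label) training pairs
```

The real system's value comes from the scale and diversity of the LLM-generated data; this sketch only shows the shape of the constitution-to-dataset mapping.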
Constitutional Classifiers worked quite well. Compared to an unguarded model, the first generation of the classifiers reduced the jailbreak success rate from 86% to 4.4%—that is, they blocked 95% of attacks that might otherwise bypass Claude’s built-in safety training. We were particularly interested in whether the classifiers could prevent universal jailbreaks—consistent attack strategies that work across many queries—since these pose the greatest risk of enabling real-world harm. They came close: we ran a bug bounty program challenging people to break the system, in which one universal jailbreak was found.
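The 95% figure follows directly from the two reported success rates: of the attacks that succeed against an unguarded model, the fraction the classifiers block is 1 − 4.4/86:

```python
# Checking the reported reduction: jailbreak success drops from 86% to 4.4%.
unguarded = 0.86   # success rate without classifiers
guarded = 0.044    # success rate with first-generation classifiers

# Fraction of previously-successful attacks that are now blocked
blocked = 1 - guarded / unguarded
print(round(blocked * 100, 1))  # -> 94.9, i.e. roughly 95%
```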
While effective, those classifiers came with tradeoffs: they increased compute costs by 23.7%, making the models more expensive to use, and also led to a 0.38% increase in refusal rates on harmless queries (that is, it made Claude somewhat more likely to refuse to answer perfectly benign questions, increasing frustration for the user).
We’ve now developed the next generation, Constitutional Classifiers++, and described them in a [new paper](https://arxiv.org/abs/2601.04603). They improve on the previous approach, yielding a system that is even more robust, has a much lower refusal rate, and—at just ~1% additional compute cost—is dramatically cheaper to run.
We iterated on many different approaches, ultimately landing on an ensemble system. The core innovation is a two-stage architecture: a probe that looks at Claude’s internal activations (and which is very cheap to run) screens all traffic. If it identifies a s
... (truncated, 11 KB total)
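The preview breaks off while describing the second stage, but the cascade structure it outlines (a cheap probe over internal activations screening all traffic, with a heavier classifier reserved for flagged cases) can be sketched as follows. The function names, the toy probe, and the threshold are all hypothetical stand-ins:

```python
# Hypothetical two-stage cascade sketch. The probe, classifier, and
# threshold here are illustrative assumptions, not Anthropic's system.

def cheap_probe(activations):
    """Stand-in for an inexpensive probe on internal activations.

    Returns a suspicion score in [0, 1]; here just a toy heuristic.
    """
    return min(1.0, sum(a * a for a in activations) / len(activations))

def heavy_classifier(text):
    """Stand-in for the expensive classifier, run only when flagged."""
    return "block" if "synthesis" in text.lower() else "allow"

def screen(text, activations, probe_threshold=0.5):
    # Stage 1: the cheap probe runs on all traffic.
    if cheap_probe(activations) < probe_threshold:
        return "allow"
    # Stage 2: the heavy classifier runs only on the flagged minority,
    # which is how a cascade keeps average compute overhead low.
    return heavy_classifier(text)
```

Because the expensive stage only runs on the small fraction of traffic the probe flags, average added compute stays near the probe's cost, consistent with the ~1% overhead the article reports.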