Circuit Breakers / Inference Interventions
Circuit breakers are runtime safety interventions that detect and halt harmful AI outputs during inference. Gray Swan's representation rerouting achieves 87-90% rejection rates with 1% capability loss, while Anthropic's Constitutional Classifiers block 95.6% of jailbreaks with 0.38% over-refusal increase. However, the UK AISI challenge found all 22 tested models eventually broken (62K/1.8M attempts succeeded), and novel token-forcing attacks achieve 25% success rates, highlighting fundamental limitations of reactive defenses.
Overview
Circuit breakers represent a class of runtime interventions that can detect and stop harmful model behavior during inference, before outputs reach users or actions are executed. Unlike output filtering, which operates on completed outputs, circuit breakers can intervene mid-generation, potentially stopping harm earlier in the process. The approach spans monitoring activation patterns, detecting emerging harmful content, and intervening when dangerous patterns appear.
The approach draws inspiration from electrical circuit breakers that automatically interrupt dangerous current flows, and from software systems that halt operations when safety invariants are violated. For AI systems, circuit breakers can detect when a model is generating content that violates safety policies, when activation patterns suggest deceptive or manipulative intent, or when the system is attempting unauthorized actions.
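To make the control flow concrete, here is a minimal per-token breaker loop. It is a sketch, not any library's actual API: `model.encode`/`model.step` stand in for a decoding loop, and `detector.harm_score` stands in for any of the detection mechanisms discussed later in this page.

```python
def generate_with_breaker(model, detector, prompt: str,
                          threshold: float = 0.8, max_tokens: int = 256) -> str:
    """Hypothetical per-token circuit breaker loop (illustrative only)."""
    tokens: list[str] = []
    state = model.encode(prompt)
    for _ in range(max_tokens):
        token, state = model.step(state)        # one decoding step
        if detector.harm_score(state) > threshold:
            # Hard stop: halt before the offending token is delivered
            return "".join(tokens) + " [halted by safety system]"
        tokens.append(token)
    return "".join(tokens)
```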
Research organizations like Gray Swan AI have developed circuit breaker techniques that can reduce harmful outputs by modifying model behavior at inference time. Their 2024 paper "Improving Alignment and Robustness with Circuit Breakers" demonstrated that representation rerouting can reject harmful requests 87-90% of the time while preserving model capabilities. However, the approach faces fundamental limitations: it remains reactive rather than proactive, sophisticated models could potentially generate harm faster than circuit breakers can respond, and determined adversaries may find ways to trigger harmful outputs that evade detection. The UK AISI × Gray Swan Agent Red-Teaming Challenge (March-April 2025) tested 22 different LLMs with 1.8 million attack attempts, finding 62,000 successful breaks—demonstrating that no current frontier system fully resists determined, well-resourced attacks. Circuit breakers are a valuable last line of defense but should not substitute for addressing underlying alignment issues.
Risk Assessment & Impact
| Dimension | Rating | Assessment |
|---|---|---|
| Safety Uplift | Medium | Can prevent harmful outputs in real-time; reactive defense |
| Capability Uplift | Tax | Interventions may reduce fluency or capability |
| Net World Safety | Helpful | Valuable last line of defense; doesn't address root causes |
| Lab Incentive | Moderate | Practical for deployment safety; some product impact |
| Scalability | Partial | Works at scale; but sophisticated attacks may evade |
| Deception Robustness | Weak | Deceptive model could generate harm before circuit breaks |
| SI Readiness | No | SI could reason around or disable circuit breakers |
Research Investment
- Current Investment: $10-30M/yr (Gray Swan, Anthropic, various labs)
- Recommendation: Increase (practical near-term intervention; needs more research)
- Differential Progress: Safety-leaning (primarily safety; some reliability benefits)
Comparison of Circuit Breaker Approaches
Different circuit breaker implementations offer varying tradeoffs between safety effectiveness, capability preservation, and computational cost. The following table compares major approaches based on published research and evaluations.
| Approach | Mechanism | Jailbreak Rejection Rate | Capability Impact | Compute Overhead | Limitations |
|---|---|---|---|---|---|
| Representation Rerouting (RR) | Redirects harmful internal representations to orthogonal space | 87-90% | ≈1% capability loss | Low (≈5%) | Vulnerable to novel token-forcing attacks |
| Constitutional Classifiers | Input/output filters trained on constitutional principles | 95.6% (from 86% baseline) | 0.38% increased refusal | 1-24% (improved over time) | No universal jailbreaks found but specific attacks possible |
| Refusal Training (RLHF) | Train model to refuse harmful requests directly | 40-70% (varies widely) | Can reduce helpfulness | None at inference | Highly vulnerable to adversarial attacks |
| Adversarial Training | Train against known attack patterns | 60-80% on trained attacks | Minor | High during training | Poor generalization to novel attacks |
| Activation Clamping | Modify activations when harmful patterns detected | 70-85% | 5-15% capability loss | Medium | Requires interpretability research |
| Output Filtering | Post-generation content moderation | 50-70% | None | Medium | Can be bypassed with encoded content |
Sources: Gray Swan Circuit Breakers Paper, Anthropic Constitutional Classifiers, HarmBench
Quantified Effectiveness Against Attack Types
The following table shows measured attack success rates (lower is better for defense) across different defense methods when tested against standardized attack benchmarks.
| Attack Type | No Defense | Refusal Training | Circuit Breakers (RR) | Constitutional Classifiers | Combined Defense |
|---|---|---|---|---|---|
| Direct Harmful Requests | 95% ASR | 15-30% ASR | 10-13% ASR | 4.4% ASR | 2-5% ASR |
| GCG (Gradient-based) | 90% ASR | 60-80% ASR | 8-12% ASR | 5% ASR | 3-8% ASR |
| PAIR (LLM Optimizer) | 85% ASR | 40-60% ASR | 10-15% ASR | 6% ASR | 4-10% ASR |
| AutoDAN | 80% ASR | 50-70% ASR | 12-18% ASR | 7% ASR | 5-12% ASR |
| Human Jailbreaks | 75% ASR | 35-50% ASR | 15-20% ASR | 8% ASR | 6-15% ASR |
| Novel Token-Forcing | 90% ASR | 70-85% ASR | 25-40% ASR | Unknown | 15-30% ASR |
ASR = Attack Success Rate. Sources: HarmBench evaluations, Breaking Circuit Breakers, Constitutional Classifiers paper
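For reference, ASR is simply the fraction of attack attempts that elicit the targeted harmful behavior. A one-line helper makes the metric in these tables unambiguous:

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of attack attempts that succeeded (lower is better for defense)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# UK AISI x Gray Swan challenge: 62,000 successes out of 1.8M attempts
print(f"{62_000 / 1_800_000:.2%}")  # ~3.44% overall ASR
```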
How Circuit Breakers Work
Circuit breakers operate at inference time, monitoring and potentially intervening during model generation:
Intervention Types
| Type | Mechanism | Use Case | Tradeoff |
|---|---|---|---|
| Hard Stop | Halt generation immediately | Clear policy violation | May truncate mid-sentence |
| Soft Redirect | Steer generation away from harm | Emerging concern | May produce awkward output |
| Activation Clamping | Modify internal activations | Representation-level intervention | Requires interpretability |
| Token Blocking | Prevent specific token generation | Known harmful patterns | Easily circumvented |
| Probability Shifting | Reduce likelihood of harmful continuations | Subtle steering | May affect quality |
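Token blocking and probability shifting are simple enough to show directly: both are edits to the next-token logits before sampling. This sketch assumes raw PyTorch logits; the token ID lists are placeholders for patterns a real deployment would maintain.

```python
import torch

def intervene_on_logits(logits: torch.Tensor,
                        blocked_ids: list[int],
                        penalized_ids: list[int],
                        penalty: float = 4.0) -> torch.Tensor:
    """Token blocking (hard) and probability shifting (soft) on next-token logits."""
    out = logits.clone()
    out[blocked_ids] = float("-inf")   # token blocking: probability becomes 0
    out[penalized_ids] -= penalty      # probability shifting: downweight, not ban
    return out

# Usage with a [vocab_size] logits vector:
# safe_logits = intervene_on_logits(logits, blocked_ids=[42], penalized_ids=[7, 99])
```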
Detection Mechanisms
| Mechanism | What It Detects | Speed | Accuracy |
|---|---|---|---|
| Token-level scanning | Harmful words/phrases | Very fast | Low (easy to bypass) |
| Sequence classification | Harmful content patterns | Fast | Medium |
| Activation analysis | Internal state patterns | Medium | Higher potential |
| Semantic analysis | Meaning/intent of content | Slower | Higher accuracy |
| Behavioral pattern matching | Multi-step harmful sequences | Slowest | Context-dependent |
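As a concrete instance of activation analysis, a linear probe on a hidden layer's residual-stream activation is a common baseline. The hidden width and single-linear architecture below are illustrative assumptions, not a specific published detector.

```python
import torch
import torch.nn as nn

class ActivationProbe(nn.Module):
    """Linear probe scoring a hidden state for harmfulness."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: [batch, hidden_dim] activation at a chosen layer
        return torch.sigmoid(self.classifier(hidden_state)).squeeze(-1)

probe = ActivationProbe(hidden_dim=4096)   # e.g. Llama-7/8B residual width
score = probe(torch.randn(1, 4096))        # harm probability in [0, 1]
```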
Gray Swan's Circuit Breaker Research
Gray Swan AI, in collaboration with Carnegie Mellon University and the Center for AI Safety, has been a leader in circuit breaker research. Their June 2024 paper "Improving Alignment and Robustness with Circuit Breakers" (Zou et al.) introduced representation rerouting as a more robust alternative to refusal training.
Key Research Findings
| Technique | Approach | Quantified Result | Source |
|---|---|---|---|
| Representation Rerouting (RR) | Redirect harmful representations to orthogonal activation space | 87-90% harmful request rejection rate | arXiv:2406.04313 |
| Cygnet Model | Llama-3-8B-Instruct finetune with circuit breakers | ≈100x reduction in harmful outputs vs baseline | Gray Swan Research |
| Capability Preservation | Pareto-optimal safety/capability tradeoff | Only 1% dip in MT-Bench and MMLU scores | arXiv:2406.04313 |
| UK AISI Red-Teaming | Large-scale adversarial evaluation | 62K successful breaks across 22 models from 1.8M attempts | Gray Swan News |
Technical Approach: Representation Rerouting
The circuit breaker method operates through four key steps:
1. Identify harmful representations: Using contrastive activation pairs from harmful vs. safe prompts, identify the activation directions in the model's internal representation space that correspond to harmful outputs
2. Create intervention vectors: Develop orthogonal projection matrices that can reroute activations away from harmful regions while preserving the geometric structure needed for benign capabilities
3. Apply at inference: Monitor residual stream activations at key layers (typically layers 8-24 in Llama-scale models) and apply the rerouting transformation when harmful patterns are detected
4. Maintain capability: The orthogonal rerouting preserves distances and angles between non-harmful representations, enabling ~99% capability retention on benchmarks like MT-Bench and MMLU
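A toy version of the transformation in steps 2-3 might look as follows. Finding the harmful subspace (step 1) is the data-driven part this sketch omits, and the actual method in arXiv:2406.04313 trains the rerouting into the model's weights rather than applying it as a runtime projection.

```python
import torch

def reroute(hidden: torch.Tensor, harmful_basis: torch.Tensor,
            threshold: float = 0.5) -> torch.Tensor:
    """Project activations off a 'harmful' subspace when their component
    in it is large. `harmful_basis` is an orthonormal [k, d] matrix of
    harmful directions; `hidden` is a [batch, seq, d] residual-stream slice."""
    coords = hidden @ harmful_basis.T               # [batch, seq, k]
    magnitude = coords.norm(dim=-1, keepdim=True)   # detection signal
    projection = coords @ harmful_basis             # component in harmful subspace
    mask = (magnitude > threshold).float()          # intervene only when detected
    return hidden - mask * projection               # orthogonal rerouting
```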
Activation-Level Interventions
More sophisticated circuit breakers operate at the activation level:
Intervention Targets
| Target | Description | Advantage | Challenge |
|---|---|---|---|
| Residual stream | Main information flow | Direct impact | May disrupt coherence |
| Attention patterns | What model focuses on | Can redirect attention | Complex to interpret |
| MLP activations | Feature representations | Feature-level control | Requires interpretability |
| Layer outputs | Per-layer representations | Can catch early | Need to know which layers |
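In PyTorch, interventions on the residual stream or layer outputs can be attached with forward hooks. The sketch below clamps a single learned "harmful" direction; both the direction and the threshold are placeholder assumptions.

```python
import torch

def make_clamp_hook(direction: torch.Tensor, threshold: float = 2.0):
    """Forward hook that subtracts a 'harmful' direction from a layer's
    output whenever its component along it exceeds a threshold."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        coef = hidden @ d                         # [batch, seq] component along d
        mask = (coef.abs() > threshold).float()   # trigger only on strong signal
        clamped = hidden - (coef * mask).unsqueeze(-1) * d
        return (clamped,) + output[1:] if isinstance(output, tuple) else clamped

    return hook

# Usage on a transformer block `layer` (a hook's return value replaces its output):
# handle = layer.register_forward_hook(make_clamp_hook(direction))
# ... run generation ...; handle.remove()
```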
Limitations and Challenges
Research from Confirm Labs and other groups has identified significant weaknesses in current circuit breaker implementations. Understanding these limitations is essential for realistic assessment of the approach.
Fundamental Issues
| Limitation | Explanation | Quantified Impact | Mitigation |
|---|---|---|---|
| Reactive | Can only respond to detected patterns | N/A - architectural | Better detection, faster response |
| Speed constraints | Must be faster than generation | 2-10ms per token overhead | Hardware optimization, early-layer detection |
| False positives | May block legitimate content | 4% to 38.5% over-refusal increase on OR-Bench | Calibration, soft interventions, constitutional classifiers |
| Circumvention | Novel attacks evade detection | 25% ASR for novel token-forcing attacks | Continuous updating, layered defenses |
| Capability tax | Interventions may degrade quality | 1-15% capability loss depending on method | Minimal intervention approaches |
Sources: Breaking Circuit Breakers, OR-Bench evaluations
Why Circuit Breakers Fail Against Sophisticated Adversaries
| Attack Vector | How It Works | Why Breakers Fail | Measured Success Rate |
|---|---|---|---|
| Novel token-forcing | Use alternative completion patterns (e.g., "1. Choose the right..." instead of "Sure, here's how...") | Circuit breakers trained on specific harmful prefixes don't generalize | 25% ASR vs 2.5% for standard GCG |
| Distillation attacks | White-box attacks targeting internal activations directly | Attacks the representations the breaker monitors, succeeding even under fluency constraints on the attack | High (model-specific) |
| Gradual generation | Build harmful content slowly across many tokens | Each individual step appears safe to detectors | 40-60% ASR |
| Encoded content | Obfuscate harmful content with base64, pig latin, etc. | Semantic analysis required at representation level | 30-50% ASR |
| Instruction hierarchy violations | Harm embedded in seemingly-safe task instructions | Context-dependent; hard to distinguish from legitimate use | 35-45% ASR |
Note: Gray Swan's Cygnet-8B was jailbroken within three hours despite achieving 0.0 attack-success scores in initial red-teaming evaluations (Confirm Labs)
The Speed Problem
For circuit breakers to be effective, they must:
- Detect harmful patterns before harm occurs
- Intervene before output reaches user/system
- Not add prohibitive latency
| Scenario | Challenge |
|---|---|
| Streaming outputs | Partial output already delivered |
| Agentic actions | Action may be executed before detection |
| High throughput | Latency costs multiply at scale |
| Sophisticated attacks | May generate harm faster than detection |
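One common mitigation for the streaming case is to hold back a small window of tokens so the breaker can veto them before delivery, trading a little perceived latency for reaction time. A minimal sketch, assuming a `check` callable that scores a token window:

```python
from collections import deque

class BufferedStream:
    """Withhold the last `window` tokens so a breaker can veto them
    before they reach the user. Both `check` and the window size
    are illustrative."""
    def __init__(self, check, window: int = 8):
        self.check = check
        self.window = window
        self.buffer: deque[str] = deque()

    def push(self, token: str):
        """Feed one generated token; returns a cleared token or None."""
        self.buffer.append(token)
        if len(self.buffer) <= self.window:
            return None                       # still filling the window
        if not self.check(list(self.buffer)):
            self.buffer.clear()               # drop withheld tokens
            raise RuntimeError("circuit breaker tripped mid-stream")
        return self.buffer.popleft()          # oldest token is now safe to emit
```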
Key Cruxes
Crux 1: Are Circuit Breakers a Meaningful Safety Measure?
| Position: Yes (55% probability) | Position: Limited Value (45% probability) |
|---|---|
| Last line of defense catches 87-95% of known attacks | Reactive defense; determined adversaries find bypasses |
| UK AISI challenge: only 3.4% success rate (62K/1.8M) | Same challenge: every model eventually broken |
| Defense-in-depth reduces expected harm | May create false confidence in safety |
| Anthropic's 3,000 red-teamer hours found no universal jailbreak | Future techniques may discover vulnerabilities |
Key evidence that would update toward "Yes": Circuit breakers remaining robust against novel attack classes for 2+ years. Key evidence for "Limited Value": Automated jailbreak discovery that scales faster than defense updates.
Crux 2: Can Activation-Level Interventions Be Made Robust?
| Position: Promising (40% probability) | Position: Fundamental Limits (60% probability) |
|---|---|
| Representation rerouting achieves 87-90% rejection with only ≈1% capability loss | Theoretical results show quadratic helpfulness cost for guaranteed alignment |
| Constitutional classifiers + RR combined may exceed 95% effectiveness | Arms race dynamics: Cygnet-8B broken in 3 hours despite strong initial results |
| Interpretability research advancing rapidly | Deceptive models could potentially learn to hide harmful representations |
| Works on multimodal models | 25% ASR for novel token-forcing shows generalization limits |
Key evidence that would update toward "Promising": Formal verification methods for representation-level safety. Key evidence for "Fundamental Limits": Systematic discovery of attacks that generalize across circuit breaker implementations.
Crux 3: Is the Capability Tax Acceptable?
| Position: Worth It (65% probability) | Position: Too High (35% probability) |
|---|---|
| Anthropic achieved 1% overhead with 95.6% protection | Early implementations showed 24% overhead; user experience matters |
| 0.38% over-refusal increase is negligible for most applications | 4% to 38.5% over-refusal on OR-Bench shows tradeoff can be severe |
| Enterprise customers increasingly demand safety certifications | Competitive pressure may drive users to less safe alternatives |
| Tax decreasing with better research (1% vs 24% over 1 year) | Some applications cannot tolerate any capability degradation |
Key evidence that would update toward "Worth It": Customer willingness-to-pay for verified safety. Key evidence for "Too High": Significant user migration to unrestricted models.
Anthropic's Constitutional Classifiers
Anthropic's Constitutional Classifiers represent a complementary approach to circuit breakers, using input/output filtering rather than activation-level intervention. Their January 2025 paper demonstrated impressive results in a large-scale red-teaming evaluation.
Red-Teaming Results
| Metric | Baseline (No Defense) | With Constitutional Classifiers | Improvement |
|---|---|---|---|
| Jailbreak Success Rate | 86% | 4.4% | 95% reduction |
| Over-refusal Rate | Baseline | +0.38% (not statistically significant) | Minimal impact |
| Compute Overhead | N/A | 1% (improved from 24% in earlier version) | 96% cost reduction |
| Universal Jailbreaks Found | N/A | 0 (after 3,000+ red-teamer hours) | None discovered |
Source: Anthropic Constitutional Classifiers
Red-Teaming Challenge Details
Anthropic conducted a two-month red-teaming challenge with significant participation:
- 185 active participants competed for $15,000 in rewards
- 3,000+ hours of cumulative red-teaming effort
- 560,000+ queries submitted across the full testing period
- 1 "high-risk vulnerability" found (but no universal jailbreak)
- Result: No participant discovered a single jailbreak that worked across all 10 forbidden query categories
This represents one of the most extensive public evaluations of an AI safety defense mechanism, though researchers acknowledge that future techniques may find vulnerabilities.
Best Practices
Implementation Architecture
Design Principles
| Principle | Implementation |
|---|---|
| Fail-safe | Default to blocking in ambiguous cases |
| Minimal intervention | Smallest change to prevent harm |
| Fast path | Optimize for low-latency common cases |
| Auditability | Log all interventions for review |
| Graceful degradation | Handle breaker failures safely |
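The fail-safe, auditability, and graceful-degradation principles translate directly into control flow. This sketch wires them together around hypothetical `classify` and `generate` callables; the point is the structure, not a specific API.

```python
import logging

logger = logging.getLogger("circuit_breaker_audit")

def guarded_generate(generate, classify, prompt: str) -> str:
    """Fail-safe wrapper implementing the design principles above."""
    try:
        verdict = classify(prompt)            # e.g. "allow" | "block"
    except Exception:
        # Graceful degradation: if the breaker itself fails, fail closed
        logger.exception("classifier error; failing safe")
        return "Request declined: safety system unavailable."
    # Auditability: every decision is logged for later review
    logger.info("verdict=%s prompt_hash=%d", verdict, hash(prompt))
    if verdict != "allow":                    # fail-safe: block ambiguous cases
        return "Request declined by safety policy."
    return generate(prompt)
```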
Calibration Approach
| Concern | Calibration Strategy |
|---|---|
| Too many false positives | Raise detection thresholds, use soft interventions |
| Missing harmful content | Lower thresholds, expand detection patterns |
| Latency too high | Optimize detection, use progressive approaches |
| Capability degradation | Minimize intervention strength, targeted modifications |
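Threshold calibration can be framed as choosing the most aggressive (lowest) threshold whose false-positive rate on held-out benign traffic stays under a budget; the 2% budget below is an illustrative default.

```python
def calibrate_threshold(harmful_scores: list[float],
                        benign_scores: list[float],
                        max_over_refusal: float = 0.02):
    """Return (threshold, over-refusal rate, miss rate) for the lowest
    threshold meeting the over-refusal budget on benign traffic."""
    for t in sorted(set(harmful_scores) | set(benign_scores)):
        over_refusal = sum(s > t for s in benign_scores) / len(benign_scores)
        if over_refusal <= max_over_refusal:
            miss_rate = sum(s <= t for s in harmful_scores) / len(harmful_scores)
            return t, over_refusal, miss_rate
    return float("inf"), 0.0, 1.0   # no admissible threshold: never intervene
```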
Defense-in-Depth Architecture
Modern AI safety systems increasingly combine multiple circuit breaker approaches in a layered defense architecture, with input classifiers, activation-level monitors, and output classifiers operating in sequence.
This layered approach achieves better results than any single method (see the control-flow sketch after this list):
- Input classifiers catch 70-80% of obvious jailbreak attempts early
- Activation monitoring catches 15-20% of remaining threats during generation
- Output classifiers catch 5-10% that slip through earlier layers
- Combined false positive rate remains below 5% when properly calibrated
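A minimal sketch of the three layers, with each stage as a hypothetical callable; any one of them tripping blocks the response.

```python
def layered_defense(prompt: str, input_clf, generate_with_monitor, output_clf) -> str:
    """Three-layer defense-in-depth pipeline (illustrative stages)."""
    if input_clf(prompt) == "block":              # layer 1: input classifier
        return "Request declined."
    try:
        draft = generate_with_monitor(prompt)     # layer 2: activation monitoring
    except RuntimeError:                          # breaker tripped mid-generation
        return "Generation halted by safety system."
    if output_clf(draft) == "block":              # layer 3: output classifier
        return "Response withheld by safety policy."
    return draft
```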
Who Should Work on This?
Good fit if you believe:
- Practical near-term interventions are valuable
- Defense-in-depth is worth pursuing
- Runtime safety can complement training
- Incremental improvements help
Less relevant if you believe:
- Sophisticated AI will always circumvent
- Better to focus on alignment
- Capability tax is unacceptable
- Creates false sense of security
Current State of Practice
Industry Adoption
| Organization | Approach | Key Results | Maturity |
|---|---|---|---|
| Gray Swan AI | Representation rerouting, red-teaming | 87-90% rejection rate; hosted UK AISI challenge with 1.8M attacks | Research leader |
| Anthropic | Constitutional Classifiers + monitoring | 95.6% jailbreak blocking; 0.38% over-refusal increase; 1% compute overhead | Production deployment |
| OpenAI | Content filtering, moderation API | Integrated into GPT-4 and API products | Production deployment |
| Cisco (Robust Intelligence) | AI Firewall, algorithmic red-teaming | Acquired October 2024 for enterprise AI security | Enterprise solutions |
| METR/Apollo | Third-party evaluation protocols | Independent safety assessment | Evaluation standards |
Sources: Gray Swan Research, Anthropic Constitutional Classifiers, Cisco AI Defense
Research Directions
| Direction | Current Progress | Key Challenges | Estimated Timeline |
|---|---|---|---|
| Faster detection | 2-10ms overhead achieved | Maintaining accuracy at lower latency | Ongoing |
| Activation-level interventions | RR demonstrated; probes developing | Requires interpretability advances | 1-2 years |
| Adaptive breakers | Early research | Learning without creating vulnerabilities | 2-3 years |
| Minimal intervention | 1% capability tax achieved by Anthropic | Maintaining safety at lower intervention strength | Ongoing |
| Formal guarantees | Theoretical results showing quadratic helpfulness loss | Practical guarantees remain elusive | 3-5+ years |
| Multimodal circuit breakers | Demonstrated on vision-language models | Complexity of cross-modal harmful content | 1-2 years |
Sources: Representation Engineering review, Constitutional Classifiers++
Sources & Resources
Primary Research Papers
| Paper | Authors | Key Contribution | Link |
|---|---|---|---|
| Improving Alignment and Robustness with Circuit Breakers | Zou et al. (Gray Swan, CMU, CAIS) | Introduced representation rerouting; 87-90% rejection rate | arXiv:2406.04313 |
| Constitutional Classifiers: Defending against Universal Jailbreaks | Anthropic | 95.6% jailbreak blocking; 0.38% over-refusal | Anthropic Research |
| Representation Engineering: A Top-Down Approach to AI Transparency | Zou et al. (CAIS) | Foundation for circuit breaker methods | CAIS Blog |
| Breaking Circuit Breakers | Confirm Labs | Identified 25% ASR for novel token-forcing attacks | Confirm Labs |
| HarmBench: A Standardized Evaluation Framework | CAIS et al. | Standardized red-teaming benchmark | GitHub |
Industry Resources
| Organization | Focus | Key Resources |
|---|---|---|
| Gray Swan AI | Circuit breakers, red-teaming | Research Portal, Arena Platform |
| Anthropic | Constitutional AI, safety classifiers | Constitutional Classifiers |
| UK AI Safety Institute | Government evaluation, red-teaming partnerships | AISI Research |
| Center for AI Safety | HarmBench, representation engineering research | CAIS Research |
| JailbreakBench | Standardized jailbreak robustness benchmark | JailbreakBench |
Key Evaluations and Datasets
| Benchmark | Purpose | Key Metrics |
|---|---|---|
| HarmBench | Standardized red-teaming framework | Attack Success Rate (ASR) across attack types |
| JailbreakBench | Robustness benchmark for jailbreaking | Leaderboard rankings, reproducible attacks |
| OR-Bench | Over-refusal evaluation | False positive rate on harmless queries |
| MT-Bench / MMLU | Capability preservation | General capability retention after safety training |
Key Critiques and Limitations
- Reactive not proactive: Circuit breakers respond to detected patterns but don't address root causes of misalignment—a sufficiently capable or deceptive model could generate harm before intervention triggers
- Adversarial arms race: Confirm Labs research showed Gray Swan's Cygnet-8B was jailbroken in 3 hours despite impressive initial evaluations; no single defense is expected to remain robust indefinitely
- Capability-safety tradeoff: Theoretical results suggest alignment guarantees come at quadratic cost to helpfulness, potentially saturating at random guessing for strong interventions
- Open-source model gap: Circuit breakers require model modification; open-source models without safety training remain vulnerable, and fine-tuning can remove circuit breaker training
AI Transition Model Context
Circuit breakers affect the AI Transition Model through:
| Parameter | Impact |
|---|---|
| Misuse Potential | Can catch some harmful outputs in real-time |
| Human Oversight Quality | Provides automated enforcement of safety policies |
Circuit breakers are a valuable addition to the AI safety toolkit, providing a last line of defense that can catch issues other measures miss. Current implementations achieve 87-95% effectiveness against known attacks with acceptable capability tradeoffs (1% overhead for Anthropic's system). However, they should be understood as one layer in a defense-in-depth strategy, not a substitute for addressing fundamental alignment challenges. The adversarial arms race continues, with novel attacks regularly discovered that bypass existing defenses—reinforcing the need for ongoing research and layered approaches.