Edited 2 months ago1.8k words12 backlinksUpdated every 6 weeksOverdue by 22 days
58QualityAdequate •Quality: 58/100LLM-assigned rating of overall page quality, considering depth, accuracy, and completeness.Structure suggests 10025.5ImportancePeripheralImportance: 25.5/100How central this topic is to AI safety. Higher scores mean greater relevance to understanding or mitigating AI risk.39.5ResearchLowResearch Value: 39.5/100How much value deeper investigation of this topic could yield. Higher scores indicate under-explored topics with high insight potential.
Content7/13
SummarySummaryBasic text summary used in search results, entity link tooltips, info boxes, and related page cards.ScheduleScheduleHow often the page should be refreshed. Drives the overdue tracking system.EntityEntityYAML entity definition with type, description, and related entries.Edit historyEdit historyTracked changes from improve pipeline runs and manual edits.crux edit-log view <id>OverviewOverviewA ## Overview heading section that orients readers. Helps with search and AI summaries.
Tables22/ ~7TablesData tables for structured comparisons and reference material.Diagrams1/ ~1DiagramsVisual content — Mermaid diagrams, charts, or Squiggle estimate models.–Int. links4/ ~15Int. linksLinks to other wiki pages. More internal links = better graph connectivity.Add links to other wiki pagesExt. links13/ ~9Ext. linksLinks to external websites, papers, and resources outside the wiki.Footnotes0/ ~5FootnotesFootnote citations [^N] with source references at the bottom of the page.Add [^N] footnote citations–References2/ ~5ReferencesCurated external resources linked via <R> components or cited_by in YAML.Add <R> resource linksQuotes0QuotesSupporting quotes extracted from cited sources to back up page claims.crux citations extract-quotes <id>Accuracy0AccuracyCitations verified against their sources for factual accuracy.crux citations verify <id>RatingsN:4 R:5 A:5 C:6RatingsSub-quality ratings: Novelty, Rigor, Actionability, Completeness (0-10 scale).Backlinks12BacklinksNumber of other wiki pages that link to this page. Higher backlink count means better integration into the knowledge graph.
Issues3
QualityRated 58 but structure suggests 100 (underrated by 42 points)
Links3 links could use <R> components
StaleLast edited 67 days ago - may need review
Adversarial Training
Approach
Adversarial Training
Adversarial training, universally adopted at frontier labs with $10-150M/year investment, improves robustness to known attacks but creates an arms race dynamic and provides no protection against model deception or novel attack categories. While necessary for operational security, it only defends external attacks rather than addressing fundamental alignment challenges.
Adversarial training is a technique for improving AI system robustness by training on examples specifically designed to cause failures. For language models, this primarily means training on jailbreak attempts, prompt injections, and other adversarial inputs so that models learn to handle these attacks appropriately rather than being fooled by them. The approach has become standard practice at all major AI labs as a defense against the most common and embarrassing failure modes.
The technique builds on extensive research in adversarial examples for neural networks, where small perturbations to inputs can cause dramatic misclassifications. Goodfellow et al. (2015) introduced the Fast Gradient Sign Method (FGSM) and demonstrated that neural networks' vulnerability to adversarial perturbations stems from their linear nature. Madry et al. (2018) established Projected Gradient Descent (PGD) adversarial training as the gold standard for robustness. For LLMs, adversarial training involves collecting examples of successful attacks (often from red teams or discovered in production), generating model responses to these attacks, and training the model to produce safe responses instead. This creates a feedback loop where new attacks are discovered, added to training data, and defended against.
However, adversarial training faces fundamental limitations. First, it creates an arms race: as models become robust to known attacks, attackers develop new ones, requiring continuous investment. Second, it only defends against attacks the system has been trained on - novel attack categories will still succeed. Third and most critically, adversarial training targets external attacks on the model, not internal model problems. It provides no protection against a deceptive or misaligned model, which could easily generate safe-seeming outputs while pursuing different goals.
Diagram (loading…)
flowchart TD
subgraph Discovery["Attack Discovery"]
RT[Red Team Testing]
PR[Production Monitoring]
AU[Automated Attack Generation]
end
subgraph Training["Adversarial Training Loop"]
DS[Dataset Creation]
FT[Fine-tune on Attack-Response Pairs]
EV[Evaluate Against Attack Suite]
end
subgraph Defense["Deployed Defenses"]
RB[Robust Model]
OF[Output Filtering]
MO[Runtime Monitoring]
end
RT --> DS
PR --> DS
AU --> DS
DS --> FT
FT --> EV
EV -->|Pass| RB
EV -->|Fail| DS
RB --> OF
OF --> MO
MO -->|New Attacks| PR
Risks Addressed
Risk
Relevance
How It Helps
Misuse
High
Prevents jailbreaks that could enable harmful content generation
Prompt Injection
High
Trains models to distinguish instructions from data
Jailbreaking
High
Primary defense against circumventing safety guidelines
Deceptive AlignmentRiskDeceptive AlignmentComprehensive analysis of deceptive alignment risk where AI systems appear aligned during training but pursue different goals when deployed. Expert probability estimates range 5-90%, with key empir...Quality: 75/100
None
Does not address internal model goals or hidden objectives
Goal MisgeneralizationRiskGoal MisgeneralizationGoal misgeneralization occurs when AI systems learn transferable capabilities but pursue wrong objectives in deployment, with 60-80% of RL agents exhibiting this failure mode under distribution shi...Quality: 63/100
None
Targets external inputs, not internal learned representations
Why Adversarial Training Doesn't Help with Deception
Factor
Adversarial Training
Deception Challenge
Target
External inputs
Internal model state
Assumption
Model tries to be good, inputs try to trick it
Model itself might not be aligned
Defense Mechanism
Recognize and refuse bad inputs
Requires understanding model goals
Scope
Input-output behavior
Internal reasoning
A deceptive model faces no challenge from adversarial training - it simply produces outputs that appear compliant while potentially maintaining misaligned internal goals. Adversarial training optimizes the model to handle adversarial inputs, not to have aligned objectives.
Scalability Analysis
Current Scalability
Factor
Status
Trajectory
Attack Discovery
Red teams effective
Automated tools emerging
Training Integration
Well-understood
Scales with model training
Defense Coverage
Expanding
Never complete
Cost
Significant
Growing with attack sophistication
Future Scalability Concerns
Concern
Description
Severity
Attack Generation at Scale
AI can generate novel attacks
High
Fundamental Limits
Can't cover all possible attacks
Structural
SI Attack Surface
Superhuman attackers find novel exploits
Critical
Arms Race Acceleration
Faster iteration, higher costs
Medium
Current Adoption & Investment
Metric
Value
Notes
Annual Investment
$10-150M/year
All labs invest heavily
Adoption Level
Universal
Standard practice
Primary Users
All frontier labs, security researchers
Broad adoption
Recommendation
Maintain
Important but arms race limits value
Differential Progress Analysis
Factor
Assessment
Safety Benefit
Medium - reduces attack surface
Capability Benefit
Some - improves reliability
Overall Balance
Balanced
Relationship to Other Approaches
Complementary Defenses
Approach
Relationship
Benefit
Output Filtering
Defense in depth
Catch training misses
Red TeamingResearch AreaRed TeamingRed teaming is a systematic adversarial evaluation methodology for identifying AI vulnerabilities and dangerous capabilities before deployment, with effectiveness rates varying from 10-80% dependin...Quality: 65/100
Attack discovery
Supplies adversarial examples
Monitoring
Detection
Catch attacks in production
Circuit Breakers
Runtime intervention
Stop detected attacks
Key Distinctions
Approach
Focus
Limitation
Adversarial Training
Input robustness
External attacks only
Interpretability
Internal understanding
Could detect internal issues
Alignment
Model goals
Addresses root cause
Best Practices
Effective Adversarial Training
Practice
Description
Importance
Diverse Attack Coverage
Many attack types and styles
Generalization
Continuous Updates
Regular new attack incorporation
Stay current
Red Team Integration
Active attack discovery
Fresh vulnerabilities
Balanced Training
Don't over-refuse
Capability preservation
Evaluation Rigor
Comprehensive test suites
Verify effectiveness
Common Mistakes
Mistake
Consequence
Mitigation
Static Attack Sets
Model robust to old attacks only
Continuous updates
Over-Refusal
Blocks legitimate uses
Balanced training
Single Attack Type
Vulnerable to other categories
Diverse coverage
No Monitoring
Can't detect new attacks
Production monitoring
Key Uncertainties & Research Directions
Open Questions
Is there a ceiling on adversarial robustness? Or will attacks always exist?
Can attack generation be automated effectively? Would change economics
How to generalize to novel attack categories? Currently weak point
What's the right balance with capability? Over-defense harms usefulness
This paper presents an automated method for generating adversarial suffixes that can jailbreak aligned large language models, causing them to produce objectionable content. Rather than relying on manual engineering, the approach uses greedy and gradient-based search techniques to find universal attack suffixes that can be appended to harmful queries. Remarkably, these adversarial suffixes demonstrate strong transferability across different models and architectures, successfully inducing harmful outputs in both closed-source systems (ChatGPT, Bard, Claude) and open-source models (LLaMA-2-Chat, Pythia, Falcon). This work significantly advances adversarial attack capabilities against aligned LLMs and highlights critical vulnerabilities in current safety alignment approaches.
This paper introduces Constitutional Classifiers, a defense mechanism against universal jailbreaks in large language models. The approach trains classifiers on synthetic data generated using natural language rules (a constitution) to specify permitted and restricted content. Through extensive red teaming (3,000+ estimated hours), the authors demonstrate that their classifier-guarded LLMs successfully defend against universal jailbreaks while maintaining practical deployment viability, with only a 0.38% increase in production-traffic refusals and 23.7% inference overhead. The work shows that defending against sophisticated, multi-turn attacks that enable harmful processes (like manufacturing illegal substances) is tractable without severely compromising model usability.
Deceptive AlignmentRiskDeceptive AlignmentComprehensive analysis of deceptive alignment risk where AI systems appear aligned during training but pursue different goals when deployed. Expert probability estimates range 5-90%, with key empir...Quality: 75/100Goal MisgeneralizationRiskGoal MisgeneralizationGoal misgeneralization occurs when AI systems learn transferable capabilities but pursue wrong objectives in deployment, with 60-80% of RL agents exhibiting this failure mode under distribution shi...Quality: 63/100DeepfakesRiskDeepfakesComprehensive overview of deepfake risks documenting $60M+ in fraud losses, 90%+ non-consensual imagery prevalence, and declining detection effectiveness (65% best accuracy). Reviews technical capa...Quality: 50/100
Approaches
Reward ModelingApproachReward ModelingReward modeling, the core component of RLHF receiving $100M+/year investment, trains neural networks on human preference comparisons to enable scalable reinforcement learning. The technique is univ...Quality: 55/100Process SupervisionApproachProcess SupervisionProcess supervision trains AI to show correct reasoning steps rather than just final answers, achieving 15-25% absolute improvements on math benchmarks while making reasoning auditable. However, it...Quality: 65/100Cooperative AIApproachCooperative AICooperative AI research addresses multi-agent coordination failures through game theory and mechanism design, with ~$1-20M/year investment primarily at DeepMind and academic groups. The field remai...Quality: 55/100Refusal TrainingApproachRefusal TrainingRefusal training achieves 99%+ refusal rates on explicit harmful requests but faces 1.5-6.5% jailbreak success rates (UK AISI 2025) and 12-43% over-refusal on legitimate queries. While necessary fo...Quality: 63/100
Key Debates
AI Accident Risk CruxesCruxAI Accident Risk CruxesComprehensive survey of AI safety researcher disagreements on accident risks, quantifying probability ranges for mesa-optimization (15-55%), deceptive alignment (15-50%), and P(doom) (5-35% median ...Quality: 67/100Why Alignment Might Be HardArgumentWhy Alignment Might Be HardA comprehensive taxonomy of alignment difficulty arguments spanning specification problems, inner alignment failures, verification limits, and adversarial dynamics, with expert p(doom) estimates ra...Quality: 69/100Is AI Existential Risk Real?CruxIs AI Existential Risk Real?Covers the foundational AI x-risk debate across four core cruxes: instrumental convergence, warning sign availability, corrigibility achievability, and timeline urgency. Incorporates quantitative e...Quality: 12/100
Other
Red TeamingResearch AreaRed TeamingRed teaming is a systematic adversarial evaluation methodology for identifying AI vulnerabilities and dangerous capabilities before deployment, with effectiveness rates varying from 10-80% dependin...Quality: 65/100Paul ChristianoPersonPaul ChristianoComprehensive biography of Paul Christiano documenting his technical contributions (IDA, debate, scalable oversight), risk assessment (~10-20% P(doom), AGI 2030s-2040s), and evolution from higher o...Quality: 39/100
Concepts
Safety Orgs OverviewSafety Orgs OverviewA well-organized reference overview of ~20 AI safety organizations categorized by function (alignment research, policy, field-building), with a comparative budget/headcount table showing estimated ...Quality: 48/100Alignment Training OverviewAlignment Training OverviewA bare-bones index page listing 10 alignment training methods (RLHF, Constitutional AI, DPO, process supervision, etc.) with one-line descriptions and links to deeper pages, providing no analysis, ...Quality: 27/100AI Misuse