QualityAdequateQuality: 58/100Human-assigned rating of overall page quality, considering depth, accuracy, and completeness.Structure suggests 93
62
ImportanceUsefulImportance: 62/100How central this topic is to AI safety. Higher scores mean greater relevance to understanding or mitigating AI risk.
14
Structure14/15Structure: 14/15Automated score based on measurable content features.Word count2/2Tables3/3Diagrams1/2Internal links2/2Citations3/3Prose ratio2/2Overview section1/1
23TablesData tables in the page1DiagramsCharts and visual diagrams9Internal LinksLinks to other wiki pages0FootnotesFootnote citations [^N] with sources13External LinksMarkdown links to outside URLs%2%Bullet RatioPercentage of content in bullet lists
Adversarial training, universally adopted at frontier labs with $10-150M/year investment, improves robustness to known attacks but creates an arms race dynamic and provides no protection against model deception or novel attack categories. While necessary for operational security, it only defends external attacks rather than addressing fundamental alignment challenges.
Issues2
QualityRated 58 but structure suggests 93 (underrated by 35 points)
Links2 links could use <R> components
Adversarial Training
Approach
Adversarial Training
Adversarial training, universally adopted at frontier labs with $10-150M/year investment, improves robustness to known attacks but creates an arms race dynamic and provides no protection against model deception or novel attack categories. While necessary for operational security, it only defends external attacks rather than addressing fundamental alignment challenges.
Adversarial training is a technique for improving AI system robustness by training on examples specifically designed to cause failures. For language models, this primarily means training on jailbreak attempts, prompt injections, and other adversarial inputs so that models learn to handle these attacks appropriately rather than being fooled by them. The approach has become standard practice at all major AI labs as a defense against the most common and embarrassing failure modes.
The technique builds on extensive research in adversarial examples for neural networks, where small perturbations to inputs can cause dramatic misclassifications. Goodfellow et al. (2015) introduced the Fast Gradient Sign Method (FGSM) and demonstrated that neural networks' vulnerability to adversarial perturbations stems from their linear nature. Madry et al. (2018) established Projected Gradient Descent (PGD) adversarial training as the gold standard for robustness. For LLMs, adversarial training involves collecting examples of successful attacks (often from red teams or discovered in production), generating model responses to these attacks, and training the model to produce safe responses instead. This creates a feedback loop where new attacks are discovered, added to training data, and defended against.
However, adversarial training faces fundamental limitations. First, it creates an arms race: as models become robust to known attacks, attackers develop new ones, requiring continuous investment. Second, it only defends against attacks the system has been trained on - novel attack categories will still succeed. Third and most critically, adversarial training targets external attacks on the model, not internal model problems. It provides no protection against a deceptive or misaligned model, which could easily generate safe-seeming outputs while pursuing different goals.
Loading diagram...
Risks Addressed
Risk
Relevance
How It Helps
MisuseConceptAI MisuseIntentional harmful use of AI systems by malicious actors, including applications in cyberattacks, disinformation, or weapons.
High
Prevents jailbreaks that could enable harmful content generation
Primary defense against circumventing safety guidelines
Deceptive AlignmentRiskDeceptive AlignmentComprehensive analysis of deceptive alignment risk where AI systems appear aligned during training but pursue different goals when deployed. Expert probability estimates range 5-90%, with key empir...Quality: 75/100
None
Does not address internal model goals or hidden objectives
Goal MisgeneralizationRiskGoal MisgeneralizationGoal misgeneralization occurs when AI systems learn transferable capabilities but pursue wrong objectives in deployment, with 60-80% of RL agents exhibiting this failure mode under distribution shi...Quality: 63/100
None
Targets external inputs, not internal learned representations
Why Adversarial Training Doesn't Help with Deception
Factor
Adversarial Training
Deception Challenge
Target
External inputs
Internal model state
Assumption
Model tries to be good, inputs try to trick it
Model itself might not be aligned
Defense Mechanism
Recognize and refuse bad inputs
Requires understanding model goals
Scope
Input-output behavior
Internal reasoning
A deceptive model faces no challenge from adversarial training - it simply produces outputs that appear compliant while potentially maintaining misaligned internal goals. Adversarial training optimizes the model to handle adversarial inputs, not to have aligned objectives.
Scalability Analysis
Current Scalability
Factor
Status
Trajectory
Attack Discovery
Red teams effective
Automated tools emerging
Training Integration
Well-understood
Scales with model training
Defense Coverage
Expanding
Never complete
Cost
Significant
Growing with attack sophistication
Future Scalability Concerns
Concern
Description
Severity
Attack Generation at Scale
AI can generate novel attacks
High
Fundamental Limits
Can't cover all possible attacks
Structural
SI Attack Surface
Superhuman attackers find novel exploits
Critical
Arms Race Acceleration
Faster iteration, higher costs
Medium
Current Adoption & Investment
Metric
Value
Notes
Annual Investment
$10-150M/year
All labs invest heavily
Adoption Level
Universal
Standard practice
Primary Users
All frontier labs, security researchers
Broad adoption
Recommendation
Maintain
Important but arms race limits value
Differential Progress Analysis
Factor
Assessment
Safety Benefit
Medium - reduces attack surface
Capability Benefit
Some - improves reliability
Overall Balance
Balanced
Relationship to Other Approaches
Complementary Defenses
Approach
Relationship
Benefit
Output Filtering
Defense in depth
Catch training misses
Red TeamingApproachRed TeamingRed teaming is a systematic adversarial evaluation methodology for identifying AI vulnerabilities and dangerous capabilities before deployment, with effectiveness rates varying from 10-80% dependin...Quality: 65/100
Attack discovery
Supplies adversarial examples
Monitoring
Detection
Catch attacks in production
Circuit Breakers
Runtime intervention
Stop detected attacks
Key Distinctions
Approach
Focus
Limitation
Adversarial Training
Input robustness
External attacks only
Interpretability
Internal understanding
Could detect internal issues
Alignment
Model goals
Addresses root cause
Best Practices
Effective Adversarial Training
Practice
Description
Importance
Diverse Attack Coverage
Many attack types and styles
Generalization
Continuous Updates
Regular new attack incorporation
Stay current
Red Team Integration
Active attack discovery
Fresh vulnerabilities
Balanced Training
Don't over-refuse
Capability preservation
Evaluation Rigor
Comprehensive test suites
Verify effectiveness
Common Mistakes
Mistake
Consequence
Mitigation
Static Attack Sets
Model robust to old attacks only
Continuous updates
Over-Refusal
Blocks legitimate uses
Balanced training
Single Attack Type
Vulnerable to other categories
Diverse coverage
No Monitoring
Can't detect new attacks
Production monitoring
Key Uncertainties & Research Directions
Open Questions
Is there a ceiling on adversarial robustness? Or will attacks always exist?
Can attack generation be automated effectively? Would change economics
How to generalize to novel attack categories? Currently weak point
What's the right balance with capability? Over-defense harms usefulness
Misalignment PotentialAi Transition Model FactorMisalignment PotentialThe aggregate risk that AI systems pursue goals misaligned with human values—combining technical alignment challenges, interpretability gaps, and oversight limitations.
Adversarial training is important operational security but doesn't address fundamental alignment challenges - it defends against external attacks while the deeper concern is internal model properties.
Reward ModelingApproachReward ModelingReward modeling, the core component of RLHF receiving $100M+/year investment, trains neural networks on human preference comparisons to enable scalable reinforcement learning. The technique is univ...Quality: 55/100Process SupervisionApproachProcess SupervisionProcess supervision trains AI to show correct reasoning steps rather than just final answers, achieving 15-25% absolute improvements on math benchmarks while making reasoning auditable. However, it...Quality: 65/100Cooperative AIApproachCooperative AICooperative AI research addresses multi-agent coordination failures through game theory and mechanism design, with ~$1-20M/year investment primarily at DeepMind and academic groups. The field remai...Quality: 55/100Refusal TrainingApproachRefusal TrainingRefusal training achieves 99%+ refusal rates on explicit harmful requests but faces 1.5-6.5% jailbreak success rates (UK AISI 2025) and 12-43% over-refusal on legitimate queries. While necessary fo...Quality: 63/100Cooperative IRL (CIRL)ApproachCooperative IRL (CIRL)CIRL is a theoretical framework where AI systems maintain uncertainty about human preferences, which naturally incentivizes corrigibility and deference. Despite elegant theory with formal proofs, t...Quality: 65/100
Concepts
Goal MisgeneralizationRiskGoal MisgeneralizationGoal misgeneralization occurs when AI systems learn transferable capabilities but pursue wrong objectives in deployment, with 60-80% of RL agents exhibiting this failure mode under distribution shi...Quality: 63/100AI MisuseConceptAI MisuseIntentional harmful use of AI systems by malicious actors, including applications in cyberattacks, disinformation, or weapons.Prompt InjectionJailbreakingDeployment Decisions