Edited 2 months ago1.7k words1 backlinksUpdated every 6 weeksOverdue by 22 days
65QualityGood •Quality: 65/100LLM-assigned rating of overall page quality, considering depth, accuracy, and completeness.Structure suggests 9366ImportanceUsefulImportance: 66/100How central this topic is to AI safety. Higher scores mean greater relevance to understanding or mitigating AI risk.71.5ResearchHighResearch Value: 71.5/100How much value deeper investigation of this topic could yield. Higher scores indicate under-explored topics with high insight potential.
Content8/13
SummarySummaryBasic text summary used in search results, entity link tooltips, info boxes, and related page cards.ScheduleScheduleHow often the page should be refreshed. Drives the overdue tracking system.EntityEntityYAML entity definition with type, description, and related entries.Edit history1Edit historyTracked changes from improve pipeline runs and manual edits.OverviewOverviewA ## Overview heading section that orients readers. Helps with search and AI summaries.
Tables20/ ~7TablesData tables for structured comparisons and reference material.Diagrams1/ ~1DiagramsVisual content — Mermaid diagrams, charts, or Squiggle estimate models.–Int. links3/ ~13Int. linksLinks to other wiki pages. More internal links = better graph connectivity.Add links to other wiki pagesExt. links20/ ~8Ext. linksLinks to external websites, papers, and resources outside the wiki.Footnotes0/ ~5FootnotesFootnote citations [^N] with source references at the bottom of the page.Add [^N] footnote citations–References3/ ~5ReferencesCurated external resources linked via <R> components or cited_by in YAML.Add <R> resource linksQuotes0QuotesSupporting quotes extracted from cited sources to back up page claims.crux citations extract-quotes <id>Accuracy0AccuracyCitations verified against their sources for factual accuracy.crux citations verify <id>RatingsN:4.5 R:5 A:6 C:6.5RatingsSub-quality ratings: Novelty, Rigor, Actionability, Completeness (0-10 scale).Backlinks1BacklinksNumber of other wiki pages that link to this page. Higher backlink count means better integration into the knowledge graph.
Change History1
Fix audit report findings from PR #2167 weeks ago
Reviewed PR #216 (comprehensive wiki audit report) and implemented fixes for the major issues it identified: fixed 181 path-style EntityLink IDs across 33 files, converted 164 broken EntityLinks (referencing non-existent entities) to plain text across 38 files, fixed a temporal inconsistency in anthropic.mdx, and added missing description fields to 53 ai-transition-model pages.
Issues3
QualityRated 65 but structure suggests 93 (underrated by 28 points)
Links7 links could use <R> components
StaleLast edited 67 days ago - may need review
Capability Unlearning / Removal
Approach
Capability Unlearning / Removal
Capability unlearning removes dangerous capabilities (e.g., bioweapon synthesis) from AI models through gradient-based methods, representation engineering, and fine-tuning, achieving 60-80% reduction on WMDP benchmarks with combined approaches. However, verification is impossible, capabilities are recoverable through fine-tuning, and knowledge entanglement limits what can be safely removed, making this a defense-in-depth layer rather than complete solution.
Center for AI SafetyOrganizationCenter for AI SafetyCAIS is a nonprofit research organization founded by Dan Hendrycks that has distributed compute grants to researchers, published technical AI safety papers including the representation engineering ...Quality: 42/100
Approaches
Representation EngineeringApproachRepresentation EngineeringRepresentation engineering enables behavior steering and deception detection by manipulating concept-level vectors in neural networks, achieving 80-95% success in controlled experiments for honesty...Quality: 72/100Responsible Scaling PoliciesApproachResponsible Scaling PoliciesComprehensive analysis of Responsible Scaling Policies showing 20 companies with published frameworks as of Dec 2025, with SaferAI grading major policies 1.9-2.2/5 for specificity. Evidence suggest...Quality: 62/100
1.7k words · 1 backlinks
Overview
Capability unlearning represents a direct approach to AI safety: rather than preventing misuse through behavioral constraints that might be circumvented, remove the dangerous capabilities themselves from the model. If a model genuinely doesn't know how to synthesize dangerous pathogens or construct cyberattacks, it cannot be misused for these purposes regardless of jailbreaks, fine-tuning attacks, or other elicitation techniques.
The approach has gained significant research attention following the development of benchmarks like WMDP (Weapons of Mass Destruction Proxy), released in March 2024 by the Center for AI SafetyOrganizationCenter for AI SafetyCAIS is a nonprofit research organization founded by Dan Hendrycks that has distributed compute grants to researchers, published technical AI safety papers including the representation engineering ...Quality: 42/100 in collaboration with over twenty academic institutions and industry partners. WMDP contains 3,668 multiple-choice questions measuring dangerous knowledge in biosecurity, cybersecurity, and chemical security. Researchers have demonstrated that various techniques including gradient-based unlearning, representation engineeringApproachRepresentation EngineeringRepresentation engineering enables behavior steering and deception detection by manipulating concept-level vectors in neural networks, achieving 80-95% success in controlled experiments for honesty...Quality: 72/100, and fine-tuning can reduce model performance on these benchmarks while preserving general capabilities.
However, the field faces fundamental challenges that may limit its effectiveness. First, verifying complete capability removal is extremely difficult, as capabilities may be recoverable through fine-tuning, prompt engineering, or other elicitation methods. Second, dangerous and beneficial knowledge are often entangled, meaning removal may degrade useful capabilities. Third, for advanced AI systems, the model might understand what capabilities are being removed and resist or hide the remaining knowledge. These limitations suggest capability unlearning is best viewed as one layer in a defense-in-depth strategy rather than a complete solution.
Risk Assessment & Impact
Dimension
Assessment
Evidence
Timeline
Safety Uplift
High (if works)
Would directly remove dangerous capabilities
Near to medium-term
Capability Uplift
Negative
Explicitly removes capabilities
N/A
Net World Safety
Helpful
Would be valuable if reliably achievable
Near-term
Lab Incentive
Moderate
Useful for deployment compliance; may reduce utility
Questions were designed as proxies for hazardous knowledge rather than containing sensitive information directly. The benchmark is publicly available with the most dangerous questions withheld.
Unlearning Effectiveness
The TOFU benchmark (published at COLM 2024) evaluates unlearning on synthetic author profiles, measuring both forgetting quality and model utility retention:
Bioweapons RiskRiskBioweapons RiskComprehensive synthesis of AI-bioweapons evidence through early 2026, including the FRI expert survey finding 5x risk increase from AI capabilities (0.3% → 1.5% annual epidemic probability), Anthro...Quality: 91/100
High
Removes pathogen synthesis, enhancement knowledge
Dual-use biology knowledge entangled
Cyberattacks
High
Removes exploit development, attack techniques
Security knowledge widely distributed
High
Directly reduces dangerous capability surface
Recovery via fine-tuning possible
Open Sourcing Risk
High
Critical for open-weight releases where runtime controls absent
ShieldLM introduces a safety detection framework that trains large language models to identify unsafe content in LLM outputs, offering customizable detection rules and explainable reasoning. The system is designed to align with diverse safety standards and provides transparent justifications for its safety judgments, addressing limitations of black-box moderation systems.
WMDP is a benchmark designed to measure and evaluate hazardous knowledge in large language models related to biosecurity, chemical, nuclear, and radiological weapons. It serves as a proxy for assessing dangerous capabilities in AI systems and supports unlearning research aimed at reducing such risks. The benchmark helps researchers identify and mitigate the potential for LLMs to assist in weapons development.
The Center for AI Safety (CAIS) is a research organization focused on mitigating catastrophic and existential risks from advanced AI systems. It conducts technical research, publishes surveys and statements, and supports field-building efforts across academia and industry. CAIS is notable for its broad coalition-building, including its widely-cited statement on AI extinction risk signed by leading researchers.
AI Uplift Assessment ModelAnalysisAI Uplift Assessment ModelQuantitative assessment estimating AI provides modest knowledge uplift for bioweapons (1.0-1.2x per RAND 2024) but more substantial evasion capabilities (2-3x, potentially 7-10x by 2028). The Virol...Quality: 70/100Bioweapons Attack Chain ModelAnalysisBioweapons Attack Chain ModelMultiplicative attack chain model estimates catastrophic bioweapons probability at 0.02-3.6%, with state actors (3.0%) showing highest estimated risk due to lab access. DNA synthesis screening offe...Quality: 69/100AI-Bioweapons Timeline ModelAnalysisAI-Bioweapons Timeline ModelTimeline model projects AI-bioweapons capabilities crossing four thresholds: knowledge democratization already partially crossed (fully by 2025-2027), synthesis assistance 2027-2032 (median 2029), ...Quality: 58/100
Approaches
Refusal TrainingApproachRefusal TrainingRefusal training achieves 99%+ refusal rates on explicit harmful requests but faces 1.5-6.5% jailbreak success rates (UK AISI 2025) and 12-43% over-refusal on legitimate queries. While necessary fo...Quality: 63/100Dangerous Capability EvaluationsApproachDangerous Capability EvaluationsComprehensive synthesis showing dangerous capability evaluations are now standard practice (95%+ frontier models) but face critical limitations: AI capabilities double every 7 months while external...Quality: 64/100Eliciting Latent Knowledge (ELK)ApproachEliciting Latent Knowledge (ELK)Comprehensive analysis of the Eliciting Latent Knowledge problem with quantified research metrics: ARC's prize contest received 197 proposals, awarded $274K, but $50K and $100K prizes remain unclai...Quality: 91/100
Key Debates
AI Misuse Risk CruxesCruxAI Misuse Risk CruxesComprehensive analysis of AI misuse cruxes with quantified evidence across bioweapons (RAND bio study found no significant difference; novice uplift studies show modest gains on in silico tasks), c...Quality: 65/100