Summary
An interactive sortable table summarizing which AI safety approaches are likely to generalize to future architectures. Shows generalization level, dependencies, and threats for each approach.
Safety Generalizability Table

| Approach | Expected generalization to future AI architectures | Requires (to work) | Threatened by |
|---|---|---|---|
| **Mechanistic Interpretability**: circuit-level understanding of model internals (circuits, probing, activation patching). High value if it works. | LOW | Architecture stability; access to model internals | |
| **Training-Based Alignment**: shaping model behavior through training signals (RLHF, Constitutional AI, debate). Somewhat architecture-agnostic. | MEDIUM | Training access | |
| **Black-box Evaluations**: behavioral testing via capability evals, red-teaming, and benchmarks. Relatively architecture-agnostic. | MEDIUM-HIGH | Query access only | |
| **Control & Containment**: boxing, monitoring, tripwires, capability control (sandboxing, kill switches). Constrains systems regardless of their internals. | HIGH | | Few threats identified |
| **Theoretical Alignment**: mathematical frameworks, optimization theory, agent foundations (decision theory, formal frameworks). Architecture-independent by nature. | HIGHEST | | Few threats identified |

5 safety approaches
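Since the page describes the table as sortable, the ranking can be sketched in code. This is a minimal sketch, not part of the original page: the numeric ordering of the generalization labels is an assumption inferred from the labels themselves, and only data appearing in the table above is used.

```python
# Sketch: the table's rows as data, sortable by expected generalization.
# The numeric ranking of labels is an assumed ordering (LOW < MEDIUM <
# MEDIUM-HIGH < HIGH < HIGHEST), inferred from the labels in the table.

LEVELS = {"LOW": 0, "MEDIUM": 1, "MEDIUM-HIGH": 2, "HIGH": 3, "HIGHEST": 4}

approaches = [
    ("Mechanistic Interpretability", "LOW"),
    ("Training-Based Alignment", "MEDIUM"),
    ("Black-box Evaluations", "MEDIUM-HIGH"),
    ("Control & Containment", "HIGH"),
    ("Theoretical Alignment", "HIGHEST"),
]

# Sort from most to least expected generalization.
ranked = sorted(approaches, key=lambda row: LEVELS[row[1]], reverse=True)
for name, level in ranked:
    print(f"{level:<12} {name}")
```

Sorting ascending instead (`reverse=False`) would list the most architecture-dependent approaches, such as mechanistic interpretability, first.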