Longterm Wiki
Updated 2026-02-10
Summary

An interactive sortable table summarizing which AI safety approaches are likely to generalize to future architectures. Shows generalization level, dependencies, and threats for each approach.


Safety Generalizability Table

| Approach | Description | Examples | Expected generalization to future architectures | Requires (to work) | Threatened by |
|---|---|---|---|---|---|
| Mechanistic Interpretability | Circuit-level understanding of model internals. High value if it works, but highly dependent on architecture stability and access. | Circuits, probing, activation patching | LOW | White-box access available? Representations converge? Architecture stable enough? | Heavy scaffolding? Novel architecture emerges? |
| Training-Based Alignment | Shaping model behavior through training signals. Requires training access but is somewhat architecture-agnostic. | RLHF, Constitutional AI, debate | MEDIUM | Training access available? Gradient-based training continues? Single trainable system? | Long distillation chains? |
| Black-box Evaluations | Behavioral testing of model outputs. Only requires query access; relatively architecture-agnostic. | Capability evals, red-teaming, benchmarks | MEDIUM-HIGH | Query access available? Behavior predictable enough? | Emergent multi-agent behavior? |
| Control & Containment | Boxing, monitoring, tripwires, and capability control. Focuses on constraining systems regardless of their internals. | Sandboxing, monitoring, kill switches | HIGH | Sandboxing feasible? Monitoring effective? Capability boundaries clear? | Few threats identified |
| Theoretical Alignment | Mathematical frameworks, optimization theory, and agent foundations. Architecture-independent by nature. | Agent foundations, decision theory, formal frameworks | HIGHEST | Math applies to real systems? | Few threats identified |
5 safety approaches
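The sortable view of this table amounts to ranking rows by their generalization level. A minimal sketch of that encoding, assuming a simple ordinal mapping of the levels (the record fields and rank values here are illustrative, not part of the wiki's data model):

```python
# Illustrative ordinal ranking of the table's generalization levels.
GENERALIZATION_RANK = {"LOW": 0, "MEDIUM": 1, "MEDIUM-HIGH": 2, "HIGH": 3, "HIGHEST": 4}

# Each table row as a record (only two columns shown for brevity).
approaches = [
    {"name": "Mechanistic Interpretability", "generalization": "LOW"},
    {"name": "Training-Based Alignment", "generalization": "MEDIUM"},
    {"name": "Black-box Evaluations", "generalization": "MEDIUM-HIGH"},
    {"name": "Control & Containment", "generalization": "HIGH"},
    {"name": "Theoretical Alignment", "generalization": "HIGHEST"},
]

# Sort descending: the most architecture-robust approaches first.
by_generalization = sorted(
    approaches,
    key=lambda row: GENERALIZATION_RANK[row["generalization"]],
    reverse=True,
)

for row in by_generalization:
    print(f'{row["generalization"]:<12} {row["name"]}')
```

Sorting on an explicit rank mapping (rather than the level strings themselves) keeps "MEDIUM-HIGH" ordered between "MEDIUM" and "HIGH", which a plain lexicographic sort would get wrong.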