Longterm Wiki
Updated 2026-02-10
Summary

An interactive sortable table summarizing which AI safety approaches are likely to generalize to future architectures. Shows generalization level, dependencies, and threats for each approach.


Safety Generalizability Table

| Approach | Description | Examples | Expected generalization to future architectures | Requires (to work) | Threatened by |
|---|---|---|---|---|---|
| Mechanistic Interpretability | Circuit-level understanding of model internals. High value if it works, but highly dependent on architecture stability and access. | Circuits, probing, activation patching | LOW | White-box access available? Representations converge? Architecture stable enough? | Heavy scaffolding? Novel architecture emerges? |
| Training-Based Alignment | Shaping model behavior through training signals. Requires training access but is somewhat architecture-agnostic. | RLHF, Constitutional AI, debate | MEDIUM | Training access available? Gradient-based training continues? Single trainable system? | Long distillation chains? |
| Black-box Evaluations | Behavioral testing of model outputs. Only requires query access; relatively architecture-agnostic. | Capability evals, red-teaming, benchmarks | MEDIUM-HIGH | Query access available? Behavior predictable enough? | Emergent multi-agent behavior? |
| Control & Containment | Boxing, monitoring, tripwires, and capability control. Focuses on constraining systems regardless of their internals. | Sandboxing, monitoring, kill switches | HIGH | Sandboxing feasible? Monitoring effective? Capability boundaries clear? | Few threats identified |
| Theoretical Alignment | Mathematical frameworks, optimization theory, and agent foundations. Architecture-independent by nature. | Agent foundations, decision theory, formal frameworks | HIGHEST | Math applies to real systems? | Few threats identified |
5 safety approaches
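The sortable view of this table amounts to ranking rows by their generalization level. A minimal sketch of that encoding, assuming a simple ordinal mapping of the levels (the record fields and rank values here are illustrative, not part of the wiki's data model):

```python
# Illustrative ordinal ranking of the table's generalization levels.
GENERALIZATION_RANK = {"LOW": 0, "MEDIUM": 1, "MEDIUM-HIGH": 2, "HIGH": 3, "HIGHEST": 4}

# Each table row as a record (only two columns shown for brevity).
approaches = [
    {"name": "Mechanistic Interpretability", "generalization": "LOW"},
    {"name": "Training-Based Alignment", "generalization": "MEDIUM"},
    {"name": "Black-box Evaluations", "generalization": "MEDIUM-HIGH"},
    {"name": "Control & Containment", "generalization": "HIGH"},
    {"name": "Theoretical Alignment", "generalization": "HIGHEST"},
]

# Sort descending: the most architecture-robust approaches first.
by_generalization = sorted(
    approaches,
    key=lambda row: GENERALIZATION_RANK[row["generalization"]],
    reverse=True,
)

for row in by_generalization:
    print(f'{row["generalization"]:<12} {row["name"]}')
```

Sorting on an explicit rank mapping (rather than the level strings themselves) keeps "MEDIUM-HIGH" ordered between "MEDIUM" and "HIGH", which a plain lexicographic sort would get wrong.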