Citation

Model Organisms of Misalignment - Footnote 55

confirmed · 90% confidence

1 evidence check

Last checked: 4/3/2026

Migrated from citation_quotes. Original verdict: accurate

Evidence — 1 source, 1 check

axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html (1 check)

confirmed · 90% · Haiku 4.5 · 4/3/2026

Found: **Non-robust behaviors**: Deceptive alignment reasoning can actually reduce robustness in small models, and the generalization properties of RLHF remain debated—behaviors may be neither fully context-…

Note: Migrated from citation_quotes accuracy check. Original verdict: accurate

Debug info

Record type: citation

Record ID: page:model-organisms-of-misalignment:fn55