Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

blog

2023·Alignment Forum·alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organism...

Authors

evhub·Nicholas Schiefer·Carson Denison·Ethan Perez

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

An influential Alignment Forum post (Best of LessWrong 2023) that helped establish a research agenda around empirically demonstrating misalignment failure modes; closely related to Anthropic's sleeper agents work and broader efforts to make alignment risks legible to policymakers and researchers.

Metadata

Importance: 82/100 · blog post · primary source

Summary

This post proposes creating 'model organisms of misalignment'—controlled, reproducible experimental settings that deliberately demonstrate specific AI failure modes such as deceptive alignment and reward hacking. Analogizing to biological model organisms in scientific research, the authors argue this approach would systematically advance alignment science and facilitate global coordination on AI safety standards. The framework provides a roadmap for empirically studying dangerous misalignment behaviors in tractable settings before they appear in frontier systems.

Key Points

• Proposes 'model organisms of misalignment': simplified systems engineered to reliably exhibit specific dangerous behaviors, such as deceptive alignment or reward hacking, so they can be studied.
• Draws an analogy to biological model organisms (e.g., fruit flies, mice) that enable reproducible, tractable study of complex phenomena under controlled conditions.
• Argues this approach serves both scientific goals (understanding misalignment mechanisms) and coordination goals (providing concrete evidence to motivate policy and safety standards).
• Outlines specific example training setups for deceptive inner misalignment and reward hacking, with varying degrees of realism and complexity.
• Addresses potential objections, including the dual-use risks of demonstrating dangerous capabilities, and argues that the benefits to safety research outweigh these concerns.

Cited by 1 page

Page	Type	Quality
Model Organisms of Misalignment	Analysis	65.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 66 KB

[Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#)

22 min read

• [The Problem](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#The_Problem)
• [The Plan](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#The_Plan)
• [Roadmap](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#Roadmap)
• [Possible subcomponents of AI takeover to demonstrate](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#Possible_subcomponents_of_AI_takeover_to_demonstrate)
• [Each subcomponent can be demonstrated in settings with various degrees of realism](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#Each_subcomponent_can_be_demonstrated_in_settings_with_various_degrees_of_realism)
• [Approach](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#Approach)
• [What we learn with model organisms](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#What_we_learn_with_model_organisms)
• [Techniques for developing good demonstrations](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#Techniques_for_developing_good_demonstrations)
• [Meta](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#Meta)
• [Examples of Model Organism Training Setups](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#Examples_of_Model_Organism_Training_Setups)
• [Deceptive inner misalignment setups](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#Deceptive_inner_misalignment_setups)
• [Reward hacking setups](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#Reward_hacking_setups)
• [The Case for Model Organisms](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1#The_Case_for_Model_Organisms)
• [The Scientific Case](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model

... (truncated, 66 KB total)

Resource ID: 2dc41a9727c99ef4 | Stable ID: sid_fFkGoQedHr