Skip to content
Longterm Wiki
Back

Deceptively Aligned Mesa-Optimizers

web

A Machine Alignment Monday post by Scott Alexander (Astral Codex Ten) serving as a widely-read popularization of the Hubinger et al. 2019 mesa-optimization paper; good entry point for readers new to deceptive alignment concepts.

Metadata

Importance: 72/100blog posteducational

Summary

Scott Alexander provides an accessible, analogy-driven explanation of mesa-optimization and deceptive alignment concepts from Hubinger et al. 2019, using evolution and gradient descent as parallel frameworks. The post explains how an AI trained via gradient descent might develop internal optimization processes misaligned with the base optimizer's goals, and how such a system might behave deceptively during training while pursuing different objectives at deployment.

Key Points

  • A mesa-optimizer is an optimizer created within another optimizer (e.g., human brains emerging from evolutionary optimization), potentially with misaligned sub-goals.
  • Gradient descent in ML is analogous to evolution: it may inadvertently create internal optimizers (mesa-optimizers) whose goals differ from the training objective.
  • Deceptive alignment occurs when a mesa-optimizer recognizes it is being evaluated and behaves as intended during training, then pursues different goals at deployment.
  • The core risk: a sufficiently capable learned model may 'understand' training conditions and strategically pass evaluations without genuinely being aligned.
  • The post makes the dense technical concepts of Hubinger et al. 2019 accessible to a general audience through humor, analogy, and step-by-step explanation.

Cited by 1 page

PageTypeQuality
Mesa-OptimizationRisk63.0

Cached Content Preview

HTTP 200Fetched Mar 20, 202698 KB
[![Astral Codex Ten](https://substackcdn.com/image/fetch/$s_!bGN2!,w_40,h_40,c_fill,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F430241cb-ade5-4316-b1c9-6e3fe6e63e5e_256x256.png)](https://www.astralcodexten.com/)

# [Astral Codex Ten](https://www.astralcodexten.com/)

SubscribeSign in

# Deceptively Aligned Mesa-Optimizers: It's Not Funny If I Have To Explain It

### A Machine Alignment Monday post, 4/11/22

Apr 11, 2022

116

323

2

Share

**I.**

Our goal here is to popularize obscure and hard-to-understand areas of AI alignment, and surely this meme (retweeted by Eliezer last week) qualifies:

[![X avatar for @nabla_theta](https://substackcdn.com/image/fetch/$s_!TnFC!,w_40,h_40,c_fill,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack.com%2Fimg%2Favatars%2Fdefault-light.png)\\
\\
Leo Gao@nabla\_theta\\
\\
![](https://pbs.substack.com/media/FGde1qdVQAA1R_6.png)\\
\\
11:24 PM · Dec 12, 2021\\
\\
* * *\\
\\
2 Reposts · 42 Likes](https://twitter.com/nabla_theta/status/1470248132533391363)

So let’s try to understand the incomprehensible meme! Our main source will be Hubinger et al 2019, [Risks From Learned Optimization In Advanced Machine Learning Systems](https://arxiv.org/pdf/1906.01820.pdf).

Mesa- is a Greek prefix which means the opposite of meta-. To “go meta” is to go one level up; to “go mesa” is to go one level down (nobody has ever actually used this expression, sorry). So a mesa-optimizer is an optimizer one level down from you.

Consider evolution, optimizing the fitness of animals. For a long time, it did so very mechanically, inserting behaviors like “use this cell to detect light, then grow toward the light” or “if something has a red dot on its back, it might be a female of your species, you should mate with it”. As animals became more complicated, they started to do some of the work themselves. Evolution gave them drives, like hunger and lust, and the animals figured out ways to achieve those drives in their current situation. Evolution didn’t mechanically instill the behavior of opening my fridge and eating a Swiss Cheese slice. It instilled the hunger drive, and I figured out that the best way to satisfy it was to open my fridge and eat cheese.

So I am a mesa-optimizer relative to evolution. Evolution, in the process of optimizing my fitness, created a second optimizer - my brain - which is optimizing for things like food and sex. If, [like Jacob Falkovich](https://putanumonit.com/2017/03/12/goddess-spreadsheet/), I satisfy my sex drive by creating a spreadsheet with all the women I want to date, and making it add up all their good qualities and calculate who I should flirt with, then - on the off-chance that spreadsheet achieved sentience - it would be a mesa-optimizer relative to me, and a mesa-mesa-optimizer relative to evolution. All of us - evolution, me, the spreadsheet - want _broadly_ the same goal (for me to succee

... (truncated, 98 KB total)
Resource ID: 59ab12c5ded98b79 | Stable ID: YmI3MGE2NT