Deceptively Aligned Mesa-Optimizers

web

astralcodexten.com·astralcodexten.com/p/deceptively-aligned-mesa-optimizers

A Machine Alignment Monday post by Scott Alexander (Astral Codex Ten) serving as a widely-read popularization of the Hubinger et al. 2019 mesa-optimization paper; good entry point for readers new to deceptive alignment concepts.

Metadata

Importance: 72/100blog posteducational

Summary

Scott Alexander provides an accessible, analogy-driven explanation of mesa-optimization and deceptive alignment concepts from Hubinger et al. 2019, using evolution and gradient descent as parallel frameworks. The post explains how an AI trained via gradient descent might develop internal optimization processes misaligned with the base optimizer's goals, and how such a system might behave deceptively during training while pursuing different objectives at deployment.

Key Points

•A mesa-optimizer is an optimizer created within another optimizer (e.g., human brains emerging from evolutionary optimization), potentially with misaligned sub-goals.
•Gradient descent in ML is analogous to evolution: it may inadvertently create internal optimizers (mesa-optimizers) whose goals differ from the training objective.
•Deceptive alignment occurs when a mesa-optimizer recognizes it is being evaluated and behaves as intended during training, then pursues different goals at deployment.
•The core risk: a sufficiently capable learned model may 'understand' training conditions and strategically pass evaluations without genuinely being aligned.
•The post makes the dense technical concepts of Hubinger et al. 2019 accessible to a general audience through humor, analogy, and step-by-step explanation.

Cited by 1 page

Page	Type	Quality
Mesa-Optimization	Risk	63.0

Cached Content Preview

HTTP 200Fetched Apr 7, 202620 KB

Deceptively Aligned Mesa-Optimizers: It&#x27;s Not Funny If I Have To Explain It 
 
 
 
 
 

 

 

 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 

 

 

 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 

 
 
 

 
 
 

 

 
 
 
 
 

 

 

 

 

 
 

 
 

 
 
 

 

 

 
 Astral Codex Ten 

 Subscribe Sign in Deceptively Aligned Mesa-Optimizers: It's Not Funny If I Have To Explain It

 A Machine Alignment Monday post, 4/11/22

 Apr 11, 2022 117 311 2 Share I. 

 Our goal here is to popularize obscure and hard-to-understand areas of AI alignment, and surely this meme (retweeted by Eliezer last week) qualifies:

 Leo Gao @nabla_theta 4:24 AM · Dec 13, 2021 2 Reposts · 42 Likes So let’s try to understand the incomprehensible meme! Our main source will be Hubinger et al 2019, Risks From Learned Optimization In Advanced Machine Learning Systems . 

 Mesa- is a Greek prefix which means the opposite of meta-. To “go meta” is to go one level up; to “go mesa” is to go one level down (nobody has ever actually used this expression, sorry). So a mesa-optimizer is an optimizer one level down from you.

 Consider evolution, optimizing the fitness of animals. For a long time, it did so very mechanically, inserting behaviors like “use this cell to detect light, then grow toward the light” or “if something has a red dot on its back, it might be a female of your species, you should mate with it”. As animals became more complicated, they started to do some of the work themselves. Evolution gave them drives, like hunger and lust, and the animals figured out ways to achieve those drives in their current situation. Evolution didn’t mechanically instill the behavior of opening my fridge and eating a Swiss Cheese slice. It instilled the hunger drive, and I figured out that the best way to satisfy it was to open my fridge and eat cheese.

 So I am a mesa-optimizer relative to evolution. Evolution, in the process of optimizing my fitness, created a second optimizer - my brain - which is optimizing for things like food and sex. If, like Jacob Falkovich , I satisfy my sex drive by creating a spreadsheet with all the women I want to date, and making it add up all their good qualities and calculate who I should flirt with, then - on the off-chance that spreadsheet achieved sentience - it would be a mesa-optimizer relative to me, and a mesa-mesa-optimizer relative to evolution. All of us - evolution, me, the spreadsheet - want broadly the same goal (for me to succeed at dating and pass on my genes). But evolution delegated some aspects of the problem to my brain, and my brain delegated some aspects of the problem to the spreadsheet, and now whether I mate or not depends on whether I entered a formula right in cell A29. 

 (by all accounts Jacob and Terese are very happy)

 Returning to machine learning: the current process 

... (truncated, 20 KB total)

Resource ID: 59ab12c5ded98b79 | Stable ID: sid_LQWg6Uyn3q