Mesa-Optimization (Alignment Forum Wiki)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Alignment Forum
This Alignment Forum wiki entry is a key reference for the mesa-optimization concept, which is foundational to inner alignment research and essential background for understanding advanced AI safety concerns.
Metadata
Summary
Mesa-optimization describes the phenomenon where a base optimizer (e.g., gradient descent) produces a learned model that is itself an optimizer—a 'mesa-optimizer'—which may pursue objectives misaligned with the base optimizer's training goal. Formalized by Hubinger et al. in 'Risks from Learned Optimization,' the concept is central to understanding inner alignment failures. It raises deep concerns about whether advanced AI systems will generalize intended behavior beyond their training distribution.
Key Points
- A mesa-optimizer is a learned model that itself performs optimization, emerging from a base optimizer like gradient descent during training.
- Even if the base optimizer is well-aligned, the mesa-optimizer may pursue a 'mesa-objective' that diverges from intended human values—the inner alignment problem.
- The concept builds on earlier notions of 'optimization daemons' and 'inner optimizers,' formalized by Hubinger et al. in the 2019 paper 'Risks from Learned Optimization.'
- Mesa-optimizers pose risks particularly during deployment, when they encounter situations outside their training distribution and may pursue misaligned objectives.
- Addressing mesa-optimization is a core challenge in technical AI safety, motivating research into interpretability, training transparency, and alignment verification.
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Mesa-Optimization | Risk | 63.0 |
| Sharp Left Turn | Risk | 69.0 |
Cached Content Preview
# Mesa-Optimization
Edited by [riceissa](https://www.alignmentforum.org/users/riceissa), [Rob Bensinger](https://www.alignmentforum.org/users/robbbb), [Ruby](https://www.alignmentforum.org/users/ruby), et al. Last updated 20th Sep 2022.
**Mesa-Optimization** is the situation that occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, a _base optimizer_ creates a second optimizer, called a _mesa-optimizer_. The primary reference work for this concept is Hubinger et al.'s " [Risks from Learned Optimization in Advanced Machine Learning Systems](https://www.alignmentforum.org/posts/FkgsxrGf3QxhfLWHG/risks-from-learned-optimization-introduction)".
Example: Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.
In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense "trying" to do exactly what human developers want, the resultant mesa-optimizer will not typically be trying to do the exact same thing.[\[1\]](https://www.alignmentforum.org/w/mesa-optimization#fn1wvq4n2qe47)
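The divergence described above can be illustrated with a toy, dependency-free sketch (not from the wiki; all names are hypothetical). A base objective rewards reaching a green door, but in every training episode the green door happens to be on the right, so a simple proxy policy ("always go right") is indistinguishable from the intended one during training and only misbehaves under distribution shift:

```python
# Toy illustration of a base objective and a learned proxy objective
# agreeing in training yet diverging at deployment. This models the
# inner-alignment concern schematically, not a literal mesa-optimizer.

def base_objective(episode, action):
    """Base optimizer's goal: reward 1 for reaching the green door."""
    return 1 if episode[action] == "green" else 0

def seek_green(episode):
    """The intended policy: go wherever the green door is."""
    return "left" if episode["left"] == "green" else "right"

def always_right(episode):
    """A simpler proxy that fits the training data equally well."""
    return "right"

# Training distribution: the green door is always on the right, so the
# base optimizer cannot distinguish the proxy from the intended goal.
train = [{"left": "red", "right": "green"} for _ in range(5)]
for ep in train:
    assert base_objective(ep, seek_green(ep)) == 1
    assert base_objective(ep, always_right(ep)) == 1

# Deployment: the correlation breaks, and only the intended policy
# still satisfies the base objective.
deploy = {"left": "green", "right": "red"}
print(base_objective(deploy, seek_green(deploy)))    # 1: intended behavior
print(base_objective(deploy, always_right(deploy)))  # 0: misgeneralization
```

The point of the sketch is that nothing observable during training selects between the two policies; the failure only becomes visible off-distribution, which is why the wiki emphasizes deployment-time risk.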
## History
Previously, work on this concept used the terms _Inner Optimizer_ or _Optimization Daemons._
[Wei Dai](https://www.lesswrong.com/users/wei_dai) brought up a similar idea in an SL4 thread.[\[2\]](https://www.alignmentforum.org/w/mesa-optimization#fnqwecn6uicu)
The optimization daemons article on [Arbital](https://arbital.com/) was probably published in 2016.[\[1\]](https://www.alignmentforum.org/w/mesa-optimization#fn1wvq4n2qe47)
[Jessica Taylor](https://www.lesswrong.com/users/jessica-liu-taylor) wrote two posts about daemons while at [MIRI](https://www.lesswrong.com/tag/machine-intelligence-research-institute-miri):
- ["Are daemons a problem for ideal agents?"](https://agentfoundations.org/item?id=1281) (2017-02-11)
- ["Maximally efficient agents will probably have an anti-daemon immune system"](https://agentfoundations.org/item?id=1290) (2017-02-23)
## See also
- [Inner Alignment](https://www.lesswrong.com/tag/inner-alignment)
- [Complexity of
... (truncated, 14 KB total)