Goodhart's Law (AI Alignment Forum Wiki)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Alignment Forum
A foundational wiki entry on a core AI alignment concept; the Garrabrant taxonomy it summarizes is widely cited in technical alignment literature on specification gaming and reward hacking.
Metadata
Summary
This wiki entry explains Goodhart's Law—when a proxy measure becomes the optimization target, it ceases to be a good proxy—and its critical relevance to AI alignment. It presents Scott Garrabrant's taxonomy of four Goodhart failure modes: regressional, causal, extremal, and adversarial, each describing a distinct mechanism by which proxy measures break down under optimization pressure.
Key Points
- Goodhart's Law: optimizing a proxy measure causes it to stop accurately representing the underlying goal it was meant to capture.
- AI alignment risk: a powerful AI optimizing even a good proxy for human values may cause that proxy to catastrophically break down.
- Regressional Goodhart: optimization selects for noise in the proxy, not just the true underlying goal.
- Extremal Goodhart: at extreme values, the proxy-goal correlation observed under normal conditions may no longer hold.
- Adversarial Goodhart: optimization creates incentives for agents to game the proxy, actively destroying its correlation with the true goal.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Reward Hacking | Risk | 91.0 |
Cached Content Preview
# Goodhart's Law
Edited by [Ruby](https://www.alignmentforum.org/users/ruby), [Vladimir\_Nesov](https://www.alignmentforum.org/users/vladimir_nesov), et al. Last updated 19th Mar 2023.
**Goodhart's Law** states that when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy. One form of Goodhart is demonstrated by the Soviet story of a factory graded on how many shoes it produced (a good proxy for productivity): the factory soon began producing a higher number of tiny shoes. Useless, but the numbers look good.
Goodhart's Law is of particular relevance to [AI Alignment](https://www.alignmentforum.org/w/ai). Suppose you have something which is generally a good proxy for "the stuff that humans care about". It would be dangerous to have a powerful AI optimize for that proxy, because, in accordance with Goodhart's Law, the proxy will break down.
## Goodhart Taxonomy
In [Goodhart Taxonomy](https://www.alignmentforum.org/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy), Scott Garrabrant identifies four kinds of Goodharting:
- Regressional Goodhart - When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
- Causal Goodhart - When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal. (A toy simulation of this mode appears after this list.)
- Extremal Goodhart - Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
- Adversarial Goodhart - When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
## See Also
- [Groupthink](https://www.alignmentforum.org/w/groupthink), [Information cascade](https://www.alignmentforum.org/w/information-cascade)
... (truncated, 8 KB total)