Goodhart's Law empirically confirmed
Paper authors: David Manheim, Scott Garrabrant
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A foundational reference for understanding reward hacking and specification gaming in AI systems; the four-way taxonomy (Regressional, Extremal, Causal, Adversarial) is widely cited in AI safety literature when discussing proxy misalignment and overoptimization risks.
Paper Details
Metadata
Abstract
There are several distinct failure modes for overoptimization of systems on the basis of metrics. This occurs when a metric which can be used to improve a system is used to an extent that further optimization is ineffective or harmful, and is sometimes termed Goodhart's Law. This class of failure is often poorly understood, partly because terminology for discussing them is ambiguous, and partly because discussion using this ambiguous terminology ignores distinctions between different failure modes of this general type. This paper expands on an earlier discussion by Garrabrant, which notes there are "(at least) four different mechanisms" that relate to Goodhart's Law. This paper is intended to explore these mechanisms further, and specify more clearly how they occur. This discussion should be helpful in better understanding these types of failures in economic regulation, in public policy, in machine learning, and in Artificial Intelligence alignment. The importance of Goodhart effects depends on the amount of power directed towards optimizing the proxy, and so the increased optimization power offered by artificial intelligence makes it especially critical for that field.
Summary
This paper expands Garrabrant's framework to identify and clarify at least four distinct mechanisms underlying Goodhart's Law—where optimizing a proxy metric causes it to cease being a reliable measure of the intended goal. The authors demonstrate these failure modes across economics, public policy, machine learning, and AI alignment, arguing that AI's increased optimization power makes these failures especially critical to understand.
Key Points
- Identifies at least four distinct failure mechanisms behind Goodhart's Law: Regressional, Extremal, Causal, and Adversarial Goodhart effects.
- Clarifies that ambiguous terminology has led to conflation of distinct failure modes, obscuring their different causes and remedies.
- The severity of Goodhart effects scales with the optimization power applied, making AI systems especially vulnerable.
- Applies the taxonomy across domains including economic regulation, public policy, ML training objectives, and AI alignment.
- Builds on Garrabrant's earlier informal taxonomy, providing more formal and operational definitions of each failure mode.
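The first mechanism, Regressional Goodhart, can be illustrated with a small simulation. The sketch below is not from the paper; it assumes a toy model where the proxy is the true goal plus independent Gaussian noise. Selecting the candidates with the highest proxy scores then systematically overstates their true goal value, because extreme proxy scores are partly extreme noise (regression to the mean):

```python
# Illustrative sketch of Regressional Goodhart (toy assumption: proxy = goal + noise).
import random

random.seed(0)

N, K = 100_000, 100  # population size, number selected by the optimizer

# True goal values, and a noisy proxy measurement for each candidate.
goals = [random.gauss(0, 1) for _ in range(N)]
proxies = [g + random.gauss(0, 1) for g in goals]

# "Optimize the proxy": pick the K candidates with the highest proxy score.
top = sorted(range(N), key=lambda i: proxies[i], reverse=True)[:K]

mean_proxy = sum(proxies[i] for i in top) / K
mean_goal = sum(goals[i] for i in top) / K

# The selected group's true goal value falls well short of its proxy score,
# because selection for the proxy also selected for the noise term.
print(f"mean proxy of selected: {mean_proxy:.2f}")
print(f"mean goal of selected:  {mean_goal:.2f}")
```

Under this toy model the selected group's mean goal value is roughly half its mean proxy score, matching the paper's point that selection for an imperfect proxy "necessarily also selects for noise."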
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Reward Hacking Taxonomy and Severity Model | Analysis | 71.0 |
| Reward Hacking | Risk | 91.0 |
Cached Content Preview
# Categorizing Variants of Goodhart’s Law
David Manheim (davidmanheim@gmail.com), Scott Garrabrant (scott@intelligence.org)
There are several distinct failure modes for overoptimization of systems on the basis of metrics. This occurs when a metric which can be used to improve a system is used to such an extent that further optimization is ineffective or harmful, and is sometimes termed Goodhart’s Law.[^1] This class of failure is often poorly understood, partly because terminology for discussing them is ambiguous, and partly because discussion using this ambiguous terminology ignores distinctions between different failure modes of this general type.

[^1]: As a historical note, Goodhart’s Law \[ [1](https://ar5iv.labs.arxiv.org/html/1803.04585#bib.bib1 "")\] as originally formulated states that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” This has been interpreted and explained more widely, perhaps to the point where it is ambiguous what the term means. Other closely related formulations, such as Campbell’s law (which arguably has scholarly precedence \[ [3](https://ar5iv.labs.arxiv.org/html/1803.04585#bib.bib3 "")\]) and the Lucas critique, were also initially specific, and their interpretation has also been expanded greatly. Lastly, the Cobra Effect and perverse incentives are often closely related to these failures, and the different effects interact. Because none of the terms were laid out formally, the categories proposed do not match what was originally discussed. A separate forthcoming paper intends to address the relationship between those formulations and the categories more formally explained here.
This paper expands on an earlier discussion by Garrabrant \[ [2](https://ar5iv.labs.arxiv.org/html/1803.04585#bib.bib2 "")\], which notes there are “(at least) four different mechanisms” that relate to Goodhart’s Law. This paper is intended to explore these mechanisms further, and specify more clearly how they occur. This discussion should be helpful in better understanding these types of failures in economic regulation, in public policy, in machine learning, and in artificial intelligence alignment \[ [4](https://ar5iv.labs.arxiv.org/html/1803.04585#bib.bib4 "")\]. The importance of Goodhart effects depends on the amount of power directed towards optimizing the proxy, and so the increased optimization power offered by artificial intelligence makes it especially critical for that field.
## Varieties of Goodhart-like Phenomena
As used in this paper, a Goodhart effect is when optimization causes a collapse of the statistical relationship between a goal which the optimizer intends and the proxy used for that goal. The four categories of Goodhart effects introduced by Garrabrant are
1) Regressional, where selection for an imperfect proxy necessarily also selects for noise, 2) Extremal, where selection for the metric pushes the state distribution into a region where old relationships no longer ho
... (truncated, 27 KB total)