Longterm Wiki

OpenAI Goodhart Measurement

web

Credibility Rating

4/5
High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

A 2022 OpenAI research post providing formal analysis of reward model overoptimization, foundational to understanding failure modes in RLHF-based alignment approaches.

Metadata

Importance: 72/100 · blog post · primary source

Summary

OpenAI researchers provide a mathematical framework for analyzing Goodhart's Law in the context of reward modeling, focusing on best-of-n sampling as a tractable setting to measure the gap between proxy reward optimization and true objective performance. They derive closed-form approximations relating KL divergence from the original distribution to expected true reward, enabling empirical tracking of reward model overoptimization.

Key Points

  • Introduces mathematical tools to quantify the Goodhart effect: as proxy reward is over-optimized, true reward initially increases then degrades.
  • Best-of-n (rejection) sampling is analyzed as a clean, tractable method for studying proxy vs. true objective alignment empirically.
  • KL divergence from the base distribution serves as a measure of how much the optimized policy has shifted, enabling systematic study of overoptimization.
  • Findings are directly relevant to RLHF pipelines where reward models serve as proxies for human preferences in models like GPT-3.
  • Even simple best-of-64 sampling can outperform RL-trained models, highlighting inference-time compute as a tradeoff against training optimization.
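One concrete anchor for the KL-based measure above: for best-of-n over a continuous reward distribution, the KL divergence from the base distribution has the closed form KL(n) = log n − (n − 1)/n, which the post uses to relate sampling effort to distribution shift. A small sketch (the function name is ours, not from the post):

```python
import math

def best_of_n_kl(n: int) -> float:
    """Closed-form KL divergence (in nats) between the best-of-n
    distribution and the base distribution: log(n) - (n - 1)/n."""
    return math.log(n) - (n - 1) / n

# KL grows slowly with n: doubling n adds less than log(2) nats.
for n in (1, 4, 16, 64):
    print(n, round(best_of_n_kl(n), 3))
```

Because KL grows only logarithmically in n, best-of-n stays comparatively close to the base distribution even for fairly large n, which is part of why it is a clean setting for measuring overoptimization.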

Cited by 2 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 11 KB
Measuring Goodhart’s law | OpenAI

April 13, 2022

[Publication](https://openai.com/research/index/publication/)

# Measuring Goodhart’s law

![Measuring Goodhart’s Law](https://images.ctfassets.net/kftzwdyauwt9/bc7398a0-41a2-49ac-550027540010/5da5f59157a3accae2bc9bd28fd47177/image-11.webp?w=3840&q=90&fm=webp)



[Goodhart’s law⁠(opens in a new window)](https://en.wikipedia.org/wiki/Goodhart%27s_law) famously says: “When a measure becomes a target, it ceases to be a good measure.” Although originally from economics, it’s something we have to grapple with at OpenAI when figuring out how to optimize objectives that are difficult or costly to measure. It’s often necessary to introduce some **proxy objective** that’s easier or cheaper to measure, but when we do this, we need to be careful not to optimize it too much.

For example, as part of our work to [align⁠](https://openai.com/alignment/) models like GPT‑3 with human intent and values, we would like to optimize things like “How [helpful⁠](https://openai.com/index/instruction-following/) is this response?”, or “How [factually accurate⁠](https://openai.com/index/webgpt/) is this claim?”. These are complex objectives that require humans to carefully check things over. For this reason, we train a model to predict these human preferences, known as a **reward model**, and use the reward model’s predictions as a proxy objective. But it’s important to keep track of how well the true objective is being optimized.

In this post we’ll look at some of the mathematics behind how we do this. We’ll focus on a setting that is particularly clean to analyze, in which we have access to the true objective. In practice, even human preferences can fail to measure what we really care about, but we’re setting that issue aside in this post.

## Best-of-n sampling

There are many ways in which one could optimize the proxy objective, but perhaps the simplest is **best-of-n sampling**, also known as **rejection sampling** or **reranking**. We simply sample n times and take the one that scores the highest according to the proxy objective.
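The procedure just described can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation; the sampler and proxy reward below are hypothetical stand-ins for a language model and a learned reward model.

```python
import random

def best_of_n(sample, proxy_reward, n):
    """Draw n candidates and return the one the proxy reward scores highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=proxy_reward)

# Toy example: Gaussian "outputs", and a proxy that just scores the value itself.
random.seed(0)
draw = lambda: random.gauss(0.0, 1.0)
proxy = lambda x: x
best = best_of_n(draw, proxy, 16)
```

Note that the base policy is never modified; best-of-n spends extra inference-time compute instead of training, which is exactly the tradeoff discussed below.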

Although this method is very simple, it can actually be competitive with more advanced techniques such as reinforcement learning, albeit at the cost of more inference-time compute. For example, in [WebGPT⁠](https://openai.com/index/webgpt/), our best-of-64 model outperformed our reinforcement learning model, perhaps in part because the best-of-64 model got to browse many more websites. Even applying best-of-4 provided a significant boost to human preferences.

In addition, best-of-n sampling has reliable performance and is straightforward to analyze mathematically, making it well-suited to empirical st

... (truncated, 11 KB total)
Resource ID: 58937cef1e4311e9 | Stable ID: NjdjNjk2Mz