Longterm Wiki

OpenAI Goodhart Measurement

web

Credibility Rating

4/5
High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

A 2022 OpenAI research post providing formal analysis of reward model overoptimization, foundational to understanding failure modes in RLHF-based alignment approaches.

Metadata

Importance: 72/100 · blog post · primary source

Summary

OpenAI researchers provide a mathematical framework for analyzing Goodhart's Law in the context of reward modeling, focusing on best-of-n sampling as a tractable setting to measure the gap between proxy reward optimization and true objective performance. They derive closed-form approximations relating KL divergence from the original distribution to expected true reward, enabling empirical tracking of reward model overoptimization.

Key Points

  • Introduces mathematical tools to quantify the Goodhart effect: as proxy reward is over-optimized, true reward initially increases then degrades.
  • Best-of-n (rejection) sampling is analyzed as a clean, tractable method for studying proxy vs. true objective alignment empirically.
  • KL divergence from the base distribution serves as a measure of how much the optimized policy has shifted, enabling systematic study of overoptimization.
  • Findings are directly relevant to RLHF pipelines where reward models serve as proxies for human preferences in models like GPT-3.
  • Even simple best-of-64 sampling can outperform RL-trained models, highlighting inference-time compute as a tradeoff against training optimization.
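One concrete anchor for the KL-based measure above: for best-of-n over a continuous reward distribution, the KL divergence from the base distribution has the closed form KL(n) = log n − (n − 1)/n, which the post uses to relate sampling effort to distribution shift. A small sketch (the function name is ours, not from the post):

```python
import math

def best_of_n_kl(n: int) -> float:
    """Closed-form KL divergence (in nats) between the best-of-n
    distribution and the base distribution: log(n) - (n - 1)/n."""
    return math.log(n) - (n - 1) / n

# KL grows slowly with n: doubling n adds less than log(2) nats.
for n in (1, 4, 16, 64):
    print(n, round(best_of_n_kl(n), 3))
```

Because KL grows only logarithmically in n, best-of-n stays comparatively close to the base distribution even for fairly large n, which is part of why it is a clean setting for measuring overoptimization.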

Cited by 2 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 11 KB
Measuring Goodhart’s law | OpenAI

April 13, 2022

[Publication](https://openai.com/research/index/publication/)

# Measuring Goodhart’s law

![Measuring Goodhart’s Law](https://images.ctfassets.net/kftzwdyauwt9/bc7398a0-41a2-49ac-550027540010/5da5f59157a3accae2bc9bd28fd47177/image-11.webp?w=3840&q=90&fm=webp)



[Goodhart’s law⁠(opens in a new window)](https://en.wikipedia.org/wiki/Goodhart%27s_law) famously says: “When a measure becomes a target, it ceases to be a good measure.” Although originally from economics, it’s something we have to grapple with at OpenAI when figuring out how to optimize objectives that are difficult or costly to measure. It’s often necessary to introduce some **proxy objective** that’s easier or cheaper to measure, but when we do this, we need to be careful not to optimize it too much.

For example, as part of our work to [align⁠](https://openai.com/alignment/) models like GPT‑3 with human intent and values, we would like to optimize things like “How [helpful⁠](https://openai.com/index/instruction-following/) is this response?”, or “How [factually accurate⁠](https://openai.com/index/webgpt/) is this claim?”. These are complex objectives that require humans to carefully check things over. For this reason, we train a model to predict these human preferences, known as a **reward model**, and use the reward model’s predictions as a proxy objective. But it’s important to keep track of how well the true objective is being optimized.

In this post we’ll look at some of the mathematics behind how we do this. We’ll focus on a setting that is particularly clean to analyze, in which we have access to the true objective. In practice, even human preferences can fail to measure what we really care about, but we’re setting that issue aside in this post.

## Best-of-n sampling

There are many ways in which one could optimize the proxy objective, but perhaps the simplest is **best-of-n sampling**, also known as **rejection sampling** or **reranking**. We simply sample n times and take the one that scores the highest according to the proxy objective.
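The procedure just described can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation; the sampler and proxy reward below are hypothetical stand-ins for a language model and a learned reward model.

```python
import random

def best_of_n(sample, proxy_reward, n):
    """Draw n candidates and return the one the proxy reward scores highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=proxy_reward)

# Toy example: Gaussian "outputs", and a proxy that just scores the value itself.
random.seed(0)
draw = lambda: random.gauss(0.0, 1.0)
proxy = lambda x: x
best = best_of_n(draw, proxy, 16)
```

Note that the base policy is never modified; best-of-n spends extra inference-time compute instead of training, which is exactly the tradeoff discussed below.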

Although this method is very simple, it can actually be competitive with more advanced techniques such as reinforcement learning, albeit at the cost of more inference-time compute. For example, in [WebGPT⁠](https://openai.com/index/webgpt/), our best-of-64 model outperformed our reinforcement learning model, perhaps in part because the best-of-64 model got to browse many more websites. Even applying best-of-4 provided a significant boost to human preferences.

In addition, best-of-n sampling has reliable performance and is straightforward to analyze mathematically, making it well-suited to empirical st

... (truncated, 11 KB total)
Resource ID: 58937cef1e4311e9 | Stable ID: NjdjNjk2Mz