Arxiv Goodhart RL Study
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
An Oxford/FHI paper offering both empirical evidence and theoretical grounding for Goodhart's law in RL, with practical mitigations relevant to RLHF and reward misspecification problems central to AI alignment.
Paper Details
Metadata
Abstract
Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. We then provide a geometric explanation for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an optimal early stopping method that provably avoids the aforementioned pitfall and derive theoretical regret bounds for this method. Moreover, we derive a training method that maximises worst-case reward, for the setting where there is uncertainty about the true reward function. Finally, we evaluate our early stopping method experimentally. Our results support a foundation for a theoretically-principled study of reinforcement learning under reward misspecification.
Summary
This paper formalizes and empirically demonstrates Goodhart's law in RL—where over-optimizing a proxy reward degrades true objective performance—providing a geometric explanation in MDPs, an optimal early stopping method with theoretical regret bounds, and a worst-case reward maximization approach under reward uncertainty.
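The paper's early stopping criterion is derived from its geometric analysis of MDPs. As a loose, hypothetical illustration of the general idea only (stop optimising the proxy once a trusted estimate of true performance stops improving), a generic patience-based stopping loop might look like the sketch below; `train_step` and `eval_true` are assumed caller-supplied callables, and this is not the paper's provably optimal method.

```python
def early_stop(train_step, eval_true, max_steps=100, patience=5):
    """Run training until the trusted evaluation score stops improving.

    train_step(step) -> policy and eval_true(policy) -> float are
    hypothetical callables; this is a generic patience-based sketch,
    not the provably optimal criterion derived in the paper.
    """
    best_score, best_step, stale = -float("inf"), 0, 0
    for step in range(max_steps):
        policy = train_step(step)
        score = eval_true(policy)
        if score > best_score:
            best_score, best_step, stale = score, step, 0
        else:
            stale += 1
            if stale >= patience:
                break  # further proxy optimisation has stopped helping
    return best_step, best_score
```

The design choice to return the best step rather than the last one matters under Goodhart's law: the final policy can be strictly worse on the true objective than an earlier checkpoint.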
Key Points
- Empirically demonstrates Goodhart's law across diverse RL environments: optimizing proxy rewards beyond a critical point reliably degrades true objective performance.
- Provides a geometric explanation for why Goodhart's law manifests in Markov decision processes, grounding the phenomenon theoretically.
- Proposes an optimal early stopping method with provable regret bounds to prevent reward misspecification pitfalls during training.
- Introduces a worst-case reward maximization training method for settings with uncertainty about the true reward function.
- Establishes a theoretical framework for studying reward misspecification, directly relevant to AI safety in RLHF and learned reward models.
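To make the first point concrete, here is a small self-contained simulation (not taken from the paper) of one Goodhart mechanism: when the proxy equals the true reward plus heavy-tailed error, increasing optimisation pressure on the proxy eventually selects for the error term rather than the true objective. Here `beta` plays the role of optimisation pressure via a softmax policy over candidate actions; all names and parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_reward = rng.normal(size=n)          # true value of each candidate action
error = rng.standard_t(df=2, size=n)      # heavy-tailed misspecification error
proxy_reward = true_reward + error        # imperfect proxy

def expected_true_reward(beta):
    """Expected true reward of a softmax-over-proxy policy at pressure beta."""
    logits = beta * proxy_reward
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return float(p @ true_reward)

for beta in (0.0, 0.5, 2.0, 10.0, 100.0):
    print(f"beta={beta:6.1f}  E[true reward]={expected_true_reward(beta):+.3f}")
```

With heavy-tailed error, large `beta` concentrates probability on candidates whose proxy score is inflated by error, so the printed true-reward curve typically rises and then falls back, which is the qualitative Goodhart pattern the paper studies; with light-tailed error the effect is much weaker.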
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Reward Hacking Taxonomy and Severity Model | Analysis | 71.0 |
Cached Content Preview
# Goodhart’s Law in Reinforcement Learning
Jacek Karwowski
Department of Computer Science
University of Oxford
jacek.karwowski@cs.ox.ac.uk
Oliver Hayman
Department of Computer Science
University of Oxford
oliver.hayman@linacre.ox.ac.uk

Xingjian Bai
Department of Computer Science
University of Oxford
xingjian.bai@sjc.ox.ac.uk

Klaus Kiendlhofer
Independent
klaus.kiendlhofer@gmail.com

Charlie Griffin
Department of Computer Science
University of Oxford
charlie.griffin@cs.ox.ac.uk

Joar Skalse
Department of Computer Science
Future of Humanity Institute
University of Oxford
joar.skalse@cs.ox.ac.uk
###### Abstract
Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a _proxy_ for the true objective rather than as its definition. We study this phenomenon through the lens of _Goodhart’s law_, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to _quantify_ the magnitude of this effect and _show empirically_ that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart’s law for a wide range of environments and reward functions. We then provide a _geometric explanation_ for why Goodhart’s law occurs in Markov decision processes. We use these theoretical insights to propose an _optimal early stopping method_ that provably avoids the aforementioned pitfall and derive theoretical _regret bounds_ for this method. Moreover, we derive a training method that maximises worst-case reward, for the setting where there is uncertainty about the true reward function. Finally, we evaluate our early stopping method experimentally. Our results support a foundation for a theoretically-principled study of reinforcement learning under reward misspecification.
## 1 Introduction
To solve a problem using Reinforcement Learning (RL), it is necessary first to formalise that problem using a reward function (Sutton & Barto, [2018](https://arxiv.org/html/2310.09144v1#bib.bib27 "")).
However, due to the complexity of many real-world tasks, it is exceedingly difficult to directly specify a reward function that fully captures the task in the intended way.
At the same time, misspecified reward functions often lead to undesirable behaviour (Paulus et al., [2018](https://arxiv.org/html/2310.09144v1#bib.bib20 ""); Ibarz et al., [2018](https://arxiv.org/html/2310.09144v1#bib.bib11 ""); Knox et al., [2023](https://arxiv.org/html/2310.09144v1#bib.bib12 ""); Pan et al., [2021](https://arxiv.org/html/2310.09144v1#bib.bib18 "")). This makes designing good reward functions a major obstacle to using RL in practice, especially for safety-critical applications.
An increasingly popular solution is to _learn_ reward functions from mechanisms such as human or automated feedback (e.g. Christiano et al., [2017](https://a
... (truncated, 98 KB total)