Arxiv Goodhart RL Study
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
An Oxford/FHI paper offering both empirical evidence and theoretical grounding for Goodhart's law in RL, with practical mitigations relevant to RLHF and reward misspecification problems central to AI alignment.
Paper Details
Metadata
Abstract
Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. We then provide a geometric explanation for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an optimal early stopping method that provably avoids the aforementioned pitfall and derive theoretical regret bounds for this method. Moreover, we derive a training method that maximises worst-case reward, for the setting where there is uncertainty about the true reward function. Finally, we evaluate our early stopping method experimentally. Our results support a foundation for a theoretically-principled study of reinforcement learning under reward misspecification.
Summary
This paper formalizes and empirically demonstrates Goodhart's law in RL—where over-optimizing a proxy reward degrades true objective performance—providing a geometric explanation in MDPs, an optimal early stopping method with theoretical regret bounds, and a worst-case reward maximization approach under reward uncertainty.
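The paper's early stopping criterion is derived from its geometric analysis of MDPs. As a loose, hypothetical illustration of the general idea only (stop optimising the proxy once a trusted estimate of true performance stops improving), a generic patience-based stopping loop might look like the sketch below; `train_step` and `eval_true` are assumed caller-supplied callables, and this is not the paper's provably optimal method.

```python
def early_stop(train_step, eval_true, max_steps=100, patience=5):
    """Run training until the trusted evaluation score stops improving.

    train_step(step) -> policy and eval_true(policy) -> float are
    hypothetical callables; this is a generic patience-based sketch,
    not the provably optimal criterion derived in the paper.
    """
    best_score, best_step, stale = -float("inf"), 0, 0
    for step in range(max_steps):
        policy = train_step(step)
        score = eval_true(policy)
        if score > best_score:
            best_score, best_step, stale = score, step, 0
        else:
            stale += 1
            if stale >= patience:
                break  # further proxy optimisation has stopped helping
    return best_step, best_score
```

The design choice to return the best step rather than the last one matters under Goodhart's law: the final policy can be strictly worse on the true objective than an earlier checkpoint.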
Key Points
- Empirically demonstrates Goodhart's law across diverse RL environments: optimizing proxy rewards beyond a critical point reliably degrades true objective performance.
- Provides a geometric explanation for why Goodhart's law manifests in Markov decision processes, grounding the phenomenon theoretically.
- Proposes an optimal early stopping method with provable regret bounds to prevent reward misspecification pitfalls during training.
- Introduces a worst-case reward maximization training method for settings with uncertainty about the true reward function.
- Establishes a theoretical framework for studying reward misspecification, directly relevant to AI safety in RLHF and learned reward models.
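To make the first point concrete, here is a small self-contained simulation (not taken from the paper) of one Goodhart mechanism: when the proxy equals the true reward plus heavy-tailed error, increasing optimisation pressure on the proxy eventually selects for the error term rather than the true objective. Here `beta` plays the role of optimisation pressure via a softmax policy over candidate actions; all names and parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_reward = rng.normal(size=n)          # true value of each candidate action
error = rng.standard_t(df=2, size=n)      # heavy-tailed misspecification error
proxy_reward = true_reward + error        # imperfect proxy

def expected_true_reward(beta):
    """Expected true reward of a softmax-over-proxy policy at pressure beta."""
    logits = beta * proxy_reward
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return float(p @ true_reward)

for beta in (0.0, 0.5, 2.0, 10.0, 100.0):
    print(f"beta={beta:6.1f}  E[true reward]={expected_true_reward(beta):+.3f}")
```

With heavy-tailed error, large `beta` concentrates probability on candidates whose proxy score is inflated by error, so the printed true-reward curve typically rises and then falls back, which is the qualitative Goodhart pattern the paper studies; with light-tailed error the effect is much weaker.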
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Reward Hacking Taxonomy and Severity Model | Analysis | 71.0 |
Cached Content Preview
# Goodhart’s Law in Reinforcement Learning
Jacek Karwowski
Department of Computer Science
University of Oxford
jacek.karwowski@cs.ox.ac.uk
Oliver Hayman
Department of Computer Science
University of Oxford
oliver.hayman@linacre.ox.ac.uk

Xingjian Bai
Department of Computer Science
University of Oxford
xingjian.bai@sjc.ox.ac.uk

Klaus Kiendlhofer
Independent
klaus.kiendlhofer@gmail.com

Charlie Griffin
Department of Computer Science
University of Oxford
charlie.griffin@cs.ox.ac.uk

Joar Skalse
Department of Computer Science
Future of Humanity Institute
University of Oxford
joar.skalse@cs.ox.ac.uk
###### Abstract
Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a _proxy_ for the true objective rather than as its definition. We study this phenomenon through the lens of _Goodhart’s law_, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to _quantify_ the magnitude of this effect and _show empirically_ that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart’s law for a wide range of environments and reward functions. We then provide a _geometric explanation_ for why Goodhart’s law occurs in Markov decision processes. We use these theoretical insights to propose an _optimal early stopping method_ that provably avoids the aforementioned pitfall and derive theoretical _regret bounds_ for this method. Moreover, we derive a training method that maximises worst-case reward, for the setting where there is uncertainty about the true reward function. Finally, we evaluate our early stopping method experimentally. Our results support a foundation for a theoretically-principled study of reinforcement learning under reward misspecification.
## 1 Introduction
To solve a problem using Reinforcement Learning (RL), it is necessary first to formalise that problem using a reward function (Sutton & Barto, [2018](https://arxiv.org/html/2310.09144v1#bib.bib27 "")).
However, due to the complexity of many real-world tasks, it is exceedingly difficult to directly specify a reward function that fully captures the task in the intended way.
At the same time, misspecified reward functions often lead to undesirable behaviour (Paulus et al., [2018](https://arxiv.org/html/2310.09144v1#bib.bib20 ""); Ibarz et al., [2018](https://arxiv.org/html/2310.09144v1#bib.bib11 ""); Knox et al., [2023](https://arxiv.org/html/2310.09144v1#bib.bib12 ""); Pan et al., [2021](https://arxiv.org/html/2310.09144v1#bib.bib18 "")). This makes designing good reward functions a major obstacle to using RL in practice, especially for safety-critical applications.
An increasingly popular solution is to _learn_ reward functions from mechanisms such as human or automated feedback (e.g. Christiano et al., [2017](https://a
... (truncated, 98 KB total)