
Skalse et al. (2022)

paper

Authors

Joar Skalse · Nikolaus H. R. Howe · Dmitrii Krasheninnikov · David Krueger

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A foundational theoretical paper for understanding reward hacking; provides formal grounding for informal intuitions about Goodhart's Law and proxy gaming in AI systems, relevant to anyone working on reward specification or RLHF-based alignment.

Paper Details

Citations: 106 (3 influential)
Year: 2022

Metadata

Importance: 78/100 · arXiv preprint · primary source

Abstract

We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function leads to poor performance according to the true reward function. We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhackable proxy by leaving some terms out of the reward function (making it "narrower") or overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (in state-action visit counts) makes unhackability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be unhackable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial unhackable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.

Summary

This paper provides the first formal mathematical definition of reward hacking and the concept of 'unhackable' proxy reward functions. The authors prove that unhackability is an extremely restrictive condition: for stochastic policies, two reward functions can only be unhackable if one is constant. The work establishes fundamental limits on using narrow proxy rewards for AI alignment.
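
In the notation of the abstract (an illustrative paraphrase, not the paper's exact formal statement), write $J_{\mathcal{R}}(\pi)$ for the expected return of policy $\pi$ under reward function $\mathcal{R}$. For a policy set $\Pi$, the proxy $\tilde{\mathcal{R}}$ is unhackable with respect to the true reward $\mathcal{R}$ when

$$\forall\, \pi, \pi' \in \Pi:\quad J_{\tilde{\mathcal{R}}}(\pi') > J_{\tilde{\mathcal{R}}}(\pi) \;\Longrightarrow\; J_{\mathcal{R}}(\pi') \ge J_{\mathcal{R}}(\pi),$$

that is, moving to a policy with strictly higher proxy return never strictly lowers the true return.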

Key Points

  • First formal definition of reward hacking: optimizing a proxy reward leads to decreased true reward performance.
  • Introduces 'unhackability': a proxy is unhackable if increasing proxy returns can never decrease true returns.
  • Key result: for all stochastic policies, unhackable reward pairs require one reward to be constant—nearly trivial.
  • Non-trivial unhackable pairs do exist for deterministic policies and finite policy sets, with necessary/sufficient conditions derived.
  • Reveals a fundamental tension between narrow task specification via reward functions and broad value alignment.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Reward Hacking | Risk | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Defining and Characterizing Reward Hacking

Joar Skalse (University of Oxford)
Nikolaus H. R. Howe (Mila, Université de Montréal)
Dmitrii Krasheninnikov (University of Cambridge)
David Krueger (University of Cambridge)

Equal contribution. Correspondence to: [joar.mvs@gmail.com](mailto:joar.mvs@gmail.com), [david.scott.krueger@gmail.com](mailto:david.scott.krueger@gmail.com)

###### Abstract

We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor performance according to the true reward function, $\mathcal{R}$. We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhackable proxy by leaving some terms out of the reward function (making it "narrower") or overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (in state-action visit counts) makes unhackability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be unhackable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial unhackable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
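
To make the definition and the linearity observation concrete, here is a minimal sketch (hypothetical toy numbers and helper names, not code from the paper) that checks the unhackability condition over a finite set of policies, using the fact that expected return is linear in state-action occupancy measures:

```python
import numpy as np

# Minimal illustrative sketch (toy numbers, not the paper's code): check the
# unhackability condition from the abstract on a finite policy set, using the
# fact that expected return is linear in state-action occupancies: J_R(pi) = <eta_pi, R>.

def is_unhackable(proxy, true_reward, occupancies, tol=1e-9):
    """True iff no pair of policies has strictly higher proxy return
    but strictly lower true return."""
    j_proxy = occupancies @ proxy        # expected proxy return per policy
    j_true = occupancies @ true_reward   # expected true return per policy
    for i in range(len(j_proxy)):
        for j in range(len(j_proxy)):
            if j_proxy[j] > j_proxy[i] + tol and j_true[j] < j_true[i] - tol:
                return False  # proxy return rose while true return fell: hackable
    return True

# Three hypothetical policies described by occupancy measures over 4 state-action pairs.
eta = np.array([[0.70, 0.10, 0.10, 0.10],
                [0.10, 0.10, 0.70, 0.10],
                [0.25, 0.25, 0.25, 0.25]])
true_r = np.array([1.0, 0.0, 0.5, 0.0])

print(is_unhackable(2.0 * true_r + 1.0, true_r, eta))               # True: order-preserving proxy
print(is_unhackable(np.array([0.0, 0.0, 0.5, 0.0]), true_r, eta))   # False: "narrower" proxy
```

On this toy policy set, an order-preserving rescaling of the true reward passes the check, while a "narrower" proxy that drops one term fails, echoing the paper's point that omitting terms from a reward function does not generally yield an unhackable proxy.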

## 1 Introduction

It is well known that optimising a proxy can lead to unintended outcomes: a boat spins in circles collecting "powerups" instead of following the race track in a racing game (Clark and Amodei, [2016](https://ar5iv.labs.arxiv.org/html/2209.13085#bib.bib9)); an evolved circuit listens in on radio signals from nearby computers' oscillators instead of building its own (Bird and Layzell, [2002](https://ar5iv.labs.arxiv.org/html/2209.13085#bib.bib3)); universities reject the most qualified applicants in order to appear more selective and boost their ratings (Golden, [2001](https://ar5iv.labs.arxiv.org/html/2209.13085#bib.bib17)).
In the context of reinforcement learning (RL), such failures are called reward hacking.

For AI systems that take actions in safety-critical real-world environments such as autonomous vehicles, algorithmic trading, or content recommendation systems, these unintended outcomes can be catastrophic. This makes it crucial to align autonomous AI systems with their users' intentions. Precisely specifying which behaviours are or are not desirable is challenging, however. One approach to this specification problem is to learn an approximation of the true reward function (Ng et al., [2000](https://ar5iv.labs.arxiv.org/html/2209.13085#bib.bib30); Ziebart,

... (truncated, 98 KB total)