Longterm Wiki

Bounded objectives research

paper

Authors

Stuart Armstrong·Sören Mindermann

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Addresses the challenge of inferring reward functions from agents with unknown rationality levels in inverse reinforcement learning, tackling a practical ambiguity problem relevant to AI alignment and human-AI preference learning.

Paper Details

Citations
0

Metadata

arXiv preprint · primary source

Abstract

Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention. Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent's policy in enough environments. This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam's razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple `normative' assumptions, which cannot be deduced exclusively from observations.
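The abstract's second point, that a simplicity prior cannot separate the true decomposition from degenerate ones, can be illustrated with a crude sketch (my own illustration, not the paper's formalism). Here compressed source length stands in as a rough proxy for description complexity; the "anti-rational" planner's description is essentially as short as the rational one's:

```python
# Crude sketch: compressed description length as a stand-in for a
# simplicity prior over (planner, reward) decompositions.
# This is an illustrative proxy, not the paper's formal measure.
import zlib

def description_length(source: str) -> int:
    # Proxy for description complexity: length of compressed source text.
    return len(zlib.compress(source.encode()))

# Two candidate planner descriptions, differing only in max vs min.
rational = "def plan(R, A): return max(A, key=R)"
anti     = "def plan(R, A): return min(A, key=R)"

# The anti-rational planner is about as simple as the rational one,
# so a simplicity prior alone cannot favor the true decomposition.
gap = abs(description_length(rational) - description_length(anti))
```

Pairing the anti-rational planner with the negated reward reproduces the same policy, yet the combined description is barely longer, which is the intuition behind the paper's result.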

Summary

This paper addresses a fundamental challenge in inverse reinforcement learning: inferring reward functions from observed behavior when the agent's rationality level is unknown. The authors prove that it is impossible to uniquely decompose an agent's policy into a planning algorithm and reward function due to a No Free Lunch result, and that even with simplicity priors, multiple decompositions can produce similarly high regret. They argue that resolving this ambiguity requires normative assumptions that cannot be derived solely from behavioral observations, highlighting a previously underexplored but practically important limitation of IRL approaches.
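The non-uniqueness of the decomposition can be shown with a toy example (a minimal sketch of my own, not code from the paper): a "rational" planner maximizing the true reward and an "anti-rational" planner minimizing the negated reward produce identical behavior, so observations alone cannot tell the two decompositions apart.

```python
# Toy illustration: two different (planner, reward) decompositions
# that yield the same policy, so behavior alone cannot distinguish
# the true reward from its negation.

def greedy_planner(reward, actions):
    # "Rational" planner: picks the action with maximal reward.
    return max(actions, key=reward)

def anti_planner(reward, actions):
    # "Anti-rational" planner: picks the action with minimal reward.
    return min(actions, key=reward)

def true_reward(a):
    return {"stay": 0.0, "left": 1.0, "right": -1.0}[a]

def negated_reward(a):
    return -true_reward(a)

actions = ["stay", "left", "right"]
policy_a = greedy_planner(true_reward, actions)     # rational + R
policy_b = anti_planner(negated_reward, actions)    # anti-rational + (-R)

assert policy_a == policy_b  # identical behavior, opposite rewards
```

Acting to maximize the second decomposition's reward would incur maximal regret under the true reward, which is why the ambiguity matters for alignment rather than being a harmless relabeling.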

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Corrigibility Failure Pathways | Analysis | 62.0 |
| AI Alignment | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 89 KB
# Occam’s razor is insufficient to infer the preferences of irrational agents

Stuart Armstrong \*

Future of Humanity Institute

University of Oxford

stuart.armstrong@philosophy.ox.ac.uk

Sören Mindermann\*

Vector Institute

University of Toronto

soeren.mindermann@gmail.com

\*Equal contribution. Further affiliation: Machine Intelligence Research Institute, Berkeley, USA. Work performed at Future of Humanity Institute.

###### Abstract

Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings.
However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention.
Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent’s policy in enough environments.
This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam’s razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret.
To address this, we need simple ‘normative’ assumptions, which cannot be deduced exclusively from observations.

## 1 Introduction

In today’s reinforcement learning systems, a simple reward function is often hand-crafted, and still sometimes leads to undesired behaviors on the part of the RL agent, as the reward function is not well aligned with the operator’s true goals.[^1] As AI systems become more powerful and autonomous, these failures will become more frequent and grave as RL agents exceed human performance, operate at time-scales that forbid constant oversight, and are given increasingly complex tasks, from driving cars to planning cities to eventually evaluating policies or helping run companies. Ensuring that the agents behave in alignment with human values is known, appropriately, as the value alignment problem \[Amodei et al., [2016](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib4 ""), Hadfield-Menell et al., [2016](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib19 ""), Russell et al., [2015](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib43 ""), Bostrom, [2014](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib7 ""), Leike et al., [2017](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib28 "")\].

[^1]: See for example the game CoastRunners, where an RL agent didn’t finish the course, but instead found a bug allowing it to get a high score by crashing round in circles: [https://blog.openai.com/faulty-reward-functions/](https://blog.openai.com/faulty-reward-functions/ "").

One way of resolving this problem is to infer the correct reward function by observing human behaviour.
This is known as inverse reinforcement learning (IRL) \[Ng

... (truncated, 89 KB total)
Resource ID: 6b7fc3f234fa109c | Stable ID: NGQyYmEwYj