Bounded objectives research
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Addresses the challenge of inferring reward functions from agents with unknown rationality levels in inverse reinforcement learning, tackling a practical ambiguity problem relevant to AI alignment and human-AI preference learning.
Paper Details
Metadata
Abstract
Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention. Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent's policy in enough environments. This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam's razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple 'normative' assumptions, which cannot be deduced exclusively from observations.
Summary
This paper addresses a fundamental challenge in inverse reinforcement learning: inferring reward functions from observed behavior when the agent's rationality level is unknown. The authors prove that it is impossible to uniquely decompose an agent's policy into a planning algorithm and reward function due to a No Free Lunch result, and that even with simplicity priors, multiple decompositions can produce similarly high regret. They argue that resolving this ambiguity requires normative assumptions that cannot be derived solely from behavioral observations, highlighting a previously underexplored but practically important limitation of IRL approaches.
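The core non-identifiability can be illustrated with a toy sketch (not from the paper's text; the function names `rational_planner` and `anti_rational_planner` are hypothetical): a rational planner paired with the true reward and an anti-rational planner paired with the negated reward produce the same observable policy, so behavior alone cannot distinguish the two decompositions.

```python
# Toy illustration of planner/reward non-identifiability:
# one state, two actions, two decompositions, one observable policy.

def rational_planner(reward):
    """Pick the action maximising the given reward function."""
    actions = ["left", "right"]
    return max(actions, key=reward)

def anti_rational_planner(reward):
    """Pick the action minimising the given reward function."""
    actions = ["left", "right"]
    return min(actions, key=reward)

true_reward = {"left": 0.0, "right": 1.0}.get
negated_reward = {"left": 0.0, "right": -1.0}.get

# Both (rational planner, R) and (anti-rational planner, -R)
# yield the identical policy, so no amount of observation
# of behaviour alone can separate them.
policy_a = rational_planner(true_reward)
policy_b = anti_rational_planner(negated_reward)
assert policy_a == policy_b  # -> both choose "right"
```

The paper's point is that only a normative assumption (e.g. "the agent is approximately rational") breaks this tie, and such assumptions are not deducible from the observed policy.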
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Corrigibility Failure Pathways | Analysis | 62.0 |
| AI Alignment | Approach | 91.0 |
Cached Content Preview
# Occam’s razor is insufficient to infer the preferences of irrational agents
Stuart Armstrong*
Future of Humanity Institute
University of Oxford
stuart.armstrong@philosophy.ox.ac.uk
Sören Mindermann*
Vector Institute
University of Toronto
soeren.mindermann@gmail.com
*Equal contribution. Further affiliation: Machine Intelligence Research Institute, Berkeley, USA. Work performed at Future of Humanity Institute.
###### Abstract
Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings.
However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention.
Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent’s policy in enough environments.
This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam’s razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret.
To address this, we need simple ‘normative’ assumptions, which cannot be deduced exclusively from observations.
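Point (2) above can be made concrete with a small sketch (illustrative only, not from the paper's text): if IRL recovers the sign-flipped reward and we then optimise it, the resulting policy suffers the worst possible regret under the true reward.

```python
# Toy regret check: optimising a wrongly-decomposed (negated) reward
# incurs maximal regret under the true reward.

true_reward = {"left": 0.0, "right": 1.0}
inferred_reward = {a: -r for a, r in true_reward.items()}  # wrong sign

best_under_true = max(true_reward.values())          # 1.0
chosen = max(inferred_reward, key=inferred_reward.get)  # optimise the inferred reward
regret = best_under_true - true_reward[chosen]
# regret is maximal here: the agent picks "left" and forgoes all reward
```

This is why a simplicity prior is not enough: some decompositions that are no more complex than the true one lead, when acted upon, to maximal regret.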
## 1 Introduction
In today’s reinforcement learning systems, a simple reward function is often hand-crafted, and this still sometimes leads to undesired behaviors on the part of the RL agent, as the reward function is not well aligned with the operator’s true goals (see for example the game CoastRunners, where an RL agent didn’t finish the course, but instead found a bug allowing it to get a high score by crashing round in circles: [https://blog.openai.com/faulty-reward-functions/](https://blog.openai.com/faulty-reward-functions/)). As AI systems become more powerful and autonomous, these failures will become more frequent and grave as RL agents exceed human performance, operate at time-scales that forbid constant oversight, and are given increasingly complex tasks — from driving cars to planning cities to eventually evaluating policies or helping run companies. Ensuring that the agents behave in alignment with human values is known, appropriately, as the value alignment problem [Amodei et al., [2016](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib4), Hadfield-Menell et al., [2016](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib19), Russell et al., [2015](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib43), Bostrom, [2014](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib7), Leike et al., [2017](https://ar5iv.labs.arxiv.org/html/1712.05812#bib.bib28)].
One way of resolving this problem is to infer the correct reward function by observing human behaviour.
This is known as inverse reinforcement learning (IRL) [Ng
... (truncated, 89 KB total)