
Hadfield-Menell et al. (2016)

paper

Authors

Dylan Hadfield-Menell · Anca Dragan · Pieter Abbeel · Stuart Russell

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Foundational work on value alignment that formalizes the alignment problem as cooperative inverse reinforcement learning, proposing a framework where AI systems learn human values through interaction rather than explicit specification.

Paper Details

Citations: 724 (43 influential)
Year: 2016

Metadata

arXiv preprint · primary source

Abstract

For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans. We propose a formal definition of the value alignment problem as cooperative inverse reinforcement learning (CIRL). A CIRL problem is a cooperative, partial-information game with two agents, human and robot; both are rewarded according to the human's reward function, but the robot does not initially know what this is. In contrast to classical IRL, where the human is assumed to act optimally in isolation, optimal CIRL solutions produce behaviors such as active teaching, active learning, and communicative actions that are more effective in achieving value alignment. We show that computing optimal joint policies in CIRL games can be reduced to solving a POMDP, prove that optimality in isolation is suboptimal in CIRL, and derive an approximate CIRL algorithm.
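
For reference, the game structure the abstract describes can be written out as a tuple. A sketch in the paper's notation (restated from the arXiv version; treat it as a paraphrase of the paper's definition, not verbatim):

```latex
% Sketch of the CIRL game tuple, restated from the paper's definition.
M = \langle \mathcal{S},\; \{\mathcal{A}^{\mathbf{H}}, \mathcal{A}^{\mathbf{R}}\},\;
    T(\cdot \mid \cdot, \cdot, \cdot),\; \{\Theta, R(\cdot, \cdot, \cdot; \cdot)\},\;
    P_0(\cdot, \cdot),\; \gamma \rangle
% S                      -- set of world states
% A^H, A^R               -- action sets for the human H and the robot R
% T(s' | s, a^H, a^R)    -- transition distribution
% R(s, a^H, a^R; theta)  -- shared reward, parametrized by theta in Theta
% P_0(s_0, theta)        -- initial distribution over state and reward parameter
% gamma                  -- discount factor
% H observes theta at the start of play; R does not. Both agents maximize
% the same expected discounted reward, so the game is fully cooperative.
```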

Summary

This paper formalizes the value alignment problem in autonomous systems as Cooperative Inverse Reinforcement Learning (CIRL), where a robot and human jointly maximize the human's unknown reward function through cooperation. Unlike classical IRL where the human acts in isolation, CIRL enables optimal behaviors including active teaching, active learning, and communication that facilitate value alignment. The authors prove that individual optimality is suboptimal in cooperative settings, reduce CIRL to POMDP solving, and provide an approximate algorithm for computing optimal joint policies.
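
The POMDP reduction works because, from the robot's perspective, θ is simply hidden state: a belief over θ, updated from the human's observed actions under an assumed human policy, is a sufficient statistic, and the pair (world state, belief) plays the role of the POMDP belief state over S × Θ. A minimal sketch of that belief update, with a toy two-value θ and a Boltzmann-rational human model (the setup, names, and stand-in Q-function here are illustrative, not from the paper):

```python
import numpy as np

# Toy belief update at the heart of the CIRL-to-POMDP reduction: the robot
# maintains a belief over the reward parameter theta and updates it after
# observing the human's action, assuming a known human policy.

THETAS = [0.0, 1.0]   # candidate reward parameters (illustrative)
ACTIONS = [0, 1]      # human's action set (illustrative)

def human_q_value(theta: float, action: int) -> float:
    """Stand-in Q-value: the human prefers the action matching theta."""
    return -abs(theta - action)

def human_policy(theta: float, beta: float = 5.0) -> np.ndarray:
    """Boltzmann-rational human: P(a | theta) proportional to exp(beta * Q)."""
    logits = beta * np.array([human_q_value(theta, a) for a in ACTIONS])
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def belief_update(belief: np.ndarray, observed_action: int) -> np.ndarray:
    """Bayes rule: b'(theta) proportional to P(a_H | theta) * b(theta)."""
    likelihoods = np.array(
        [human_policy(theta)[observed_action] for theta in THETAS]
    )
    posterior = likelihoods * belief
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])                      # uniform prior over theta
belief = belief_update(belief, observed_action=1)
print(belief)                                      # mass shifts toward theta = 1.0
```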

Cited by 4 pages

| Page | Type | Quality |
| --- | --- | --- |
| Center for Human-Compatible AI | Organization | 37.0 |
| Agent Foundations | Approach | 59.0 |
| Cooperative IRL (CIRL) | Approach | 65.0 |
| Cooperative AI | Approach | 55.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 52 KB
# Cooperative Inverse Reinforcement Learning

Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell

Electrical Engineering and Computer Science

University of California at Berkeley

Berkeley, CA 94709

{dhm, anca, pabbeel, russell}@cs.berkeley.edu

###### Abstract

For an autonomous system to be helpful to humans and to pose no
unwarranted risks, it needs to align its values with those of the
humans in its environment in such a way that its actions contribute to
the maximization of value for the humans. We propose a formal
definition of the value alignment problem as cooperative inverse
reinforcement learning (CIRL). A CIRL problem is a cooperative,
partial-information game with two agents, human and robot; both are
rewarded according to the human’s reward function, but the robot does
not initially know what this is. In contrast to classical IRL, where
the human is assumed to act optimally in isolation, optimal CIRL
solutions produce behaviors such as active teaching, active learning,
and communicative actions that are more effective in achieving value
alignment. We show that computing optimal joint policies in CIRL games
can be reduced to solving a POMDP, prove that optimality in isolation
is suboptimal in CIRL, and derive an approximate CIRL algorithm.

## 1 Introduction

“If we use, to achieve our purposes, a mechanical agency with
whose operation we cannot interfere effectively … we had better
be quite sure that the purpose put into the machine is the purpose
which we really desire.” So wrote Norbert
Wiener ([1960](https://ar5iv.labs.arxiv.org/html/1606.03137#bib.bib25 "")) in one of the earliest explanations of
the problems that arise when a powerful autonomous system operates
with an incorrect objective. This value alignment problem is far
from trivial. Humans are prone to mis-stating their objectives, which
can lead to unexpected implementations. In the myth of King Midas, the
main character learns that wishing for ‘everything he touches to turn
to gold’ leads to disaster. In a reinforcement learning
context, Russell & Norvig ([2010](https://ar5iv.labs.arxiv.org/html/1606.03137#bib.bib22 "")) describe a seemingly reasonable,
but incorrect, reward function for a vacuum robot: if we reward the
action of cleaning up dirt, the optimal policy causes the robot to
repeatedly dump and clean up the same dirt.
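
That failure mode takes only a few lines to reproduce. A toy sketch (the environment, policies, and numbers are invented for illustration, not from Russell & Norvig) in which paying +1 per cleanup makes the dump-and-clean loop strictly outscore cleaning once:

```python
# Misspecified reward: +1 per unit of dirt cleaned. A policy that dumps
# and re-cleans the same dirt accumulates more reward than one that
# cleans once and stops.

def run(policy, steps: int = 10) -> int:
    dirt, reward = 1, 0
    for _ in range(steps):
        action = policy(dirt)
        if action == "clean" and dirt:
            dirt, reward = 0, reward + 1   # +1 per cleanup
        elif action == "dump":
            dirt = 1                       # re-create the dirt
    return reward

clean_once = lambda dirt: "clean" if dirt else "wait"
dump_loop  = lambda dirt: "clean" if dirt else "dump"

print(run(clean_once))  # 1 -- cleans once, then idles
print(run(dump_loop))   # 5 -- alternates dump/clean to farm reward
```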

A solution to the value alignment problem has long-term implications
for the future of AI and its relationship to
humanity (Bostrom, [2014](https://ar5iv.labs.arxiv.org/html/1606.03137#bib.bib4 "")) and short-term utility
for the design of usable AI systems. Giving robots the right
objectives and enabling them to make the right trade-offs is crucial
for self-driving cars, personal assistants, and human–robot
interaction more broadly.

The field of _inverse reinforcement learning_ or
IRL (Russell, [1998](https://ar5iv.labs.arxiv.org/html/1606.03137#bib.bib23 ""); Ng & Russell, [2000](https://ar

... (truncated, 52 KB total)