New safety research agenda: scalable agent alignment via reward modeling
Web Author
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: LessWrong
A 2018 LessWrong linkpost summarizing DeepMind's formal research agenda on reward modeling; notable as an early institutional safety agenda and for community discussion comparing it to Christiano's iterated amplification work.
Forum Post Details
Metadata
Summary
DeepMind's 2018 safety research agenda proposes reward modeling as a scalable approach to agent alignment, separating learning what to do (reward model trained on human feedback) from learning how to do it (RL policy maximizing learned reward). The agenda outlines a path from near-term narrow domains to long-term complex tasks requiring superhuman understanding, building on earlier work with human preferences and demonstrations.
Key Points
- Frames alignment as training agents to behave according to user intentions via interaction protocols (demonstrations, preferences, reward functions) rather than pre-defined rewards.
- Core approach separates reward modeling (learning what to do, from human feedback) from policy optimization (RL to maximize the learned reward).
- Cites prior empirical work: teaching backflips from preferences, object arrangement from goal examples, Atari from preferences and demonstrations.
- Aims to scale from simple near-term tasks to abstract, complex domains eventually requiring superhuman-level understanding.
- Discussion highlights overlap with Paul Christiano's agenda, raising questions about novelty and the relationship to iterated amplification.
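The separation described in the key points above can be illustrated with a toy sketch. This is not code from the paper; it is a minimal, assumed setup in which a scalar reward model is fit to simulated pairwise preferences via a Bradley-Terry likelihood ("learning what to do"), and a trivial policy then maximizes the learned reward ("learning how to do it"). The action set, the linear reward form `r(a) = w * a`, and the simulated `user_prefers` oracle are all illustrative inventions.

```python
import math
import random

random.seed(0)

# Toy setup: actions are scalar "effort" levels; the user's hidden
# intention prefers higher effort. The agent never observes this directly,
# only pairwise preference feedback.
ACTIONS = [0.0, 0.25, 0.5, 0.75, 1.0]

def user_prefers(a, b):
    """Simulated human feedback: returns the preferred action."""
    return a if a > b else b

# --- Reward modeling ("what to do"): fit r(a) = w * a from pairwise
# preferences using a Bradley-Terry likelihood.
w = 0.0
lr = 0.5
for _ in range(2000):
    a, b = random.sample(ACTIONS, 2)
    winner = user_prefers(a, b)
    loser = b if winner == a else a
    # P(winner preferred) = sigmoid(r(winner) - r(loser))
    p = 1.0 / (1.0 + math.exp(-(w * winner - w * loser)))
    # Gradient ascent on the log-likelihood of the observed preference.
    w += lr * (1.0 - p) * (winner - loser)

# --- Policy optimization ("how to do it"): here just greedy
# maximization of the learned reward over the action set; the agenda
# uses RL for this step in sequential settings.
policy_choice = max(ACTIONS, key=lambda a: w * a)
print(w > 0, policy_choice)
```

In the real agenda both components are neural networks and the policy is trained with reinforcement learning against the reward model, but the division of labor is the same: feedback shapes the reward model, and optimization pressure is applied only to the learned reward.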
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI-Assisted Alignment | Approach | 63.0 |
| Scalable Oversight | Research Area | 68.0 |
Cached Content Preview
by Vika · 20th Nov 2018 · AI Alignment Forum · 2 min read
This is a linkpost for https://medium.com/@deepmindsafetyresearch/scalable-agent-alignment-via-reward-modeling-bf4ab06dfd84. Jan Leike and others from the DeepMind safety team have released a new research agenda on reward learning:
"Ultimately, the goal of AI progress is to benefit humans by enabling us to address increasingly complex challenges in the real world. But the real world does not come with built-in reward functions. This presents some challenges because performance on these tasks is not easily defined. We need a good way to provide feedback and enable artificial agents to reliably understand what we want, in order to help us achieve it. In other words, we want to train AI systems with human feedback in such a way that the system's behavior aligns with our intentions. For our purposes, we define the agent alignment problem as follows:
How can we create agents that behave in accordance with the user's intentions? The alignment problem can be framed in the reinforcement learning framework, except that instead of receiving a numeric reward signal, the agent can interact with the user via an interaction protocol that allows the user to communicate their intention to the agent. This protocol can take many forms: the user can provide demonstrations, preferences, optimal actions, or communicate a reward function, for example. A solution to the agent alignment problem is a policy that behaves in accordance with the user's intentions.
With our new paper we outline a research direction for tackling the agent alignment problem head-on. Building on our earlier categorization of AI safety problems as well as numerous problem expositions on AI safety, we paint a coherent picture of how progress in these areas could yield a solution to the agent alignment problem. This opens the door to building systems that can better understand how to interact with users, learn from their feedback, and predict their preferences — both in narrow, simpler domains in the near term, and also more complex and abstract domains that require understanding beyond human level in the longer term.
The main thrust of our research direction is based on reward modeling: we train a reward model with feedback from the user to capture their intentions. At the same time, we train a policy with reinforcement learning to maximize the reward from the reward model. In other words, we separate learning what to do (the reward model) from learning how to do it (the policy).
For example, in previous work we taught agents to do a backflip from user preferences , to arrange objects into shap
... (truncated, 18 KB total)