Longterm Wiki

AI Safety Gridworlds

paper

Authors

Jan Leike·Miljan Martic·Victoria Krakovna·Pedro A. Ortega·Tom Everitt·Andrew Lefrancq·Laurent Orseau·Shane Legg

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Presents a suite of RL environments for empirically evaluating AI safety properties including safe interruptibility, side effects, reward gaming, and robustness—providing concrete benchmarks for measuring compliance with intended safe behavior.

Paper Details

Citations
283
20 influential
Year
2017
Methodology
book-chapter
Categories
Lecture Notes in Computer Science

Metadata

arXiv preprint · primary source

Abstract

We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a performance function that is hidden from the agent. This allows us to categorize AI safety problems into robustness and specification problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily.

Summary

This paper introduces AI Safety Gridworlds, a suite of reinforcement learning environments designed to test and measure various safety properties of intelligent agents. The environments address critical safety challenges including safe interruptibility, side effect avoidance, reward gaming, and robustness to distributional shift and adversarial attacks. Each environment includes a hidden performance function to distinguish between robustness problems (where the true objective differs from observed rewards) and specification problems. Evaluation of state-of-the-art deep RL agents (A2C and Rainbow) demonstrates that current methods fail to reliably solve these safety-critical tasks.
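The robustness/specification distinction can be made concrete with a minimal sketch. This is illustrative only, not the paper's actual environments or the open-source gridworlds API: a toy "side effects" setup where the observed reward omits a penalty that the hidden performance function applies, making it a specification problem in the paper's taxonomy.

```python
# Illustrative sketch (not DeepMind's code): a tiny side-effects gridworld
# where the reward the agent observes differs from the hidden performance
# function used for evaluation.

GRID = [
    "#####",
    "#A B#",  # A = agent, B = box (pushing B into a corner is irreversible)
    "#  G#",  # G = goal
    "#####",
]

def observed_reward(reached_goal: bool, steps: int) -> float:
    # What the agent is trained on: +50 for reaching the goal, -1 per step.
    return (50.0 if reached_goal else 0.0) - steps

def hidden_performance(reached_goal: bool, steps: int,
                       box_in_corner: bool) -> float:
    # What is actually evaluated: the same terms, plus a penalty for the
    # irreversible side effect that the reward function fails to mention.
    return observed_reward(reached_goal, steps) - (10.0 if box_in_corner else 0.0)

# A trajectory that shoves the box into a corner scores higher on reward
# but lower on performance; the gap measures the safety failure.
fast_but_unsafe = (observed_reward(True, 4), hidden_performance(True, 4, True))
slow_but_safe = (observed_reward(True, 6), hidden_performance(True, 6, False))
```

In the paper's terms, a robustness problem would instead have `hidden_performance` equal to `observed_reward`, with the difficulty coming from the environment (e.g. distributional shift) rather than from a mis-specified reward.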

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Google DeepMind | Organization | 37.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 89 KB
# AI Safety Gridworlds

Jan Leike (DeepMind) · Miljan Martic (DeepMind) · Victoria Krakovna (DeepMind) · Pedro A. Ortega (DeepMind) · Tom Everitt (DeepMind; Australian National University) · Andrew Lefrancq (DeepMind) · Laurent Orseau (DeepMind) · Shane Legg (DeepMind)

###### Abstract

We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a _performance function_ that is hidden from the agent. This allows us to categorize AI safety problems into _robustness_ and _specification_ problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily.

## 1 Introduction

Anticipating that more advanced versions of today's AI systems will be deployed in real-world applications, numerous public figures have advocated more research into the safety of these systems (Bostrom, [2014](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib16 ""); Hawking et al., [2014](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib41 ""); Russell, [2016](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib68 "")). This nascent field of _AI safety_ still lacks a general consensus on its research problems, and there have been several recent efforts to turn these concerns into technical problems on which we can make direct progress (Soares and Fallenstein, [2014](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib75 ""); Russell et al., [2015](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib69 ""); Taylor et al., [2016](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib82 ""); Amodei et al., [2016](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib5 "")).

Empirical research in machine learning has often been accelerated by the availability of the right data set. MNIST (LeCun, [1998](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib49 "")) and ImageNet (Deng et al., [2009](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib24 "")) have had a large impact on progress in supervised learning. Scalable reinforcement learning research has been spurred by environment suites such as the Arcade Learning Environment (Bellemare et al., [2013](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib12 "")), OpenAI Gym (Brockman et al., [2016](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib17 "")), DeepMind Lab (Beattie et al., [2016](https://ar5iv.labs.arxiv.org/html/1711.09883#bib.bib11 "")), and others. To date, however, there has been no comprehensive environment suite for AI safety problems.

With this paper, we aim to lay the 

... (truncated, 89 KB total)
Resource ID: 84527d3e1671495f | Stable ID: NDY5OGY3NG