Skip to content
Longterm Wiki
Back

Concrete Problems in AI Safety

paper

Authors

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Widely considered one of the most influential foundational papers in technical AI safety; frequently cited as a key reference for the research agenda pursued by groups like OpenAI, Anthropic, and DeepMind safety teams.

Paper Details

Citations
2,962
135 influential
Year
2016
Methodology
survey

Metadata

Importance: 95/100arxiv preprintprimary source

Abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

Summary

This foundational paper by Amodei et al. identifies five practical AI safety research problems: avoiding side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift. It frames these as concrete technical challenges arising from real-world ML system design, providing a research agenda that has significantly shaped the field of AI safety.

Key Points

  • Identifies five core accident risk categories: side effects, reward hacking, scalable supervision, safe exploration, and distributional shift.
  • Distinguishes root causes: wrong objective functions, objectives too costly to evaluate frequently, and undesirable learning-process behavior.
  • Proposes concrete research directions grounded in real ML systems rather than speculative future AI.
  • Introduces influential concepts like reward hacking and scalable oversight that became central to subsequent AI safety research.
  • Co-authored by researchers from OpenAI and Google Brain, lending significant credibility and helping establish AI safety as a legitimate research area.

Cited by 8 pages

Cached Content Preview

HTTP 200Fetched Mar 20, 202698 KB
# Concrete Problems in AI Safety

Dario Amodei

Google Brain

These authors contributed equally.Chris Olah††footnotemark:

Google Brain

Jacob Steinhardt

Stanford University

Paul Christiano

UC Berkeley

John Schulman

OpenAI

Dan Mané

Google Brain

###### Abstract

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of _accidents_ in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function (“avoiding side effects” and “avoiding reward hacking”), an objective function that is too expensive to evaluate frequently (“scalable supervision”), or undesirable behavior during the learning process (“safe exploration” and “distributional shift”). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

## 1 Introduction

The last few years have seen rapid progress on long-standing, difficult problems in machine learning and artificial intelligence (AI), in areas as diverse as computer vision \[ [82](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx82 "")\], video game playing \[ [102](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx102 "")\], autonomous vehicles \[ [86](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx86 "")\], and Go \[ [140](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx140 "")\]. These advances have brought excitement about the positive potential for AI to transform medicine \[ [126](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx126 "")\], science \[ [59](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx59 "")\], and transportation \[ [86](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx86 "")\], along with concerns about the privacy \[ [76](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx76 "")\], security \[ [115](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx115 "")\], fairness \[ [3](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx3 "")\], economic \[ [32](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx32 "")\], and military \[ [16](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx16 "")\] implications of autonomous systems, as well as concerns about the longer-term implications of powerful AI \[ [27](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx27 ""), [167](https://ar5iv.labs.arxiv.org/html/1606.06565#bib.bibx167 "")\].

The authors believe that AI technologies are likely to be overwhelmingly beneficial for humanity, but we also believe th

... (truncated, 98 KB total)
Resource ID: cd3035dbef6c7b5b | Stable ID: M2JlZDZhMj