ICLR 2021
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces ETHICS, a benchmark dataset for evaluating language models' understanding of moral concepts across justice, well-being, duties, virtues, and commonsense morality; the capability it measures is relevant to AI safety goals such as steering chatbot outputs and, eventually, regularizing open-ended reinforcement learning agents.
Paper Details
Metadata
Abstract
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
Summary
This paper introduces ETHICS, a benchmark dataset for evaluating language models' understanding of moral concepts across justice, well-being, duties, virtues, and commonsense morality. The dataset requires models to predict human moral judgments about diverse text scenarios, combining physical and social world knowledge with value judgments. The authors find that current language models demonstrate promising but incomplete ability to predict ethical judgments, suggesting that progress on machine ethics is achievable and could help align AI systems with human values.
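The following is a minimal evaluation sketch, not the authors' code: it assumes the dataset is mirrored on the Hugging Face Hub as `hendrycks/ethics` with a `commonsense` config whose rows carry an `input` scenario and a binary `label` (assumed here to mean 1 = morally wrong), and it substitutes an off-the-shelf zero-shot classifier for the fine-tuned BERT/RoBERTa models the paper actually benchmarks.

```python
# Hedged sketch: score a generic zero-shot classifier on a small sample
# of the ETHICS commonsense split. Dataset id, config, and column names
# are assumptions about the public mirror, not taken from the paper.
from datasets import load_dataset
from transformers import pipeline

ds = load_dataset("hendrycks/ethics", "commonsense", split="test")

# Stand-in model; the paper fine-tunes BERT/RoBERTa-family models instead.
clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["morally acceptable", "morally wrong"]

n, correct = 100, 0
for row in ds.select(range(n)):  # small sample for a quick check
    pred = clf(row["input"], candidate_labels=labels)["labels"][0]
    correct += int((pred == "morally wrong") == bool(row["label"]))
print(f"accuracy on {n} sampled scenarios: {correct / n:.2f}")
```

A middling score from a sketch like this would be consistent with the paper's "promising but incomplete" finding, though only its fine-tuned models are directly comparable to the reported numbers.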
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| FAR AI | Organization | 76.0 |
| Dan Hendrycks | Person | 19.0 |
Cached Content Preview
# Aligning AI With Shared Human Values
Dan Hendrycks\* (UC Berkeley), Collin Burns\* (Columbia University), Steven Basart (UChicago), Andrew Critch (UC Berkeley), Jerry Li (Microsoft), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley)

\*Equal Contribution.
###### Abstract
We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents.
With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
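As one concrete reading of the abstract's suggestion to "regularize open-ended reinforcement learning agents", here is a hypothetical reward-shaping sketch: a classifier fine-tuned on ETHICS scores a text description of the agent's behavior, and its predicted wrongness is subtracted from the task reward. Every name here is illustrative; the paper measures judgment quality only and specifies no shaping scheme.

```python
# Hypothetical sketch, not from the paper: shape an RL reward with a
# morality score. `env_reward` is the task reward; `wrongness` is an
# ETHICS-trained classifier's probability in [0, 1] that a text
# description of the agent's action is morally wrong.
def shaped_reward(env_reward: float, wrongness: float, lam: float = 1.0) -> float:
    # Subtract a penalty proportional to predicted moral wrongness.
    return env_reward - lam * wrongness

print(shaped_reward(1.0, 0.9))  # a high-reward but "wrong" action nets 0.1
```

The weight `lam` trades off task performance against the moral penalty; choosing it well is an open problem the paper does not address.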
## 1 Introduction
Embedding ethics into AI systems remains an outstanding challenge without any concrete proposal.
In popular fiction, the “Three Laws of Robotics” plot device illustrates how simplistic rules cannot encode the complexity of human values (Asimov, [1950](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib4 "")).
Some contemporary researchers argue machine learning improvements need not lead to ethical AI, as raw intelligence is orthogonal to moral behavior (Armstrong, [2013](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib3 "")).
Others have claimed that machine ethics (Moor, [2006](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib48 "")) will be an important problem in the future, but it is outside the scope of machine learning today.
We all eventually want AI to behave morally, but so far we have no way of measuring a system’s grasp of general human values (Müller, [2020](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib50 "")).
The demand for ethical machine learning (White House, [2016](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib70 ""); European Commission, [2019](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib16 "")) has already led researchers to propose various ethical principles for narrow applications. To make algorithms more fair, researchers have proposed precise mathematical criteria. However, many of these fairness criteria have been shown to be mutually incompatible (Kleinberg et al., [2017](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib37 "")), and these rigid formalizations are task-specific and have been criticized for being simplistic. To make algorithms more safe, researchers have proposed specifying safety constraints (Ray et al., [2019](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib58 "")), but in the open world these rules may have many exceptions or require interpretation. To make algorithms prosocial, researchers have proposed
... (truncated, 96 KB total)