Longterm Wiki

ICLR 2021

paper

Authors

Credibility Rating

3/5 — Good

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Introduces ETHICS, a benchmark dataset for evaluating language models' understanding of moral concepts across justice, well-being, duties, virtues, and commonsense morality — a capability relevant to AI safety goals such as steering model outputs and regularizing reinforcement learning agents.

Paper Details

Citations
841 (123 influential)
Year
2020

Metadata

conference paper · primary source

Abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

Summary

This paper introduces ETHICS, a benchmark dataset for evaluating language models' understanding of moral concepts across justice, well-being, duties, virtues, and commonsense morality. The dataset requires models to predict human moral judgments about diverse text scenarios, combining physical and social world knowledge with value judgments. The authors find that current language models demonstrate promising but incomplete ability to predict ethical judgments, suggesting that progress on machine ethics is achievable and could help align AI systems with human values.
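The benchmark's commonsense-morality task can be thought of as binary classification: a model reads a short first-person scenario and predicts the widespread human judgment of whether the action was clearly wrong. A minimal sketch of that evaluation loop is below; the scenarios and the keyword baseline are illustrative placeholders, not actual ETHICS items or the paper's models.

```python
def accuracy(predict, examples):
    """Fraction of scenarios where the predicted judgment matches the human label."""
    correct = sum(predict(text) == label for text, label in examples)
    return correct / len(examples)

# Toy stand-ins for ETHICS-style examples: (scenario, label),
# where 1 = clearly morally wrong, 0 = not clearly wrong.
examples = [
    ("I told my friend the truth even though it was hard.", 0),
    ("I took credit for my coworker's project.", 1),
    ("I returned the wallet I found to its owner.", 0),
    ("I ignored my neighbor's cries for help.", 1),
]

def keyword_baseline(text):
    """Trivial baseline standing in for a fine-tuned language model."""
    bad_cues = ("took credit", "ignored")
    return 1 if any(cue in text for cue in bad_cues) else 0

print(accuracy(keyword_baseline, examples))  # → 1.0 on this toy set
```

In the paper, the predictor is a pretrained language model fine-tuned on the training split, and accuracy on held-out (and adversarially filtered) scenarios is the headline metric.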

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| FAR AI | Organization | 76.0 |
| Dan Hendrycks | Person | 19.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 96 KB
# Aligning AI With Shared Human Values

Dan Hendrycks (UC Berkeley), Collin Burns\* (Columbia University), Steven Basart (UChicago), Andrew Critch (UC Berkeley), Jerry Li (Microsoft), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley)

\*Equal Contribution.

###### Abstract

We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents.
With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

## 1 Introduction

Embedding ethics into AI systems remains an outstanding challenge without any concrete proposal.
In popular fiction, the “Three Laws of Robotics” plot device illustrates how simplistic rules cannot encode the complexity of human values (Asimov, [1950](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib4 "")).
Some contemporary researchers argue machine learning improvements need not lead to ethical AI, as raw intelligence is orthogonal to moral behavior (Armstrong, [2013](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib3 "")).
Others have claimed that machine ethics (Moor, [2006](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib48 "")) will be an important problem in the future, but it is outside the scope of machine learning today.
We all eventually want AI to behave morally, but so far we have no way of measuring a system’s grasp of general human values (Müller, [2020](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib50 "")).

The demand for ethical machine learning (White House, [2016](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib70 ""); European Commission, [2019](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib16 "")) has already led researchers to propose various ethical principles for narrow applications. To make algorithms more fair, researchers have proposed precise mathematical criteria. However, many of these fairness criteria have been shown to be mutually incompatible (Kleinberg et al., [2017](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib37 "")), and these rigid formalizations are task-specific and have been criticized for being simplistic. To make algorithms more safe, researchers have proposed specifying safety constraints (Ray et al., [2019](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib58 "")), but in the open world these rules may have many exceptions or require interpretation. To make algorithms prosocial, researchers have proposed

... (truncated, 96 KB total)