ICLR 2021
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Introduces ETHICS, a benchmark dataset for evaluating language models' understanding of moral concepts across justice, well-being, duties, virtues, and commonsense morality; the capability it measures is relevant to AI safety goals such as steering chatbot outputs and, eventually, regularizing open-ended reinforcement learning agents.
Paper Details
Metadata
Abstract
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
Summary
This paper introduces ETHICS, a benchmark dataset for evaluating language models' understanding of moral concepts across justice, well-being, duties, virtues, and commonsense morality. The dataset requires models to predict human moral judgments about diverse text scenarios, combining physical and social world knowledge with value judgments. The authors find that current language models demonstrate promising but incomplete ability to predict ethical judgments, suggesting that progress on machine ethics is achievable and could help align AI systems with human values.
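The following is a minimal evaluation sketch, not the authors' code: it assumes the dataset is mirrored on the Hugging Face Hub as `hendrycks/ethics` with a `commonsense` config whose rows carry an `input` scenario and a binary `label` (assumed here to mean 1 = morally wrong), and it substitutes an off-the-shelf zero-shot classifier for the fine-tuned BERT/RoBERTa models the paper actually benchmarks.

```python
# Hedged sketch: score a generic zero-shot classifier on a small sample
# of the ETHICS commonsense split. Dataset id, config, and column names
# are assumptions about the public mirror, not taken from the paper.
from datasets import load_dataset
from transformers import pipeline

ds = load_dataset("hendrycks/ethics", "commonsense", split="test")

# Stand-in model; the paper fine-tunes BERT/RoBERTa-family models instead.
clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ["morally acceptable", "morally wrong"]

n, correct = 100, 0
for row in ds.select(range(n)):  # small sample for a quick check
    pred = clf(row["input"], candidate_labels=labels)["labels"][0]
    correct += int((pred == "morally wrong") == bool(row["label"]))
print(f"accuracy on {n} sampled scenarios: {correct / n:.2f}")
```

A middling score from a sketch like this would be consistent with the paper's "promising but incomplete" finding, though only its fine-tuned models are directly comparable to the reported numbers.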
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| FAR AI | Organization | 76.0 |
| Dan Hendrycks | Person | 19.0 |
Cached Content Preview
# Aligning AI With Shared Human Values
Dan Hendrycks\* (UC Berkeley), Collin Burns\* (Columbia University), Steven Basart (UChicago), Andrew Critch (UC Berkeley), Jerry Li (Microsoft), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley)

\*Equal Contribution.
###### Abstract
We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents.
With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
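As one concrete reading of the abstract's suggestion to "regularize open-ended reinforcement learning agents", here is a hypothetical reward-shaping sketch: a classifier fine-tuned on ETHICS scores a text description of the agent's behavior, and its predicted wrongness is subtracted from the task reward. Every name here is illustrative; the paper measures judgment quality only and specifies no shaping scheme.

```python
# Hypothetical sketch, not from the paper: shape an RL reward with a
# morality score. `env_reward` is the task reward; `wrongness` is an
# ETHICS-trained classifier's probability in [0, 1] that a text
# description of the agent's action is morally wrong.
def shaped_reward(env_reward: float, wrongness: float, lam: float = 1.0) -> float:
    # Subtract a penalty proportional to predicted moral wrongness.
    return env_reward - lam * wrongness

print(shaped_reward(1.0, 0.9))  # a high-reward but "wrong" action nets 0.1
```

The weight `lam` trades off task performance against the moral penalty; choosing it well is an open problem the paper does not address.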
## 1 Introduction
Embedding ethics into AI systems remains an outstanding challenge without any concrete proposal.
In popular fiction, the “Three Laws of Robotics” plot device illustrates how simplistic rules cannot encode the complexity of human values (Asimov, [1950](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib4 "")).
Some contemporary researchers argue machine learning improvements need not lead to ethical AI, as raw intelligence is orthogonal to moral behavior (Armstrong, [2013](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib3 "")).
Others have claimed that machine ethics (Moor, [2006](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib48 "")) will be an important problem in the future, but it is outside the scope of machine learning today.
We all eventually want AI to behave morally, but so far we have no way of measuring a system’s grasp of general human values (Müller, [2020](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib50 "")).
The demand for ethical machine learning (White House, [2016](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib70 ""); European Commission, [2019](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib16 "")) has already led researchers to propose various ethical principles for narrow applications. To make algorithms more fair, researchers have proposed precise mathematical criteria. However, many of these fairness criteria have been shown to be mutually incompatible (Kleinberg et al., [2017](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib37 "")), and these rigid formalizations are task-specific and have been criticized for being simplistic. To make algorithms more safe, researchers have proposed specifying safety constraints (Ray et al., [2019](https://ar5iv.labs.arxiv.org/html/2008.02275#bib.bib58 "")), but in the open world these rules may have many exceptions or require interpretation. To make algorithms prosocial, researchers have proposed
... (truncated, 96 KB total)