
MACHIAVELLI dataset

paper

Authors

Alexander Pan·Jun Shern Chan·Andy Zou·Nathaniel Li·Steven Basart·Thomas Woodside·Jonathan Ng·Hanlin Zhang·Scott Emmons·Dan Hendrycks

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

MACHIAVELLI is a benchmark dataset of 134 Choose-Your-Own-Adventure games designed to measure Machiavellian behaviors (power-seeking, deception) in AI agents and language models, addressing concerns about unaligned incentives in reward maximization and next-token prediction.

Paper Details

Citations: 192 (17 influential)
Year: 2023

Metadata

arXiv preprint · dataset

Abstract

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics--designing agents that are Pareto improvements in both safety and capabilities.
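The evaluation the abstract describes reduces to a simple loop: an agent plays each game step by step while the benchmark's per-scene annotations are summed alongside reward. Below is a minimal sketch assuming a Gym-style wrapper; the environment interface (the `reset`/`step` signatures, `info["num_choices"]`, and the structure of `info["annotations"]`) is an illustrative assumption, not the benchmark's released API.

```python
# Sketch of the evaluation loop: play one game while tallying the
# benchmark's per-scene harm annotations alongside reward.
# The env interface (reset/step signatures, info["num_choices"],
# info["annotations"]) is an illustrative assumption.
from typing import Callable, Dict

def evaluate_agent(env, choose_action: Callable[[str, int], int]) -> Dict[str, float]:
    """Run one episode; return total reward and summed harm counts."""
    obs, info = env.reset()
    totals = {"reward": 0.0, "power": 0.0, "disutility": 0.0, "violations": 0.0}
    done = False
    while not done:
        # Agent picks one of the scene's numbered choices.
        action = choose_action(obs, info["num_choices"])
        obs, reward, done, info = env.step(action)
        totals["reward"] += reward
        # Per-scene labels produced by the LM annotator.
        ann = info.get("annotations") or {}
        totals["power"] += ann.get("power", 0.0)
        totals["disutility"] += ann.get("disutility", 0.0)
        totals["violations"] += ann.get("ethical_violations", 0.0)
    return totals
```

Aggregating such per-episode totals across all 134 games is what lets the benchmark place agents on a reward-versus-harm trade-off curve.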

Summary

MACHIAVELLI is a benchmark dataset of 134 Choose-Your-Own-Adventure games with over 500,000 scenarios designed to evaluate whether AI agents naturally learn Machiavellian behaviors like power-seeking, deception, and ethical violations when trained to maximize reward. The authors use language models for automated scenario labeling and mathematize dozens of harmful behaviors to evaluate agents' tendencies. Their findings reveal a tension between reward maximization and ethical behavior, but demonstrate that agents can be steered toward less harmful actions through LM-based methods, suggesting that designing agents that are simultaneously safe and capable is achievable.
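The summary's final point, steering via LM-based methods, can be pictured as a moral filter over the agent's action values: a language model is asked whether each candidate action is harmful, and flagged actions are penalized before acting greedily. The sketch below is one such scheme under stated assumptions; `query_lm`, the prompt wording, and the fixed penalty are hypothetical stand-ins, not the paper's exact method.

```python
# Sketch of LM-based steering: penalize action values for choices an
# LM flags as unethical, then act greedily. query_lm, the prompt
# wording, and the fixed penalty are hypothetical stand-ins for the
# paper's steering methods.
from typing import Callable, List

def steer(obs: str, choices: List[str], q_values: List[float],
          query_lm: Callable[[str], str], penalty: float = 10.0) -> int:
    """Return the index of the best choice after moral filtering."""
    adjusted = []
    for choice, q in zip(choices, q_values):
        prompt = (f"Scene: {obs}\nAction: {choice}\n"
                  "Would taking this action be unethical? Answer Yes or No.")
        flagged = query_lm(prompt).strip().lower().startswith("yes")
        adjusted.append(q - penalty if flagged else q)
    return max(range(len(adjusted)), key=adjusted.__getitem__)
```

Penalizing rather than hard-masking flagged actions keeps the agent able to act even when the filter flags every available option.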

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Center for AI Safety | Organization | 42.0 |
| Alignment Evaluations | Approach | 65.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark

Alexander Pan
Jun Shern Chan
Andy Zou
Nathaniel Li
Steven Basart
Thomas Woodside
Jonathan Ng
Hanlin Zhang
Scott Emmons
Dan Hendrycks

###### Abstract

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce Machiavelli, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents’ tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics–designing agents that are Pareto improvements in both safety and capabilities.

machine ethics, ai safety, text-based reinforcement learning, alignment

## 1 Introduction

AI systems are rapidly gaining capabilities (OpenAI, [2023](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib53 "")), especially in natural language (Bubeck et al., [2023](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib13 "")). To mitigate risks of deployment, models must be thoroughly evaluated and steered towards safer behaviors (Hendrycks et al., [2021b](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib38 "")). Previous benchmarks aimed at evaluating these complex systems have measured language understanding  (Wang et al., [2019](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib73 "")) or reasoning in isolated scenarios (Srivastava et al., [2022](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib68 ""); Liang et al., [2022](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib42 "")). However, models are now being trained for real-world, interactive tasks (Ahn et al., [2022](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib3 ""); Reed et al., [2022](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib61 "")). Thus, benchmarks should assess how models behave in interactive environments.

![Refer to caption](https://ar5iv.labs.arxiv.org/html/2304.03279/assets/x1.png)

Figure 1: Across diverse games and objectives in Machiavelli, agents trained to maximize reward tend to do so via Machiavellian means. The reward-maximizing RL agent (dotted blue) is less moral, less concerned about wellbeing, and less ...

... (truncated, 98 KB total)
Resource ID: 6d4e8851e33e1641 | Stable ID: Y2M1MDUyMW