
MACHIAVELLI dataset

paper

Authors

Alexander Pan·Jun Shern Chan·Andy Zou·Nathaniel Li·Steven Basart·Thomas Woodside·Jonathan Ng·Hanlin Zhang·Scott Emmons·Dan Hendrycks

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

MACHIAVELLI is a benchmark dataset of 134 Choose-Your-Own-Adventure games designed to measure Machiavellian behaviors (power-seeking, deception) in AI agents and language models, addressing concerns about unaligned incentives in reward maximization and next-token prediction.

Paper Details

Citations: 192 (17 influential)
Year: 2023

Metadata

arXiv preprint · dataset

Abstract

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics--designing agents that are Pareto improvements in both safety and capabilities.
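The evaluation the abstract describes reduces to a simple loop: an agent plays each game step by step while the benchmark's per-scene annotations are summed alongside reward. Below is a minimal sketch assuming a Gym-style wrapper; the environment interface (the `reset`/`step` signatures, `info["num_choices"]`, and the structure of `info["annotations"]`) is an illustrative assumption, not the benchmark's released API.

```python
# Sketch of the evaluation loop: play one game while tallying the
# benchmark's per-scene harm annotations alongside reward.
# The env interface (reset/step signatures, info["num_choices"],
# info["annotations"]) is an illustrative assumption.
from typing import Callable, Dict

def evaluate_agent(env, choose_action: Callable[[str, int], int]) -> Dict[str, float]:
    """Run one episode; return total reward and summed harm counts."""
    obs, info = env.reset()
    totals = {"reward": 0.0, "power": 0.0, "disutility": 0.0, "violations": 0.0}
    done = False
    while not done:
        # Agent picks one of the scene's numbered choices.
        action = choose_action(obs, info["num_choices"])
        obs, reward, done, info = env.step(action)
        totals["reward"] += reward
        # Per-scene labels produced by the LM annotator.
        ann = info.get("annotations") or {}
        totals["power"] += ann.get("power", 0.0)
        totals["disutility"] += ann.get("disutility", 0.0)
        totals["violations"] += ann.get("ethical_violations", 0.0)
    return totals
```

Aggregating such per-episode totals across all 134 games is what lets the benchmark place agents on a reward-versus-harm trade-off curve.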

Summary

MACHIAVELLI is a benchmark dataset of 134 Choose-Your-Own-Adventure games with over 500,000 scenarios designed to evaluate whether AI agents naturally learn Machiavellian behaviors like power-seeking, deception, and ethical violations when trained to maximize reward. The authors use language models for automated scenario labeling and mathematize dozens of harmful behaviors to evaluate agents' tendencies. Their findings reveal a tension between reward maximization and ethical behavior, but demonstrate that agents can be steered toward less harmful actions through LM-based methods, suggesting that designing agents that are simultaneously safe and capable is achievable.
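The summary's final point, steering via LM-based methods, can be pictured as a moral filter over the agent's action values: a language model is asked whether each candidate action is harmful, and flagged actions are penalized before acting greedily. The sketch below is one such scheme under stated assumptions; `query_lm`, the prompt wording, and the fixed penalty are hypothetical stand-ins, not the paper's exact method.

```python
# Sketch of LM-based steering: penalize action values for choices an
# LM flags as unethical, then act greedily. query_lm, the prompt
# wording, and the fixed penalty are hypothetical stand-ins for the
# paper's steering methods.
from typing import Callable, List

def steer(obs: str, choices: List[str], q_values: List[float],
          query_lm: Callable[[str], str], penalty: float = 10.0) -> int:
    """Return the index of the best choice after moral filtering."""
    adjusted = []
    for choice, q in zip(choices, q_values):
        prompt = (f"Scene: {obs}\nAction: {choice}\n"
                  "Would taking this action be unethical? Answer Yes or No.")
        flagged = query_lm(prompt).strip().lower().startswith("yes")
        adjusted.append(q - penalty if flagged else q)
    return max(range(len(adjusted)), key=adjusted.__getitem__)
```

Penalizing rather than hard-masking flagged actions keeps the agent able to act even when the filter flags every available option.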

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Center for AI Safety | Organization | 42.0 |
| Alignment Evaluations | Approach | 65.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark

Alexander Pan
Jun Shern Chan
Andy Zou
Nathaniel Li
Steven Basart
Thomas Woodside
Jonathan Ng
Hanlin Zhang
Scott Emmons
Dan Hendrycks

###### Abstract

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce Machiavelli, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents’ tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics–designing agents that are Pareto improvements in both safety and capabilities.

machine ethics, ai safety, text-based reinforcement learning, alignment

## 1 Introduction

AI systems are rapidly gaining capabilities (OpenAI, [2023](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib53 "")), especially in natural language (Bubeck et al., [2023](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib13 "")). To mitigate risks of deployment, models must be thoroughly evaluated and steered towards safer behaviors (Hendrycks et al., [2021b](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib38 "")). Previous benchmarks aimed at evaluating these complex systems have measured language understanding  (Wang et al., [2019](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib73 "")) or reasoning in isolated scenarios (Srivastava et al., [2022](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib68 ""); Liang et al., [2022](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib42 "")). However, models are now being trained for real-world, interactive tasks (Ahn et al., [2022](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib3 ""); Reed et al., [2022](https://ar5iv.labs.arxiv.org/html/2304.03279#bib.bib61 "")). Thus, benchmarks should assess how models behave in interactive environments.

![Refer to caption](https://ar5iv.labs.arxiv.org/html/2304.03279/assets/x1.png)

Figure 1: Across diverse games and objectives in Machiavelli, agents trained to maximize reward tend to do so via Machiavellian means. The reward-maximizing RL agent (dotted blue) is less moral, less concerned about wellbeing, and less ...

... (truncated, 98 KB total)
Resource ID: 6d4e8851e33e1641 | Stable ID: Y2M1MDUyMW