Model Evaluation for Extreme Risks
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Research paper on how model evaluation can identify dangerous capabilities in AI systems, particularly capabilities that pose extreme risks such as offensive cyber operations or manipulation; such evaluations are critical for responsible AI development.
Paper Details
Metadata
Abstract
Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
Summary
This paper addresses the critical role of model evaluation in mitigating extreme risks from advanced AI systems. As AI development progresses, general-purpose systems increasingly possess both beneficial and harmful capabilities, including potentially dangerous ones such as offensive cyber capabilities or manipulation skills. The authors argue that two types of evaluation are essential: dangerous capability evaluations, which identify harmful capabilities, and alignment evaluations, which assess whether a model is inclined to apply its capabilities for harm. These evaluations are vital for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
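To make the two evaluation types concrete, here is a minimal sketch of how a risk assessment might combine them into a deployment decision. Everything here is hypothetical: the names (`EvaluationResult`, `assess_extreme_risk`), the 0-to-1 score scale, and the 0.5 thresholds are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    """One evaluation outcome; score is in [0, 1], higher = more concerning."""
    name: str
    score: float

def assess_extreme_risk(
    capability_evals: list[EvaluationResult],
    alignment_evals: list[EvaluationResult],
    capability_threshold: float = 0.5,
    alignment_threshold: float = 0.5,
) -> str:
    """Toy risk assessment: flag a model as highest-risk only when it both
    possesses a dangerous capability and shows propensity to apply it for harm."""
    has_dangerous_capability = any(
        e.score >= capability_threshold for e in capability_evals
    )
    is_misaligned = any(e.score >= alignment_threshold for e in alignment_evals)

    if has_dangerous_capability and is_misaligned:
        return "block deployment; escalate to security review"
    if has_dangerous_capability:
        return "deploy with safeguards; monitor alignment continuously"
    return "proceed with standard review"

# Example with made-up scores: capable but apparently well-aligned.
print(assess_extreme_risk(
    capability_evals=[EvaluationResult("offensive_cyber", 0.7)],
    alignment_evals=[EvaluationResult("harm_propensity", 0.2)],
))  # -> deploy with safeguards; monitor alignment continuously
```

Requiring both conditions for the highest-risk outcome mirrors the abstract's framing: dangerous capability alone warrants heightened security and monitoring, while capability combined with misalignment is what constitutes an extreme risk.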
Cited by 3 pages
| Page | Type | Quality Score |
|---|---|---|
| AI Safety Defense in Depth Model | Analysis | 69.0 |
| AI Risk Activation Timeline Model | Analysis | 66.0 |
| Governance-Focused Worldview | Concept | 67.0 |
Cached Content Preview
# Model evaluation for extreme risks
Toby Shevlane (Google DeepMind), Sebastian Farquhar (Google DeepMind), Ben Garfinkel (Centre for the Governance of AI), Mary Phuong (Google DeepMind), Jess Whittlestone (Centre for Long-Term Resilience), Jade Leung (OpenAI), Daniel Kokotajlo (OpenAI), Nahema Marchal (Google DeepMind), Markus Anderljung (Centre for the Governance of AI), Noam Kolt (University of Toronto), Lewis Ho (Google DeepMind), Divya Siddarth (University of Oxford; Collective Intelligence Project), Shahar Avin (University of Cambridge), Will Hawkins (Google DeepMind), Been Kim (Google DeepMind), Iason Gabriel (Google DeepMind), Vijay Bolina (Google DeepMind), Jack Clark (Anthropic), Yoshua Bengio (Université de Montréal; Mila – Quebec AI Institute), Paul Christiano (Alignment Research Center), Allan Dafoe (Google DeepMind)
###### Abstract
Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through “dangerous capability evaluations”) and the propensity of models to apply their capabilities for harm (through “alignment evaluations”). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
Figure 1: The theory of change for model evaluations for extreme risk. Evaluations for dangerous capabilities and alignment inform risk assessments, and are in turn embedded into important governance processes.
## 1 Introduction
As AI progress has advanced, general-purpose AI systems have tended to display new and hard-to-forecast capabilities – including harmful capabilities that their developers did not intend (Ganguli et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.15324#bib.bib13)). Future systems may display even more dangerous emergent capabilities, such as the ability to conduct offensive cyber operations, manipulate people through conversation, or provide actionable instructions on conducting acts of terrorism.
AI developers and regulators must be able to identify these capabilities, if they want to limit the risks they pose. The AI community already relies heavily on model evaluation – i.e. empirical assessment of a model’s properties – for identifying and responding to a wide range of risks. Existing model evaluations measure gender and racial biases, truthfulness, toxicity, recitation of copyrighted content, and many more properties of models (Liang et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.15324#bib.bib21)).
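To illustrate what "empirical assessment of a model's properties" looks like in practice, below is a minimal evaluation-harness sketch. It is not any existing benchmark's API: `model_fn`, the prompt/check item format, and the failure-rate metric are assumptions made for this example.

```python
from typing import Callable

def run_evaluation(
    model_fn: Callable[[str], str],
    benchmark: list[dict],
) -> float:
    """Score a model against a benchmark of prompt/check items.

    Each "check" is a predicate over the model's output; the return value
    is the fraction of items the model fails (higher = more concerning).
    """
    failures = 0
    for item in benchmark:
        output = model_fn(item["prompt"])
        if not item["check"](output):
            failures += 1
    return failures / len(benchmark)

# Toy truthfulness probe with a stubbed model standing in for a real one.
toy_benchmark = [
    {"prompt": "Is the Earth flat?", "check": lambda out: "no" in out.lower()},
]
rate = run_evaluation(lambda prompt: "No, it is roughly spherical.", toy_benchmark)
print(f"failure rate: {rate:.0%}")  # -> failure rate: 0%
```

Real evaluation suites differ mainly in scale and scoring (graded rubrics, human raters, model-based judges), but this loop of prompting a model and scoring its outputs is the common core.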
We propose extending this toolbox to address risks that would be extreme in scale, resulting from t
... (truncated, 65 KB total)