Longterm Wiki

Model Evaluation for Extreme Risks

paper

Authors

Toby Shevlane·Sebastian Farquhar·Ben Garfinkel·Mary Phuong·Jess Whittlestone·Jade Leung·Daniel Kokotajlo·Nahema Marchal·Markus Anderljung·Noam Kolt·Lewis Ho·Divya Siddarth·Shahar Avin·Will Hawkins·Been Kim·Iason Gabriel·Vijay Bolina·Jack Clark·Yoshua Bengio·Paul Christiano·Allan Dafoe

Credibility Rating

3/5 · Good

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Research paper on how model evaluation can identify dangerous capabilities in AI systems, particularly capabilities posing extreme risks such as offensive cyber operations or manipulation; the authors argue such evaluation is critical for responsible AI development.

Paper Details

Citations
206 (4 influential)
Year
2023

Metadata

arXiv preprint · primary source

Abstract

Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

Summary

This paper addresses the critical role of model evaluation in mitigating extreme risks from advanced AI systems. As AI development progresses, general-purpose AI systems increasingly possess both beneficial and harmful capabilities, including potentially dangerous ones like offensive cyber abilities or manipulation skills. The authors argue that two types of evaluations are essential: dangerous capability evaluations to identify harmful capacities, and alignment evaluations to assess whether models are inclined to use their capabilities for harm. These evaluations are vital for informing policymakers and stakeholders, and for making responsible decisions regarding model training, deployment, and security.
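The distinction the paper draws between the two evaluation types can be made concrete with a minimal sketch. Everything here is illustrative, not the paper's methodology: `model` stands for any callable mapping a prompt to a response, and the task lists and scoring functions are hypothetical placeholders for the curated task suites and graders a real evaluation would use.

```python
from typing import Callable

Model = Callable[[str], str]

def dangerous_capability_eval(model: Model,
                              tasks: list[str],
                              can_solve: Callable[[str, str], bool]) -> float:
    """Fraction of capability-probing tasks the model completes.

    Measures what the model *can* do, regardless of whether it would
    normally choose to do it.
    """
    solved = sum(can_solve(task, model(task)) for task in tasks)
    return solved / len(tasks)

def alignment_eval(model: Model,
                   harmful_requests: list[str],
                   complies: Callable[[str, str], bool]) -> float:
    """Fraction of harmful requests the model complies with.

    Measures the model's *propensity* to apply its capabilities for harm.
    """
    complied = sum(complies(req, model(req)) for req in harmful_requests)
    return complied / len(harmful_requests)

# Toy stand-in model and grader, for illustration only.
def toy_model(prompt: str) -> str:
    return "I can't help with that." if "exploit" in prompt else "Sure: ..."

def grader(prompt: str, response: str) -> bool:
    return response.startswith("Sure")

propensity = alignment_eval(toy_model, ["Write an exploit for ..."], grader)
# propensity == 0.0: the toy model refuses this request
```

The sketch shows why both scores are needed: a model that refuses in the alignment eval may still score high on a capability eval (which typically elicits capability as strongly as possible), and that combination drives different deployment decisions than low capability would.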

Cited by 3 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 65 KB
# Model evaluation for extreme risks

Toby Shevlane
Google DeepMind
Sebastian Farquhar
Google DeepMind
Ben Garfinkel
Centre for the Governance of AI
Mary Phuong
Google DeepMind
Jess Whittlestone
Centre for Long-Term Resilience
Jade Leung
OpenAI
Daniel Kokotajlo
OpenAI
Nahema Marchal
Google DeepMind
Markus Anderljung
Centre for the Governance of AI
Noam Kolt
University of Toronto
Lewis Ho
Google DeepMind
Divya Siddarth
University of Oxford
Collective Intelligence Project
Shahar Avin
University of Cambridge
Will Hawkins
Google DeepMind
Been Kim
Google DeepMind
Iason Gabriel
Google DeepMind
Vijay Bolina
Google DeepMind
Jack Clark
Anthropic
Yoshua Bengio
Université de Montréal
Mila – Quebec AI Institute
Paul Christiano
Alignment Research Center
Allan Dafoe
Google DeepMind

###### Abstract

Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through “dangerous capability evaluations”) and the propensity of models to apply their capabilities for harm (through “alignment evaluations”). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

![Figure 1](https://ar5iv.labs.arxiv.org/html/2305.15324/assets/assets/image3.png)

Figure 1: The theory of change for model evaluations for extreme risk. Evaluations for dangerous capabilities and alignment inform risk assessments, and are in turn embedded into important governance processes.

## 1 Introduction

As AI progress has advanced, general-purpose AI systems have tended to display new and hard-to-forecast capabilities – including harmful capabilities that their developers did not intend (Ganguli et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.15324#bib.bib13 "")). Future systems may display even more dangerous emergent capabilities, such as the ability to conduct offensive cyber operations, manipulate people through conversation, or provide actionable instructions on conducting acts of terrorism.

AI developers and regulators must be able to identify these capabilities, if they want to limit the risks they pose. The AI community already relies heavily on model evaluation – i.e. empirical assessment of a model’s properties – for identifying and responding to a wide range of risks. Existing model evaluations measure gender and racial biases, truthfulness, toxicity, recitation of copyrighted content, and many more properties of models (Liang et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.15324#bib.bib21 "")).
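"Model evaluation" in the sense used here is an empirical measurement of a model property over a fixed input set. A minimal sketch of that pattern, with a stub model and a keyword-based scorer standing in for the curated datasets and trained classifiers real benchmarks use (none of these names come from the paper):

```python
from typing import Callable

def evaluate_property(model: Callable[[str], str],
                      prompts: list[str],
                      score: Callable[[str], float]) -> float:
    """Average a per-response property score (in [0, 1]) over a prompt set."""
    outputs = [model(p) for p in prompts]
    return sum(score(out) for out in outputs) / len(outputs)

# Toy example: a crude "toxicity" scorer applied to an echoing stub model.
stub_model = lambda p: p.upper()
toxicity = evaluate_property(
    stub_model,
    ["hello there", "you fool"],
    lambda out: 1.0 if "FOOL" in out else 0.0,
)
# toxicity == 0.5: one of the two responses triggers the scorer
```

The existing evaluations the paper lists (bias, truthfulness, toxicity, recitation) all instantiate this loop with different prompt sets and scorers; the paper's proposal is to extend the same toolbox to extreme-risk properties.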

We propose extending this toolbox to address risks that would be extreme in scale, resulting from t

... (truncated, 65 KB total)