Model Evaluation for Extreme Risks
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Research paper on how model evaluation can identify dangerous capabilities in AI systems, particularly capabilities that pose extreme risks such as offensive cyber operations or manipulation; such evaluations are critical for responsible AI development.
Paper Details
Metadata
Abstract
Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
Summary
This paper addresses the critical role of model evaluation in mitigating extreme risks from advanced AI systems. As AI development progresses, general-purpose systems increasingly possess both beneficial and harmful capabilities, including potentially dangerous ones such as offensive cyber capabilities or manipulation skills. The authors argue that two types of evaluation are essential: dangerous capability evaluations, which identify harmful capabilities, and alignment evaluations, which assess whether a model is inclined to apply its capabilities for harm. These evaluations are vital for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
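To make the two evaluation types concrete, here is a minimal sketch of how a risk assessment might combine them into a deployment decision. Everything here is hypothetical: the names (`EvaluationResult`, `assess_extreme_risk`), the 0-to-1 score scale, and the 0.5 thresholds are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    """One evaluation outcome; score is in [0, 1], higher = more concerning."""
    name: str
    score: float

def assess_extreme_risk(
    capability_evals: list[EvaluationResult],
    alignment_evals: list[EvaluationResult],
    capability_threshold: float = 0.5,
    alignment_threshold: float = 0.5,
) -> str:
    """Toy risk assessment: flag a model as highest-risk only when it both
    possesses a dangerous capability and shows propensity to apply it for harm."""
    has_dangerous_capability = any(
        e.score >= capability_threshold for e in capability_evals
    )
    is_misaligned = any(e.score >= alignment_threshold for e in alignment_evals)

    if has_dangerous_capability and is_misaligned:
        return "block deployment; escalate to security review"
    if has_dangerous_capability:
        return "deploy with safeguards; monitor alignment continuously"
    return "proceed with standard review"

# Example with made-up scores: capable but apparently well-aligned.
print(assess_extreme_risk(
    capability_evals=[EvaluationResult("offensive_cyber", 0.7)],
    alignment_evals=[EvaluationResult("harm_propensity", 0.2)],
))  # -> deploy with safeguards; monitor alignment continuously
```

Requiring both conditions for the highest-risk outcome mirrors the abstract's framing: dangerous capability alone warrants heightened security and monitoring, while capability combined with misalignment is what constitutes an extreme risk.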
Cited by 3 pages
| Page | Type | Quality Score |
|---|---|---|
| AI Safety Defense in Depth Model | Analysis | 69.0 |
| AI Risk Activation Timeline Model | Analysis | 66.0 |
| Governance-Focused Worldview | Concept | 67.0 |
Cached Content Preview
# Model evaluation for extreme risks
Toby Shevlane (Google DeepMind), Sebastian Farquhar (Google DeepMind), Ben Garfinkel (Centre for the Governance of AI), Mary Phuong (Google DeepMind), Jess Whittlestone (Centre for Long-Term Resilience), Jade Leung (OpenAI), Daniel Kokotajlo (OpenAI), Nahema Marchal (Google DeepMind), Markus Anderljung (Centre for the Governance of AI), Noam Kolt (University of Toronto), Lewis Ho (Google DeepMind), Divya Siddarth (University of Oxford; Collective Intelligence Project), Shahar Avin (University of Cambridge), Will Hawkins (Google DeepMind), Been Kim (Google DeepMind), Iason Gabriel (Google DeepMind), Vijay Bolina (Google DeepMind), Jack Clark (Anthropic), Yoshua Bengio (Université de Montréal; Mila – Quebec AI Institute), Paul Christiano (Alignment Research Center), Allan Dafoe (Google DeepMind)
###### Abstract
Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through “dangerous capability evaluations”) and the propensity of models to apply their capabilities for harm (through “alignment evaluations”). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
Figure 1: The theory of change for model evaluations for extreme risk. Evaluations for dangerous capabilities and alignment inform risk assessments, and are in turn embedded into important governance processes.
## 1 Introduction
As AI progress has advanced, general-purpose AI systems have tended to display new and hard-to-forecast capabilities – including harmful capabilities that their developers did not intend (Ganguli et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.15324#bib.bib13)). Future systems may display even more dangerous emergent capabilities, such as the ability to conduct offensive cyber operations, manipulate people through conversation, or provide actionable instructions on conducting acts of terrorism.
AI developers and regulators must be able to identify these capabilities, if they want to limit the risks they pose. The AI community already relies heavily on model evaluation – i.e. empirical assessment of a model’s properties – for identifying and responding to a wide range of risks. Existing model evaluations measure gender and racial biases, truthfulness, toxicity, recitation of copyrighted content, and many more properties of models (Liang et al., [2022](https://ar5iv.labs.arxiv.org/html/2305.15324#bib.bib21)).
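To illustrate what "empirical assessment of a model's properties" looks like in practice, below is a minimal evaluation-harness sketch. It is not any existing benchmark's API: `model_fn`, the prompt/check item format, and the failure-rate metric are assumptions made for this example.

```python
from typing import Callable

def run_evaluation(
    model_fn: Callable[[str], str],
    benchmark: list[dict],
) -> float:
    """Score a model against a benchmark of prompt/check items.

    Each "check" is a predicate over the model's output; the return value
    is the fraction of items the model fails (higher = more concerning).
    """
    failures = 0
    for item in benchmark:
        output = model_fn(item["prompt"])
        if not item["check"](output):
            failures += 1
    return failures / len(benchmark)

# Toy truthfulness probe with a stubbed model standing in for a real one.
toy_benchmark = [
    {"prompt": "Is the Earth flat?", "check": lambda out: "no" in out.lower()},
]
rate = run_evaluation(lambda prompt: "No, it is roughly spherical.", toy_benchmark)
print(f"failure rate: {rate:.0%}")  # -> failure rate: 0%
```

Real evaluation suites differ mainly in scale and scoring (graded rubrics, human raters, model-based judges), but this loop of prompting a model and scoring its outputs is the common core.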
We propose extending this toolbox to address risks that would be extreme in scale, resulting from t
... (truncated, 65 KB total)