Berkeley CLTC Working Paper on Intolerable Risk Thresholds
cltc.berkeley.edu/wp-content/uploads/2024/11/Working-Pape...
Published November 2024 by Berkeley's CLTC, this working paper is relevant to policymakers and safety researchers working on AI red lines, model evaluations, and governance frameworks for frontier AI systems.
Metadata
Importance: 62/100 · working paper · analysis
Summary
This Berkeley Center for Long-Term Cybersecurity working paper examines how to define and operationalize 'intolerable risk' thresholds for AI systems, providing a framework for identifying which AI capabilities or behaviors should be categorically prohibited or constrained. It contributes to the growing policy and technical discourse around AI red lines and safety limits.
Key Points
- Proposes frameworks for defining 'intolerable risk' thresholds that can guide AI governance and deployment decisions
- Distinguishes between quantitative and qualitative approaches to setting risk limits for AI systems
- Examines how threshold-based thinking can inform regulatory frameworks and voluntary developer commitments
- Addresses the challenge of operationalizing abstract safety concepts into concrete, evaluable criteria
- Connects threshold-setting to broader AI safety evaluations and frontier model governance efforts
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 87 KB
# Toward Thresholds for Intolerable Risks Posed by Frontier AI Models
WORKING PAPER
# Provisional Recommendations and Considerations for Intolerable Risk Thresholds to Inform Frontier AI Safety Frameworks
Anthony M. Barrett, Jessica Newman, Deepika Raman, Nada Madkour, Evan R. Murphy UC Berkeley Center for Long-Term Cybersecurity
15 November 2024
Affiliations listed above are current, or were during the authors’ main contributions to this work. Views expressed here are those of the authors, and do not represent views of UC Berkeley nor others.
# Table of Contents
Executive Summary
1. Background
1.1 The Call for Defining Intolerable Risk Thresholds ... 6
1.2 Existing Intolerable Risk Thresholds ... 6
2. Proposal to Advance Intolerable Risk Thresholds
2.1 Recommended Inclusions for Intolerable Risks ... 8
Persuasion ... 8
Deception ... 9
2.2 Thresholds for Key Intolerable Risk Categories ... 10
2.3 Additional Intolerable Risks ... 14
2.3.1 Unacceptable Uses ... 14
2.3.2 Unacceptable Limitations ... 16
2.3.3 Unacceptable Impacts ... 16
2.4 Key Considerations ... 17
3. Limitations ... 18
References ... 20
Appendices ... 28
Appendix A: Types of Thresholds for Frontier Models ... 28
Appendix B: Examples of Quantitative Capability Thresholds ... 29
Appendix C: Key Principles to Determine Thresholds ... 32
# Executive Summary
AI foundation models, especially increasingly advanced frontier models, may pose intolerable risks. For example, frontier models may lower barriers for a terrorist, state-affiliated threat actor, or other adversary seeking to cause high-impact events such as CBRN or CBRNE (chemical, biological, radiological, nuclear, explosive) attacks or cyber-attacks. In this paper, we consider intolerable risks to be those that have the potential for severe or catastrophic impact, are relevant to current and emerging AI capabilities, are likely to be irreversible, and have a short expected timescale of impact.
A number of frameworks for pre-release risk assessment and decision-making include AI dual-use capability evaluation and some form of explicit or implicit threshold for a dual-use capability hazard that should be regarded as intolerable. However, such thresholds are often defined at a high level, in qualitative language that cannot readily be compared to the results of dual-use capability assessments, such as red-teaming-based evaluations. Model developers and evaluators may be left without a reasonably clear, consistent, and operationalized answer to the question, "How much lowering of barriers is too much?"
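To make that gap concrete, the sketch below shows what an operationalized answer to the question could look like in code. It is a minimal illustration, not the authors' method: the category names, the 0-1 uplift scale, the numeric threshold values, and the confidence gate are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass

# Hypothetical sketch only: the risk categories, the 0-1 uplift scale, the
# numeric thresholds, and the confidence gate below are illustrative
# assumptions, not values or methods taken from the paper.

@dataclass
class CapabilityEvaluation:
    """One result from a dual-use capability assessment (e.g., red-teaming)."""
    risk_category: str   # e.g., "cbrn_uplift" or "cyber_offense" (assumed names)
    uplift_score: float  # measured barrier-lowering on an agreed 0-1 scale
    confidence: float    # evaluator confidence in the measurement, 0-1

# Placeholder per-category thresholds for "too much lowering of barriers".
INTOLERABLE_THRESHOLDS = {
    "cbrn_uplift": 0.30,
    "cyber_offense": 0.40,
}

def exceeds_intolerable_threshold(result: CapabilityEvaluation,
                                  min_confidence: float = 0.7) -> bool:
    """Return True only when the measured uplift is both above the category
    threshold and confident enough to act on; a low-confidence measurement
    should prompt re-evaluation rather than a deployment decision."""
    threshold = INTOLERABLE_THRESHOLDS.get(result.risk_category)
    if threshold is None:
        raise ValueError(f"no threshold defined for {result.risk_category!r}")
    return result.confidence >= min_confidence and result.uplift_score >= threshold

# Example: 0.35 measured uplift in the CBRN category crosses the
# illustrative 0.30 threshold, so this prints True.
print(exceeds_intolerable_threshold(
    CapabilityEvaluation("cbrn_uplift", uplift_score=0.35, confidence=0.9)))
```

The confidence gate is one possible design response to the comparability problem raised above: a qualitative or low-confidence assessment is never compared directly against a numeric threshold.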
Now is the time to consider that question. One reason is to inform work by AI industry organizations, under the Frontier AI Safety Commitments (DSIT 2024a), on defining intolerable risk thresholds before the AI Action Summit in France, to be held February 10-11, 2025. In the absence of clear guidance from regulators, academics, or civil society that place a high priority on prote
... (truncated, 87 KB total)
Resource ID: d1774c2286e7c730 | Stable ID: ZTU3N2ExNm