Beth Barnes - Safety evaluations and standards for AI (EA Forum)

2023·EA Forum·forum.effectivealtruism.org/posts/49rzRKh2ZYH2QjPkg/safet...

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: EA Forum

Transcript and video of Beth Barnes's EAG 2023 talk outlining ARC Evals' strategy; directly relevant to the emerging field of pre-deployment safety evaluations and the development of industry standards around dangerous capability thresholds.

Forum Post Details

Forum

eaforum

Forum Tags

AI safety, AI evaluations and standards, Alignment Research Center

Metadata

Importance: 72/100 · blog post · primary source

Summary

Beth Barnes of ARC Evals presents the case for high-effort, targeted safety evaluations as a promising intervention for reducing existential risk from AI. She focuses on the 'autonomous replication' threshold as a key danger signal and explains how evaluations can create accountability mechanisms that encourage labs to meet safety standards before deployment.

Key Points

• ARC Evals focuses on high-effort, human-intensive evaluations targeting 'when would we know this model is an x-risk?' rather than automated benchmarks.
• The 'autonomous replication' threshold (whether a model can self-replicate and acquire resources autonomously) is proposed as a key danger signal for deployment decisions.
• Evaluations are framed as a governance lever: creating standards that labs can agree to, linking safety assessments to deployment decisions.
• A core challenge is that evaluations must be anticipatory, identifying early warning signs before dangerous capabilities fully emerge in deployed models.
• The approach addresses alignment indirectly by establishing external accountability rather than solving alignment technically.

Cited by 1 page

Page	Type	Quality
METR	Organization	66.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 33 KB

# Safety evaluations and standards for AI | Beth Barnes | EAG Bay Area 23
By Beth Barnes
Published: 2023-06-16
This is a video and transcript for a talk given at EA Global, Bay Area 2023.

You can see the slides for this talk [here](https://docs.google.com/presentation/d/1-aGGxEF3HSGMvBeXkEmEADZLRPWO30Pb/edit?usp=sharing&ouid=114907372317547682074&rtpof=true&sd=true).

* * *

Here's the full transcript with slides:

Hey everyone, thanks for coming. So, I am going to be talking about evaluations - why I think evaluations are a promising intervention for reducing existential risk from AI. First, I'm just going to give a little bit of background and situation and what we mean by this evals thing. Then, I'm going to motivate why this might be a promising way to reduce AI x-risk, explain why we feel like we're getting some amount of traction with our current projects, and finally answer some questions I often get asked about eval stuff. 

**Background**
==============

So, background: what do I mean by evals? Before we start, I'm going to assume that everyone is familiar with existential risk from AI, the alignment problem, and the general landscape of who's building AI and how that's going. I'll also assume that people have played a bit with the most recent models, ChatGPT and that sort of thing, and have some sense of what those models are like.

![](https://lh5.googleusercontent.com/Z0c46UEENDDUt4rwbzPmbdPock9NrO5-IRo3tPksD2UiI_OEXVHWM0FmEU2Xy5Thf5o8GF6ott3D-fFkkYpX35WFOWO8u3NG1KIMahEhLCyYDEecO0mdLVVeqvtcd3kBCDL89U95LZvNPQ4YC4zv52g)

A bunch of people are doing different and very cool work on the broad topic of evaluating models for dangerous things. I think you can maybe pick two axes to separate these. One is whether you're trying to build some benchmark that can be run automatically, very cheaply, that gives you a quantitative result very fast, versus something that takes many researchers many, many months of actively probing the model and interacting with it. And the other axis, which I think is somewhat correlated, is: how much are you doing science and looking at precursors of dangerous capabilities or dangerous properties we expect to exist in future models, versus actually trying to take a current model, or a near-future imagined model, and be like, “Okay, how would we tell if this model was actually an x-risk?” Most of the other people who are working on this are generally more in this bottom left quadrant, which is more automated, \[involves\] evaluating scaling laws of these quantitative benchmarks. We are trying to be in this other corner, doing very high-effort evaluations that are more targeted on like, “when would we know this thing is an x-risk?”

**Our plan**
============

![](https://lh5.googleusercontent.com/C2TjXz_WaVa8L8mwElbnN3UxSMxr37VDw1fl6BZm_IiNxQUdMFIP_b5TIF0Ua2j-wuOYc_ty3zbIVCd03pU93q2jkzURfmtf9oMlZIgnhWvX1vhA6QRUBOIXgF1DtyUA_VkEy2ZUOc1iYKSP7VG5VGQ)

So our overall plan is “Well, it sort of seems li

... (truncated, 33 KB total)

Resource ID: 14b7d36fb24c1627 | Stable ID: sid_yrrSCbnWjZ