Beth Barnes - Safety evaluations and standards for AI (EA Forum)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: EA Forum
Transcript and video of Beth Barnes's EAG 2023 talk outlining ARC Evals' strategy; directly relevant to the emerging field of pre-deployment safety evaluations and the development of industry standards around dangerous capability thresholds.
Forum Post Details
Metadata
Summary
Beth Barnes of ARC Evals presents the case for high-effort, targeted safety evaluations as a promising intervention for reducing existential risk from AI. She focuses on the 'autonomous replication' threshold as a key danger signal and explains how evaluations can create accountability mechanisms that encourage labs to meet safety standards before deployment.
Key Points
- ARC Evals focuses on high-effort, human-intensive evaluations targeting 'when would we know this model is an x-risk?' rather than automated benchmarks.
- The 'autonomous replication' threshold (whether a model can self-replicate and acquire resources autonomously) is proposed as a key danger signal for deployment decisions.
- Evaluations are framed as a governance lever: creating standards that labs can agree to and linking safety assessments to deployment decisions (a minimal sketch of such a gate follows this list).
- A core challenge is that evaluations must be anticipatory, identifying early warning signs before dangerous capabilities fully emerge in deployed models.
- The approach addresses alignment indirectly, by establishing external accountability rather than solving alignment technically.
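To make the deployment-linkage idea concrete, here is a minimal, hypothetical sketch of how eval results could gate a release decision. This is not ARC Evals' actual tooling; the `EvalResult` schema, the task names, and the `may_deploy` gate are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Outcome of one dangerous-capability task (hypothetical schema)."""
    task: str        # e.g. "acquire cloud compute" -- an invented task name
    succeeded: bool  # did the model complete the task autonomously?

def may_deploy(results: list[EvalResult]) -> bool:
    """Hypothetical gate: block deployment if the model passes any task
    in an autonomous-replication-style suite."""
    flagged = [r.task for r in results if r.succeeded]
    if flagged:
        print(f"Deployment blocked; capabilities flagged: {flagged}")
        return False
    return True

# Usage with made-up results:
if __name__ == "__main__":
    suite = [
        EvalResult("acquire cloud compute", False),
        EvalResult("copy own weights to a new server", False),
    ]
    print("May deploy:", may_deploy(suite))
```

The design choice being illustrated is that the standard, not lab discretion, determines the outcome: any success on a threshold task mechanically blocks release.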
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| METR | Organization | 66.0 |
Cached Content Preview
Safety evaluations and standards for AI | Beth Barnes | EAG Bay Area 23 — EA Forum
by Beth Barnes · Jun 16, 2023 · 20 min read
Tags: AI safety, AI evaluations and standards, Alignment Research Center, Frontpage

Contents: Background; Our plan; A high level story of AI x-risk; Overall idea; "Autonomous replication" threshold; An example task; Why this threshold?; Downsides of this eval; Missing some threat models; Humans in the loop = subjective and expensive; FAQs: possible issues and how we're going to deal with them; Why would labs agree to safety standards?; How does alignment come into it?; Are you worried about accelerating capabilities?

This is a video and transcript for a talk given at EA Global, Bay Area 2023.
You can see the slides for this talk here
Here's the full transcript with slides:
Hey everyone, thanks for coming. So, I am going to be talking about evaluations: why I think evaluations are a promising intervention for reducing existential risk from AI. First, I'm going to give a little bit of background and explain what we mean by this evals thing. Then, I'm going to motivate why this might be a promising way to reduce AI x-risk, explain why we feel like we're getting some traction with our current projects, and finally answer some questions I often get asked about evals.
Background
So, background: what do I mean by evals? Before we start, I'm going to basically assume that everyone is familiar with existential risk from AI, the alignment problem, and the general landscape of who's building AI and how that's going, and assume that people have played a bit with the most recent models, ChatGPT and that sort of thing, and have some sense of what those models are like.
A bunch of people are doing different and very cool work on the broad topic of evaluating models for dangerous things. I think you can pick two axes to separate these. One is whether you're trying to build a benchmark that can be run automatically and very cheaply, giving you a quantitative result very fast, versus something that takes many researchers many months of actively probing and interacting with the model. The other axis, which I think is somewhat correlated, is how much you're doing science and looking at precursors of dangerous capabilities or dangerous properties we expect to exist in future models, versus actually trying to take a current model, or a near-future imagined model, and asking, "Okay, how would we tell if this model was actually an x-risk?" Most of the other people who are working on this are generally more in this bottom left quadrant,
... (truncated, 29 KB total)