Update on ARC's recent eval efforts - METR

web
evals.alignment.org·evals.alignment.org/blog/2023-03-18-update-on-recent-evals
Metadata

Cited by 1 page

Page	Type	Quality
Alignment Research Center (ARC)	Organization	57.0
Cached Content Preview

HTTP 200 | Fetched Apr 30, 2026 | 18 KB

We believe that capable enough AI systems could pose very large risks to the world. We do not think today’s systems are capable enough to pose these sorts of risks, but we think that this situation could change quickly and it’s important to be monitoring the risks consistently. Because of this, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight.

We attempt to elicit models’ capabilities in a controlled environment, with researchers in the loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should undergo similar [“red team”](https://en.wikipedia.org/wiki/Red_team) evaluations for dangerous capabilities before they are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable.

As we expected going in, today’s models (while impressive) weren’t capable of autonomously planning and carrying out the dangerous activities we tried to assess. But models are able to succeed at several of the necessary components. Given only the ability to write and run code, models have some success at simple tasks involving browsing the internet, getting humans to do things for them, and making long-term plans – even if they cannot yet do these things reliably.

As AI systems improve, it is becoming increasingly difficult to rule out that models might be able to autonomously gain resources and evade human oversight – so rigorous evaluation is essential. It is important to have systematic, controlled testing of these capabilities in place before models pose an imminent risk, so that labs can have advance warning when they’re getting close and know to stop scaling up models further until they have robust safety and security guarantees.

This post will briefly lay out our motivation, methodology, an example task, and high-level conclusions. The information given here isn’t enough to give a full understanding of what we did or make our results replicable, and we won’t go into detail about results with specific models. We will publish more detail on our methods and results soon.

## Motivation

Today’s AI systems can write convincing emails, give fairly useful [instructions](https://cdn.openai.com/papers/gpt-4-system-card.pdf) on how to car

... (truncated, 18 KB total)
Resource ID: 25dcf91a16f3b344 | Stable ID: sid_IBSDGB3OzQ