Longterm Wiki

An update on our general capability evaluations - METR

web

Credibility Rating

4/5
High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

Metadata

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Alignment Research Center (ARC) | Organization | 57.0 |

Cached Content Preview

HTTP 200 | Fetched Apr 30, 2026 | 17 KB

Since releasing our [autonomy evaluation resources](https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/), METR has primarily focused on developing better measures of AI capabilities. We are currently developing evaluations for AI R&D capabilities that may serve as “red lines”,[\[1\]](https://metr.org/blog/2024-08-06-update-on-evaluations/#fn:1) as well as a continuous measure of autonomous capabilities in general. In this update, we’ll talk about our recent efforts to develop evaluations for general autonomous capabilities; we plan to provide an update on our AI R&D evaluations in the near future. Note that this post is an intermediate research update for those interested in our work, and it includes preliminary results that should not be considered definitive.

**Why not focus solely on “red line” evaluations for specific threat models?** Unlike the ability to develop bioweapons or execute high-value cyberattacks, possessing autonomy “in general” doesn’t directly enable an AI system to cause catastrophically bad outcomes. However, we still think that it’s worth developing more general measures of autonomous capabilities for several reasons:

1. Autonomy is a proxy for the extent to which an AI system can have far-reaching impacts on the world with minimal human involvement. We believe that such a metric might be robustly useful across a variety of threat models.
2. One of our goals for evaluations is to assist with forecasting the capabilities or impacts of AI systems. For this, we expect a general capability measure that provides signal at various scales to be more useful than evaluations that contain only hard tasks more directly tied to threat models.
3. We’d like AI developers and evaluation organizations to experiment with evaluation and elicitation procedures. We expect that working on suites of tasks that span a variety of difficulties, including those where current systems succeed or are likely to succeed given moderate increases in scale or post-training enhancements, will provide a more useful feedback loop than working solely on red lines.

![Diagram contrasting 'red-line' thresholds and general metrics.](https://metr.org/assets/images/aug2024_threshold_vs_metric.png)

Figure 1: Our work is split between creating threat-specific evaluation suites for use in “red lines”, and creating a general autonomous capability measure that provides signal in advance of a red line being crossed.

**Creating tasks that cover a broad variety of capabilities and difficulty levels.** Building on [our previous example task suite](https://github.com/METR/public-tasks), we’ve developed around 50 more automatically scored tasks that meas

... (truncated, 17 KB total)
Resource ID: 81cf326834f21a67 | Stable ID: sid_lfTI2FIzkg