ARC Evals - Giving What We Can
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Giving What We Can
ARC Evals has since rebranded as METR (Model Evaluation and Threat Research); this page remains useful for those researching donation opportunities or the organizational landscape of AI safety evaluations.
Metadata
Summary
This Giving What We Can page profiles ARC Evals (now METR), an organization focused on evaluating frontier AI models for dangerous capabilities and autonomous behaviors. The page provides donation and impact information for those considering funding AI safety evaluations work. ARC Evals develops rigorous benchmarks and tests to assess whether AI systems pose catastrophic risks before deployment.
Key Points
- ARC Evals (rebranded as METR) specializes in evaluating frontier AI models for potentially dangerous or autonomous capabilities
- The organization works with leading AI labs to conduct pre-deployment safety evaluations and red-teaming
- Giving What We Can features ARC Evals as a recommended or notable charity in the AI safety funding space
- Their evaluations help inform deployment decisions and safety commitments at major AI developers
- METR supports the broader ecosystem of independent, third-party AI safety auditing and evaluation
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
| Dustin Moskovitz | Person | 49.0 |
Cached Content Preview
METR (formerly called ARC Evals) is the [evaluations project](https://evals.alignment.org/) at the [Alignment Research Center](https://www.alignment.org/). Its work assesses whether cutting-edge AI systems could pose catastrophic risks to civilization.
[Website](https://evals.alignment.org/ "Website")
[Reducing global catastrophic risks](https://www.givingwhatwecan.org/cause-areas/reducing-global-catastrophic-risks)
## What problem is METR working on?
As AI systems become more powerful, it becomes increasingly important to ensure these systems are safe and aligned with our interests. A [growing number of experts](https://www.safe.ai/statement-on-ai-risk) are concerned that future AI systems pose an existential risk to humanity — and according to one study of machine learning researchers conducted by AI Impacts, the median respondent reported believing that there is a 5% chance of an “[extremely bad outcome (e.g. human extinction)](https://aiimpacts.org/how-bad-a-future-do-ml-researchers-expect/)”. One way to prepare for this is to be able to evaluate current systems and [receive warning signs if new risks emerge](https://www.deepmind.com/blog/an-early-warning-system-for-novel-ai-risks).
## What does METR do?
METR is contributing to the following AI governance approach:
1. Before a new large-scale system is released, assess whether it is capable of potentially catastrophic activities.
2. If so, require strong guarantees that the system _will not_ carry out such activities.
METR's current work focuses primarily on evaluating capabilities (the first step above), in particular a capability they call **autonomous replication** — the ability of an AI system to survive on a cloud server, obtain money and compute resources, and use those resources to make more copies of itself.
METR was given early access to [OpenAI’s](https://openai.com/) [GPT-4](https://openai.com/gpt-4) and [Anthropic’s](https://www.anthropic.com/) [Claude](https://www.anthropic.com/index/introducing-claude) to assess them for safety. They determined that these systems are capable of only “[fairly basic steps towards autonomous replication](https://evals.alignment.org/)” — but still, some of the steps they can take are already somewhat alarming. One [highly](https://gizmodo.com/gpt4-open-ai-chatbot-task-rabbit-chatgpt-1850227471) [publicised](https://www.foxbusiness.com/technology/openais-gpt-4-faked-being-blind-deceive-taskrabbit-human-helping-solve-captcha) example from METR’s assessment was that GPT-4 successfully pretended to be a vision-impaired human to convince a TaskRabbit worker to solve a CAPTCHA.
Suppose AI systems _could_ autonomously replicate: what would the risks be?
- They could become extremely powerful tools for malicious actors.
- They could replicate, accrue power and resources, and use these to further their own goals. Without guarantees these goals are aligned with our own, this could have catastrophic — potentially even existential — consequences fo
... (truncated, 8 KB total)