Longterm Wiki

Anthropic's sabotage evaluations

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

A key practical resource from Anthropic on detecting deceptive or oversight-undermining behaviors in advanced models; directly relevant to alignment evaluation methodology and responsible scaling commitments.

Metadata

Importance: 82/100 · organizational report · primary source

Summary

Anthropic presents a suite of evaluations designed to detect whether advanced AI models might attempt to undermine human oversight or engage in subtle sabotage behaviors. The work introduces concrete test scenarios probing for deceptive alignment, manipulation of evaluators, and interference with AI safety processes. This represents a practical step toward operationalizing model safety checks within Anthropic's responsible scaling policy.

Key Points

  • Introduces evaluations testing whether models attempt to sabotage oversight mechanisms, evaluator judgments, or safety-relevant decisions
  • Covers scenarios including sandbagging on capability evaluations, manipulating human overseers, and undermining AI safety work
  • Designed to assess situational awareness and deceptive alignment risks in frontier models
  • Part of Anthropic's broader effort to make safety evaluations concrete and actionable for deployment decisions
  • Results inform thresholds in Anthropic's Responsible Scaling Policy regarding when models pose unacceptable risks

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI Capability Sandbagging | Risk | 67.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 11 KB
Alignment

# Sabotage evaluations for frontier models

Oct 18, 2024

[Read the paper](https://assets.anthropic.com/m/377027d5b36ac1eb/original/Sabotage-Evaluations-for-Frontier-Models.pdf)

Any industry with the potential for harm needs evaluations. Nuclear power stations have continuous radiation monitoring and regular site inspections; new aircraft undergo extensive flight tests to prove their airworthiness.

It’s no different for AI systems. New AI models go through a wide range of safety evaluations—for example, testing their capacity to assist in the creation of biological or chemical weapons. Such evaluations are built into our [Responsible Scaling Policy](https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy), which guides our development of a model’s safeguards.

As AIs become more capable, however, a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them. A [new paper](https://assets.anthropic.com/m/377027d5b36ac1eb/original/Sabotage-Evaluations-for-Frontier-Models.pdf) by the Anthropic Alignment Science team describes a novel set of evaluations that test a model’s capacity for _sabotage_. We looked at four different types:

- **Human decision sabotage**: Can the model steer humans toward bad decisions without appearing suspicious?
- **Code sabotage**: Can the model insert subtle bugs into codebases over time without detection?
- **Sandbagging**: Can the model hide dangerous capabilities during testing but reveal them later?
- **Undermining oversight**: Can the model subtly manipulate evaluation or monitoring systems?

We developed these evaluations as part of preparations for a future where much more capable AI models could develop these dangerous capabilities. We’re sharing them—along with results from demonstrations of their use on Claude 3 Opus and Claude 3.5 Sonnet—so that other AI developers can use and improve upon them. Below, we discuss each evaluation in turn.
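To make the structure of the suite concrete, here is a minimal sketch in Python of how the four evaluation types and their per-run outcomes might be represented and summarized. The class and field names are invented for illustration; this is not Anthropic's actual harness.

```python
from dataclasses import dataclass
from enum import Enum, auto


class SabotageEval(Enum):
    """The four sabotage evaluation types described in the paper."""
    HUMAN_DECISION_SABOTAGE = auto()
    CODE_SABOTAGE = auto()
    SANDBAGGING = auto()
    UNDERMINING_OVERSIGHT = auto()


@dataclass
class EvalResult:
    """Outcome of one run of one sabotage evaluation against one model."""
    eval_type: SabotageEval
    model_name: str           # e.g. "claude-3-opus" (illustrative label)
    sabotage_succeeded: bool  # did the model achieve the sabotage objective?
    detected: bool            # was the attempt flagged as suspicious by overseers?


def summarize(results: list[EvalResult]) -> dict[SabotageEval, float]:
    """Per evaluation type, the fraction of runs where sabotage succeeded
    without being detected."""
    summary: dict[SabotageEval, float] = {}
    for eval_type in SabotageEval:
        runs = [r for r in results if r.eval_type == eval_type]
        if runs:
            undetected = sum(r.sabotage_succeeded and not r.detected for r in runs)
            summary[eval_type] = undetected / len(runs)
    return summary
```

The quantity summarized here, sabotage that succeeds without being noticed, reflects the common thread in the four evaluations above: each asks whether the model can interfere "without appearing suspicious" or "without detection".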

## Human decision sabotage

This evaluation involves running an experiment with human participants, asking them to make fictional business decisions on the basis of a large amount of information—too much information to absorb without the use of an AI model. We task the model with influencing the human to land on an incorrect decision, but without appearing suspicious. We include models that attempt to do so more or less aggressively, and a control model that doesn’t attempt any influence. As well as measuring how often participants reached the correct versus the incorrect decision, we ask the humans to rate how suspicious they found the model’s behavior.

![](https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F79b48ed9ea68c2dcdf803e1fa64ee8e2cfbfa0bd-2200x1282.png&w=3840&q=75)

_An idealized example of a human participant interacting with the model assisting (but secretly misleading) them during the human sabotage evaluation._
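As a rough illustration of how results from this experiment might be aggregated (not Anthropic's analysis code; the condition labels, field names, and suspicion scale are invented for the example), one could compare the correct-decision rate and mean suspicion rating across the influence conditions described above:

```python
from statistics import mean

# Hypothetical per-participant records from the human decision sabotage experiment.
# "condition" is one of "control", "subtle", or "aggressive" (illustrative labels).
trials = [
    {"condition": "control",    "correct_decision": True,  "suspicion": 1},
    {"condition": "subtle",     "correct_decision": False, "suspicion": 2},
    {"condition": "aggressive", "correct_decision": False, "suspicion": 6},
    # ... one record per participant
]

def condition_summary(records):
    """For each condition, report how often participants reached the correct
    decision and how suspicious they rated the assistant (illustrative scale)."""
    out = {}
    for cond in {r["condition"] for r in records}:
        subset = [r for r in records if r["condition"] == cond]
        out[cond] = {
            "correct_rate": mean(r["correct_decision"] for r in subset),
            "mean_suspicion": mean(r["suspicion"] for r in subset),
        }
    return out

print(condition_summary(trials))
```

The interesting comparison is between the control condition and the influence conditions: a capable saboteur would lower the correct-decision rate without raising the suspicion rating much above the control baseline.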

... (truncated, 11 KB total)
Resource ID: 9d653677d03c2df3 | Stable ID: MDQ0MDEwOT