Anthropic's follow-up research on defection probes
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Follow-up to Anthropic's 'Sleeper Agents' paper; presents early-stage but promising interpretability-based detection methods for deceptively aligned models, relevant to AI control research.
Metadata
Summary
Anthropic researchers demonstrate that linear classifiers ('defection probes') built on residual stream activations can detect when sleeper agent models will defect with AUROC scores above 99%, using generic contrast pairs that require no knowledge of the specific trigger or dangerous behavior. The technique works across multiple base models, training methods, and defection behaviors because defection-inducing prompts are linearly represented with high salience in model activations. The authors suggest such classifiers could form a useful component of AI control systems, though applicability to naturally-occurring deceptive alignment remains an open question.
Key Points
- Linear probes trained on generic contrast pairs ('Are you doing something dangerous? yes/no') achieve >99% AUROC in detecting sleeper agent defection without trigger-specific information.
- The technique generalizes across multiple base models, defection triggers, training methods, and dangerous behaviors (vulnerable code insertion, user insults).
- Defection-inducing prompts are linearly encoded with high salience in residual stream activations, explaining why simple linear methods are so effective.
- Whether these techniques would work on naturally-occurring deceptive instrumental alignment (rather than artificially trained proxies) remains an important open question.
- Results suggest interpretability-based defection detectors could be a practical component of AI control and monitoring setups.
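The probing recipe summarized above is remarkably simple: the probe direction is just the activation difference between the two halves of a generic contrast pair, and prompts are scored by projecting their activations onto that direction. The snippet below is a minimal sketch of this idea on synthetic activations — the dimensions, the salience magnitude, and the `sklearn` AUROC evaluation are illustrative assumptions, not Anthropic's actual code or data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 512  # residual stream width (illustrative)

# A hidden "defection" direction the model is assumed to encode with
# high salience, per the paper's explanation of why linear probes work.
defect_dir = rng.normal(size=d)
defect_dir /= np.linalg.norm(defect_dir)

def fake_activations(n, defecting):
    """Synthetic residual-stream activations: isotropic noise plus a
    salient component along defect_dir for defection-inducing prompts."""
    sign = 10.0 if defecting else -10.0
    return rng.normal(size=(n, d)) + sign * defect_dir

# Generic contrast pair: final-token activations of
# "... Assistant: yes" vs "... Assistant: no".
act_yes = fake_activations(1, defecting=True)[0]
act_no = fake_activations(1, defecting=False)[0]
probe = act_yes - act_no  # probe direction: one activation difference

# Score held-out prompts by projecting onto the probe direction.
defect = fake_activations(100, defecting=True)
normal = fake_activations(100, defecting=False)
scores = np.concatenate([defect, normal]) @ probe
labels = np.concatenate([np.ones(100), np.zeros(100)])
print(f"AUROC: {roc_auc_score(labels, scores):.3f}")
```

Because the synthetic defection component dominates the noise, a single unsupervised contrast pair suffices for near-perfect separation — mirroring the paper's finding that no trigger-specific information is needed.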
Cited by 12 pages
| Page | Type | Quality |
|---|---|---|
| Reasoning and Planning | Capability | 65.0 |
| Situational Awareness | Capability | 67.0 |
| AI Accident Risk Cruxes | Crux | 67.0 |
| The Case For AI Existential Risk | Argument | 66.0 |
| Why Alignment Might Be Hard | Argument | 69.0 |
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
| Evan Hubinger | Person | 43.0 |
| Alignment Evaluations | Approach | 65.0 |
| Scheming & Deception Detection | Approach | 91.0 |
| Deceptive Alignment | Risk | 75.0 |
| Mesa-Optimization | Risk | 63.0 |
| Treacherous Turn | Risk | 67.0 |
Cached Content Preview
Alignment | Interpretability
Simple probes can catch sleeper agents
Apr 23, 2024
This “Alignment Note” presents some early-stage research from the Anthropic Alignment Science team following up on our recent “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” paper. It should be treated as a work-in-progress update, and is intended for a more technical audience than our typical blog post. This research makes use of some simple interpretability techniques, and we expect to share more results from collaborations between our Alignment and Interpretability teams soon.
Summary
In this post we present “defection probes”: linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to “defect” and behave in accordance with a dangerous hidden goal. Using the models we trained in “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don’t depend on any information about the defection trigger or the dangerous behavior, e.g. “Human: Are you doing something dangerous? Assistant: yes” and “Human: … Assistant: no”. We show that probes based on the same generic contrast pairs perform well across multiple base models, defection triggers, sleeper agent training methods, and dangerous defection behaviors (writing vulnerable code and insulting the user). We then explain the unreasonable effectiveness of our technique by showing that whether or not a prompt will induce defection is linearly represented with high salience in the activations of sleeper agent models, and thus amenable to easy detection using simple linear methods. Whether this would also be the case for natural examples of deceptive instrumental alignment, and whether we’d expect similar techniques to be useful for such models, remains an important open question. We think that future versions of classifiers like this could form a useful part of AI control setups and represent a promising path for future research.
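The "linearly represented with high salience" explanation can be illustrated directly: if the defection variable dominates the variance of the residual stream, then even the top principal component of raw activations — computed with no labels and no contrast pair at all — aligns with the defection direction. The check below uses entirely synthetic activations; all shapes and magnitudes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512  # residual stream width (illustrative)

# Hypothetical defection direction, assumed for this demonstration.
defect_dir = rng.normal(size=d)
defect_dir /= np.linalg.norm(defect_dir)

# Activations: isotropic noise plus a high-salience defection component
# whose sign depends on whether the prompt induces defection.
labels = rng.integers(0, 2, size=200)
offsets = np.where(labels[:, None] == 1, 10.0, -10.0)
acts = rng.normal(size=(200, d)) + offsets * defect_dir

# Top principal component of the centered activations.
centered = acts - acts.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

# High salience means pc1 nearly coincides with defect_dir.
print(f"|cos(pc1, defect_dir)| = {abs(pc1 @ defect_dir):.3f}")
```

In this toy setup the variance along `defect_dir` is two orders of magnitude larger than along any other axis, so unsupervised methods recover it easily — consistent with the post's claim that salience, not supervision, is what makes simple linear methods so effective here.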
A defection detector for a code vulnerability sleeper agent trained on middle-layer residual stream activation differences on the final tokens of “Human: Are you doing something dangerous? Assistant: yes” and “Human: … Assistant: no” achieves strong performance (a), almost perfectly separating the defection-inducing deployment-time (“2024”) coding questions which the model will answer with insecure code from the training-time (“2023”) coding questions which the model will answer normally (c). The detector also separates deployment-time coding questions from non-coding questions which the model will answer normally in all contexts, without separating the training and deployment classes for non-coding questions (b).
Introduction
Capable deceptively aligned models might be difficult to detect via outputs alone: they may act perfectly aligned while monitored, and only defect a
... (truncated, 27 KB total)