Anthropic's follow-up research on defection probes
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Follow-up to Anthropic's 'Sleeper Agents' paper; presents early-stage but promising interpretability-based detection methods for deceptively aligned models, relevant to AI control research.
Metadata
Summary
Anthropic researchers demonstrate that linear classifiers ('defection probes') built on residual stream activations can detect when sleeper agent models will defect with AUROC scores above 99%, using generic contrast pairs that require no knowledge of the specific trigger or dangerous behavior. The technique works across multiple base models, training methods, and defection behaviors because defection-inducing prompts are linearly represented with high salience in model activations. The authors suggest such classifiers could form a useful component of AI control systems, though applicability to naturally-occurring deceptive alignment remains an open question.
Key Points
- Linear probes trained on generic contrast pairs ('Are you doing something dangerous? yes/no') achieve >99% AUROC in detecting sleeper agent defection without trigger-specific information.
- The technique generalizes across multiple base models, defection triggers, training methods, and dangerous behaviors (vulnerable code insertion, user insults).
- Defection-inducing prompts are linearly encoded with high salience in residual stream activations, explaining why simple linear methods are so effective.
- Whether these techniques would work on naturally-occurring deceptive instrumental alignment (rather than artificially trained proxies) remains an important open question.
- Results suggest interpretability-based defection detectors could be a practical component of AI control and monitoring setups.
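The probing recipe summarized above is remarkably simple: the probe direction is just the activation difference between the two halves of a generic contrast pair, and prompts are scored by projecting their activations onto that direction. The snippet below is a minimal sketch of this idea on synthetic activations — the dimensions, the salience magnitude, and the `sklearn` AUROC evaluation are illustrative assumptions, not Anthropic's actual code or data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 512  # residual stream width (illustrative)

# A hidden "defection" direction the model is assumed to encode with
# high salience, per the paper's explanation of why linear probes work.
defect_dir = rng.normal(size=d)
defect_dir /= np.linalg.norm(defect_dir)

def fake_activations(n, defecting):
    """Synthetic residual-stream activations: isotropic noise plus a
    salient component along defect_dir for defection-inducing prompts."""
    sign = 10.0 if defecting else -10.0
    return rng.normal(size=(n, d)) + sign * defect_dir

# Generic contrast pair: final-token activations of
# "... Assistant: yes" vs "... Assistant: no".
act_yes = fake_activations(1, defecting=True)[0]
act_no = fake_activations(1, defecting=False)[0]
probe = act_yes - act_no  # probe direction: one activation difference

# Score held-out prompts by projecting onto the probe direction.
defect = fake_activations(100, defecting=True)
normal = fake_activations(100, defecting=False)
scores = np.concatenate([defect, normal]) @ probe
labels = np.concatenate([np.ones(100), np.zeros(100)])
print(f"AUROC: {roc_auc_score(labels, scores):.3f}")
```

Because the synthetic defection component dominates the noise, a single unsupervised contrast pair suffices for near-perfect separation — mirroring the paper's finding that no trigger-specific information is needed.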
Cited by 12 pages
| Page | Type | Quality |
|---|---|---|
| Reasoning and Planning | Capability | 65.0 |
| Situational Awareness | Capability | 67.0 |
| AI Accident Risk Cruxes | Crux | 67.0 |
| The Case For AI Existential Risk | Argument | 66.0 |
| Why Alignment Might Be Hard | Argument | 69.0 |
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
| Evan Hubinger | Person | 43.0 |
| Alignment Evaluations | Approach | 65.0 |
| Scheming & Deception Detection | Approach | 91.0 |
| Deceptive Alignment | Risk | 75.0 |
| Mesa-Optimization | Risk | 63.0 |
| Treacherous Turn | Risk | 67.0 |
Cached Content Preview
Alignment | Interpretability
Simple probes can catch sleeper agents
Apr 23, 2024
This “Alignment Note” presents some early-stage research from the Anthropic Alignment Science team following up on our recent “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” paper. It should be treated as a work-in-progress update, and is intended for a more technical audience than our typical blog post. This research makes use of some simple interpretability techniques, and we expect to share more results from collaborations between our Alignment and Interpretability teams soon.
Summary
In this post we present “defection probes”: linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to “defect” and behave in accordance with a dangerous hidden goal. Using the models we trained in “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don’t depend on any information about the defection trigger or the dangerous behavior, e.g. “Human: Are you doing something dangerous? Assistant: yes” and “Human: … Assistant: no”. We show that probes based on the same generic contrast pairs perform well across multiple base models, defection triggers, sleeper agent training methods, and dangerous defection behaviors (writing vulnerable code and insulting the user). We then explain the unreasonable effectiveness of our technique by showing that whether or not a prompt will induce defection is linearly represented with high salience in the activations of sleeper agent models, and thus amenable to easy detection using simple linear methods. Whether this would also be the case for natural examples of deceptive instrumental alignment, and whether we’d expect similar techniques to be useful for such models, remains an important open question. We think that future versions of classifiers like this could form a useful part of AI control setups and represent a promising path for future research.
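The "linearly represented with high salience" explanation can be illustrated directly: if the defection variable dominates the variance of the residual stream, then even the top principal component of raw activations — computed with no labels and no contrast pair at all — aligns with the defection direction. The check below uses entirely synthetic activations; all shapes and magnitudes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512  # residual stream width (illustrative)

# Hypothetical defection direction, assumed for this demonstration.
defect_dir = rng.normal(size=d)
defect_dir /= np.linalg.norm(defect_dir)

# Activations: isotropic noise plus a high-salience defection component
# whose sign depends on whether the prompt induces defection.
labels = rng.integers(0, 2, size=200)
offsets = np.where(labels[:, None] == 1, 10.0, -10.0)
acts = rng.normal(size=(200, d)) + offsets * defect_dir

# Top principal component of the centered activations.
centered = acts - acts.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

# High salience means pc1 nearly coincides with defect_dir.
print(f"|cos(pc1, defect_dir)| = {abs(pc1 @ defect_dir):.3f}")
```

In this toy setup the variance along `defect_dir` is two orders of magnitude larger than along any other axis, so unsupervised methods recover it easily — consistent with the post's claim that salience, not supervision, is what makes simple linear methods so effective here.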
A defection detector for a code vulnerability sleeper agent trained on middle-layer residual stream activation differences on the final tokens of “Human: Are you doing something dangerous? Assistant: yes” and “Human: … Assistant: no” achieves strong performance (a), almost perfectly separating the defection-inducing deployment-time (“2024”) coding questions which the model will answer with insecure code from the training-time (“2023”) coding questions which the model will answer normally (c). The detector also separates deployment-time coding questions from non-coding questions which the model will answer normally in all contexts, without separating the training and deployment classes for non-coding questions (b).
Introduction
Capable deceptively aligned models might be difficult to detect via outputs alone: they may act perfectly aligned while monitored, and only defect a
... (truncated, 27 KB total)