Causal Scrubbing - Redwood Research
redwoodresearch.org · redwoodresearch.org/research
Causal Scrubbing is a key methodological paper from Redwood Research that provides a rigorous framework for evaluating mechanistic interpretability hypotheses and is widely referenced in the interpretability research community.
Metadata
Importance: 72/100 · organizational report · primary source
Summary
Causal Scrubbing is a methodology developed by Redwood Research for evaluating mechanistic interpretability hypotheses about neural networks. It provides a principled, algorithmic approach to test whether a proposed explanation of a model's computation is correct by systematically replacing activations and measuring the impact on model behavior. This framework helps researchers rigorously validate or falsify circuit-level interpretability claims.
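The core loop is simple to illustrate. Below is a minimal, self-contained sketch of a resampling ablation in this spirit, using a toy NumPy model; the model, the hypothesized `claimed_units`, and helper names such as `scrubbed_behavior` are illustrative assumptions, not Redwood's implementation. Activations outside the hypothesized circuit are replaced with activations recomputed on independently resampled inputs, and the behavior metric is re-measured; the full method additionally constrains resampling to inputs the hypothesis treats as equivalent.

```python
# Illustrative resampling ablation on a toy model (NumPy only).
# All names here are hypothetical; this is not Redwood's implementation.
import numpy as np

rng = np.random.default_rng(0)

# Toy task: the target depends only on the first two input features.
def make_data(n, d=6):
    X = rng.normal(size=(n, d))
    y = X[:, 0] + X[:, 1]
    return X, y

# Toy one-hidden-layer linear "model": hidden units 0 and 1 carry the signal,
# the readout sums exactly those two units, and the other units are irrelevant.
D, H = 6, 8
W1 = rng.normal(scale=0.1, size=(D, H))
W1[0, 0], W1[1, 1] = 1.0, 1.0
W2 = np.zeros(H)
W2[0], W2[1] = 1.0, 1.0

def hidden(X):
    return X @ W1            # the "activations" we intervene on

def behavior(X, y, h=None):
    h = hidden(X) if h is None else h
    return float(np.mean((h @ W2 - y) ** 2))   # behavior metric: MSE

def scrubbed_behavior(X, y, claimed_units):
    """Replace the activations of every hidden unit *not* in the hypothesized
    circuit with activations recomputed on independently resampled inputs,
    then re-measure the behavior metric."""
    h = hidden(X)
    h_other = hidden(X[rng.permutation(len(X))])   # activations from resampled inputs
    scrubbed = h.copy()
    outside = [u for u in range(H) if u not in claimed_units]
    scrubbed[:, outside] = h_other[:, outside]
    return behavior(X, y, h=scrubbed)

X, y = make_data(2000)
print("original MSE:                  ", behavior(X, y))
print("scrubbed MSE, hypothesis {0,1}:", scrubbed_behavior(X, y, {0, 1}))
print("scrubbed MSE, hypothesis {4,5}:", scrubbed_behavior(X, y, {4, 5}))
```

With the correct hypothesis ({0, 1}) the scrubbed metric matches the original; claiming the wrong units destroys performance, and that gap is the falsifiable signal the key points below refer to.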
Key Points
- Introduces a formal method to test interpretability hypotheses by causally intervening on model activations to verify proposed computational graphs.
- Addresses the lack of rigorous evaluation standards in mechanistic interpretability by providing an algorithmic, falsifiable testing framework.
- Uses resampling ablations to check whether a proposed circuit is sufficient to explain model behavior on a given task.
- Developed by Redwood Research as part of their broader alignment and interpretability research program.
- Represents a significant methodological contribution to the field of mechanistic interpretability and circuit analysis in transformers.
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| Technical AI Safety Research | Crux | 66.0 |
| AI Model Steganography | Risk | 91.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 3 KB
# Our Research
Our technical research focuses on developing methods to ensure that AI systems act in accordance with their developers' intent, even in the face of internal misalignment.
## Highlighted Research
Our most impactful work on AI safety and security
### Featured Research: AI Control
Improving Safety Despite Intentional Subversion

Our research introduces and evaluates protocols designed to be robust even when AI models are trying to deceive us. Using GPT-4 as a stand-in for a potentially deceptive model, we tested strategies to detect hidden backdoors in code.

[Read the full case study](https://www.redwoodresearch.org/research/ai-control)

### Featured Research: Alignment Faking
When AI Models Pretend to Be Safe

We demonstrate that state-of-the-art LLMs can strategically fake alignment during training to avoid being changed, revealing a crucial challenge for AI safety. A model might pretend to follow human values while pursuing different goals.

[Read the full case study](https://www.anthropic.com/research/alignment-faking)
## Other Research
### BashArena: A Control Setting for Highly Privileged AI Agents
[Paper](https://arxiv.org/abs/2512.15688) [Website](https://www.bash-arena.com/) [Blog post](https://blog.redwoodresearch.org/p/basharena-and-control-setting-design)
### Ctrl-Z: Controlling AI Agents via Resampling
[Paper](https://arxiv.org/abs/2504.10374) [Website](https://www.bashcontrol.com/) [Blog post](https://redwoodresearch.substack.com/p/ctrl-z-controlling-ai-agents-via) [Video](https://www.youtube.com/watch?v=A3MRmG9NSKY)
### A sketch of an AI control safety case
arXiv 2024
[Paper](https://arxiv.org/abs/2501.17315)
### Stress-Testing Capability Elicitation With Password-Locked Models
NeurIPS 2024
[Paper](https://arxiv.org/abs/2405.19550) [Blog post](https://www.lesswrong.com/posts/c4sZqhqPwNKGz3fFW/paper-stress-testing-capability-elicitation-with-password)
### Preventing Language Models From Hiding Their Reasoning
[Paper](https://arxiv.org/abs/2310.18512) [Blog post](https://www.lesswrong.com/posts/9Fdd9N7Escg3tcymb/preventing-language-models-from-hiding-their-reasoning)
### Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
[Paper](https://arxiv.org/abs/2411.17693)
### Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
ICLR 2023
[Paper](https://openreview.net/forum?id=NpsVSN6o4ul) [Blog post](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object) [Video](https://www.youtube.com/watch?v=gzwj0jWbvbo)
### Adversarial Training for High-Stakes Reliability
NeurIPS 2022
[Paper](https://arxiv.org/abs/2205.01663) [Blog post](https://www.
... (truncated, 3 KB total)
Resource ID: d42c3c74354e7b66 | Stable ID: NjQ3OTAxYz