
Causal Scrubbing - Redwood Research

Source: redwoodresearch.org/research

Causal Scrubbing is a key methodological paper from Redwood Research that provides a rigorous framework for evaluating mechanistic interpretability hypotheses, widely referenced in the interpretability research community.

Metadata

Importance: 72/100 · organizational report · primary source

Summary

Causal Scrubbing is a methodology developed by Redwood Research for evaluating mechanistic interpretability hypotheses about neural networks. It provides a principled, algorithmic approach to test whether a proposed explanation of a model's computation is correct by systematically replacing activations and measuring the impact on model behavior. This framework helps researchers rigorously validate or falsify circuit-level interpretability claims.

Key Points

  • Introduces a formal method to test interpretability hypotheses by causally intervening on model activations to verify proposed computational graphs.
  • Addresses the lack of rigorous evaluation standards in mechanistic interpretability by providing an algorithmic, falsifiable testing framework.
  • Uses resampling ablations to check whether a proposed circuit is sufficient to explain model behavior on a given task.
  • Developed by Redwood Research as part of their broader alignment and interpretability research program.
  • Represents a significant methodological contribution to the field of mechanistic interpretability and circuit analysis in transformers.
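The core test described above can be illustrated with a toy resampling ablation. This is a minimal sketch, not the paper's implementation: the two-layer model, the `hidden` activation, and the pool of "hypothesis-equivalent" inputs are all hypothetical stand-ins. The idea is that if a hypothesis claims two inputs are treated identically by some component, swapping in that component's activation from an equivalent input should leave the output nearly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer toy model (illustrative only, not from the paper).
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 1))

def hidden(x):
    # The intermediate activation the hypothesis makes a claim about.
    return np.maximum(x @ W1, 0.0)

def forward(x, h=None):
    # Run the model, optionally patching in a replacement activation.
    if h is None:
        h = hidden(x)
    return (h @ W2).item()

def scrubbed_output(x, equivalent_pool):
    # Resampling ablation: replace the activation with one computed from a
    # different input that the hypothesis claims is interchangeable with x.
    x_alt = equivalent_pool[rng.integers(len(equivalent_pool))]
    return forward(x, h=hidden(x_alt))

x = rng.normal(size=4)
# Inputs the (toy) hypothesis treats as equivalent to x.
pool = [x + rng.normal(scale=0.01, size=4) for _ in range(16)]

baseline = forward(x)
scrubbed = np.mean([scrubbed_output(x, pool) for _ in range(100)])
gap = abs(baseline - scrubbed)
# A small gap is evidence the hypothesis explains the behavior;
# a large gap falsifies it.
print(gap)
```

In the actual methodology the resampling is structured by the full hypothesized computational graph rather than a single activation swap, but the falsification logic is the same: a correct hypothesis predicts which interventions leave behavior unchanged.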

Cited by 3 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 3 KB
- [Home](https://www.redwoodresearch.org/)
- [Research](https://www.redwoodresearch.org/research)
- [Team](https://www.redwoodresearch.org/team)
- [Careers](https://www.redwoodresearch.org/careers)
- [Blog](https://redwoodresearch.substack.com/)

# Our Research

Our technical research focuses on developing methods to ensure that AI systems act in accordance with their developers' intent, even in the face of internal misalignment.

## Highlighted Research

Our most impactful work on AI safety and security

**AI Control: Improving Safety Despite Intentional Subversion** (Featured Research)

Our research introduces and evaluates protocols designed to be robust even when AI models are trying to deceive us. Using GPT-4 as a stand-in for a potentially deceptive model, we tested strategies to detect hidden backdoors in code.

[Read the full case study](https://www.redwoodresearch.org/research/ai-control)

**Alignment Faking: When AI Models Pretend to Be Safe** (Featured Research)

We demonstrate that state-of-the-art LLMs can strategically fake alignment during training to avoid being changed, revealing a crucial challenge for AI safety. A model might pretend to follow human values while pursuing different goals.

[Read the full case study](https://www.anthropic.com/research/alignment-faking)

## Other Research

### BashArena: A Control Setting for Highly Privileged AI Agents

[Paper](https://arxiv.org/abs/2512.15688) [Website](https://www.bash-arena.com/) [Blog post](https://blog.redwoodresearch.org/p/basharena-and-control-setting-design)

### Ctrl-Z: Controlling AI Agents via Resampling

[Paper](https://arxiv.org/abs/2504.10374) [Website](https://www.bashcontrol.com/) [Blog post](https://redwoodresearch.substack.com/p/ctrl-z-controlling-ai-agents-via) [Video](https://www.youtube.com/watch?v=A3MRmG9NSKY)

### A sketch of an AI control safety case

arXiv 2024

[Paper](https://arxiv.org/abs/2501.17315)

### Stress-Testing Capability Elicitation With Password-Locked Models

NeurIPS 2024

[Paper](https://arxiv.org/abs/2405.19550) [Blog post](https://www.lesswrong.com/posts/c4sZqhqPwNKGz3fFW/paper-stress-testing-capability-elicitation-with-password)

### Preventing Language Models From Hiding Their Reasoning

[Paper](https://arxiv.org/abs/2310.18512) [Blog post](https://www.lesswrong.com/posts/9Fdd9N7Escg3tcymb/preventing-language-models-from-hiding-their-reasoning)

### Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

[Paper](https://arxiv.org/abs/2411.17693)

### Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small

ICLR 2023

[Paper](https://openreview.net/forum?id=NpsVSN6o4ul) [Blog post](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object) [Video](https://www.youtube.com/watch?v=gzwj0jWbvbo)

### Adversarial Training for High-Stakes Reliability

NeurIPS 2022

[Paper](https://arxiv.org/abs/2205.01663) [Blog post](https://www.

... (truncated, 3 KB total)