Causal Scrubbing - Redwood Research
redwoodresearch.org · redwoodresearch.org/research
Causal Scrubbing is a key methodological paper from Redwood Research that provides a rigorous framework for evaluating mechanistic interpretability hypotheses and is widely referenced in the interpretability research community.
Metadata
Importance: 72/100 · organizational report · primary source
Summary
Causal Scrubbing is a methodology developed by Redwood Research for evaluating mechanistic interpretability hypotheses about neural networks. It provides a principled, algorithmic approach to test whether a proposed explanation of a model's computation is correct by systematically replacing activations and measuring the impact on model behavior. This framework helps researchers rigorously validate or falsify circuit-level interpretability claims.
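The core loop is simple to illustrate. Below is a minimal, self-contained sketch of a resampling ablation in this spirit, using a toy NumPy model; the model, the hypothesized `claimed_units`, and helper names such as `scrubbed_behavior` are illustrative assumptions, not Redwood's implementation. Activations outside the hypothesized circuit are replaced with activations recomputed on independently resampled inputs, and the behavior metric is re-measured; the full method additionally constrains resampling to inputs the hypothesis treats as equivalent.

```python
# Illustrative resampling ablation on a toy model (NumPy only).
# All names here are hypothetical; this is not Redwood's implementation.
import numpy as np

rng = np.random.default_rng(0)

# Toy task: the target depends only on the first two input features.
def make_data(n, d=6):
    X = rng.normal(size=(n, d))
    y = X[:, 0] + X[:, 1]
    return X, y

# Toy one-hidden-layer linear "model": hidden units 0 and 1 carry the signal,
# the readout sums exactly those two units, and the other units are irrelevant.
D, H = 6, 8
W1 = rng.normal(scale=0.1, size=(D, H))
W1[0, 0], W1[1, 1] = 1.0, 1.0
W2 = np.zeros(H)
W2[0], W2[1] = 1.0, 1.0

def hidden(X):
    return X @ W1            # the "activations" we intervene on

def behavior(X, y, h=None):
    h = hidden(X) if h is None else h
    return float(np.mean((h @ W2 - y) ** 2))   # behavior metric: MSE

def scrubbed_behavior(X, y, claimed_units):
    """Replace the activations of every hidden unit *not* in the hypothesized
    circuit with activations recomputed on independently resampled inputs,
    then re-measure the behavior metric."""
    h = hidden(X)
    h_other = hidden(X[rng.permutation(len(X))])   # activations from resampled inputs
    scrubbed = h.copy()
    outside = [u for u in range(H) if u not in claimed_units]
    scrubbed[:, outside] = h_other[:, outside]
    return behavior(X, y, h=scrubbed)

X, y = make_data(2000)
print("original MSE:                  ", behavior(X, y))
print("scrubbed MSE, hypothesis {0,1}:", scrubbed_behavior(X, y, {0, 1}))
print("scrubbed MSE, hypothesis {4,5}:", scrubbed_behavior(X, y, {4, 5}))
```

With the correct hypothesis ({0, 1}) the scrubbed metric matches the original; claiming the wrong units destroys performance, and that gap is the falsifiable signal the key points below refer to.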
Key Points
- Introduces a formal method to test interpretability hypotheses by causally intervening on model activations to verify proposed computational graphs.
- Addresses the lack of rigorous evaluation standards in mechanistic interpretability by providing an algorithmic, falsifiable testing framework.
- Uses resampling ablations to check whether a proposed circuit is sufficient to explain model behavior on a given task.
- Developed by Redwood Research as part of their broader alignment and interpretability research program.
- Represents a significant methodological contribution to the field of mechanistic interpretability and circuit analysis in transformers.
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| Technical AI Safety Research | Crux | 66.0 |
| AI Model Steganography | Risk | 91.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 3 KB
# Our Research
Our technical research focuses on developing methods to ensure that AI systems act in accordance with their developers' intent, even in the face of internal misalignment.
## Highlighted Research
Our most impactful work on AI safety and security
### Featured Research: AI Control
Improving Safety Despite Intentional Subversion

Our research introduces and evaluates protocols designed to be robust even when AI models are trying to deceive us. Using GPT-4 as a stand-in for a potentially deceptive model, we tested strategies to detect hidden backdoors in code.

[Read the full case study](https://www.redwoodresearch.org/research/ai-control)

### Featured Research: Alignment Faking
When AI Models Pretend to Be Safe

We demonstrate that state-of-the-art LLMs can strategically fake alignment during training to avoid being changed, revealing a crucial challenge for AI safety. A model might pretend to follow human values while pursuing different goals.

[Read the full case study](https://www.anthropic.com/research/alignment-faking)
## Other Research
### BashArena: A Control Setting for Highly Privileged AI Agents
[Paper](https://arxiv.org/abs/2512.15688) [Website](https://www.bash-arena.com/) [Blog post](https://blog.redwoodresearch.org/p/basharena-and-control-setting-design)
### Ctrl-Z: Controlling AI Agents via Resampling
[Paper](https://arxiv.org/abs/2504.10374) [Website](https://www.bashcontrol.com/) [Blog post](https://redwoodresearch.substack.com/p/ctrl-z-controlling-ai-agents-via) [Video](https://www.youtube.com/watch?v=A3MRmG9NSKY)
### A sketch of an AI control safety case
arXiv 2024
[Paper](https://arxiv.org/abs/2501.17315)
### Stress-Testing Capability Elicitation With Password-Locked Models
NeurIPS 2024
[Paper](https://arxiv.org/abs/2405.19550) [Blog post](https://www.lesswrong.com/posts/c4sZqhqPwNKGz3fFW/paper-stress-testing-capability-elicitation-with-password)
### Preventing Language Models From Hiding Their Reasoning
[Paper](https://arxiv.org/abs/2310.18512) [Blog post](https://www.lesswrong.com/posts/9Fdd9N7Escg3tcymb/preventing-language-models-from-hiding-their-reasoning)
### Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
[Paper](https://arxiv.org/abs/2411.17693)
### Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
ICLR 2023
[Paper](https://openreview.net/forum?id=NpsVSN6o4ul) [Blog post](https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object) [Video](https://www.youtube.com/watch?v=gzwj0jWbvbo)
### Adversarial Training for High-Stakes Reliability
NeurIPS 2022
[Paper](https://arxiv.org/abs/2205.01663) [Blog post](https://www.
... (truncated, 3 KB total)
Resource ID: d42c3c74354e7b66 | Stable ID: NjQ3OTAxYz