Longterm Wiki

open-source automated interpretability

web

This EleutherAI blog post and associated codebase provide an open-source alternative to closed-lab automated interpretability pipelines, relevant for researchers studying how to understand internal representations of large language models via sparse autoencoders.

Metadata

Importance: 62/100 · blog post · tool

Summary

EleutherAI introduces an open-source pipeline for automated interpretability of neural network features, particularly targeting sparse autoencoder (SAE) features. The project automates the process of generating natural language explanations for model internals and scoring their quality, making mechanistic interpretability research more scalable and accessible. It builds on prior work like OpenAI's automated interpretability but releases tooling publicly.

Key Points

  • Releases an open-source library for automated generation and evaluation of natural language explanations for SAE features in language models.
  • Automates the interpretability pipeline: detecting active features, generating explanations via LLMs, and scoring explanation quality.
  • Aims to democratize mechanistic interpretability research by providing scalable, reproducible tooling outside of closed labs.
  • Benchmarks explanation quality using detection and fuzzing scoring methods to assess whether explanations accurately capture feature behavior.
  • Supports integration with sparse autoencoders trained on open models, enabling community-driven interpretability research.
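The three pipeline stages above can be sketched in plain Python. This is a minimal, hypothetical sketch: the function names, prompt wording, and `encode` callback are assumptions for illustration, not the library's actual API.

```python
# Hypothetical sketch of the three pipeline stages: collecting activating
# examples, generating an explanation with an LLM, and detection scoring.
# Names and prompts are illustrative, not the sae-auto-interp API.

def top_activating_examples(feature_id, corpus, encode, k=10):
    """Rank corpus snippets by how strongly they activate one SAE feature."""
    scored = [(encode(text)[feature_id], text) for text in corpus]
    scored.sort(reverse=True)
    return [text for activation, text in scored[:k] if activation > 0]

def explain_feature(examples, ask_llm):
    """Ask an explainer LLM for a one-sentence description of the feature."""
    prompt = "These text snippets all activate the same feature:\n"
    prompt += "\n".join(f"- {e}" for e in examples)
    prompt += "\nIn one sentence, what does this feature detect?"
    return ask_llm(prompt)

def detection_score(explanation, positives, negatives, ask_llm):
    """Detection scoring: given only the explanation, can a judge LLM tell
    activating snippets (positives) from non-activating ones (negatives)?"""
    labeled = [(t, True) for t in positives] + [(t, False) for t in negatives]
    correct = 0
    for text, label in labeled:
        verdict = ask_llm(f"Explanation: {explanation}\nText: {text}\n"
                          "Does the text match the explanation? Answer yes or no.")
        correct += verdict.strip().lower().startswith("yes") == label
    return correct / len(labeled)
```

In practice `encode` would run the model and SAE to produce feature activations, and `ask_llm` would wrap an API client; here they are left as callbacks so the control flow stays visible.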

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Interpretability | Research Area | 66.0 |
| Sparse Autoencoders (SAEs) | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 36 KB
Table of Contents

- [Background](https://blog.eleuther.ai/autointerp/#background)
- [Key Findings](https://blog.eleuther.ai/autointerp/#key-findings)
- [Generating Explanations](https://blog.eleuther.ai/autointerp/#generating-explanations)
- [Scoring explanations](https://blog.eleuther.ai/autointerp/#scoring-explanations)
- [Results](https://blog.eleuther.ai/autointerp/#results)
- [Explainers](https://blog.eleuther.ai/autointerp/#explainers)
  - [How does the explainer model size affect explanation quality?](https://blog.eleuther.ai/autointerp/#how-does-the-explainer-model-size-affect-explanation-quality)
  - [Providing more information to the explainer](https://blog.eleuther.ai/autointerp/#providing-more-information-to-the-explainer)
  - [Giving the explainer different samples of top activating examples](https://blog.eleuther.ai/autointerp/#giving-the-explainer-different-samples-of-top-activating-examples)
  - [Visualizing activation distributions](https://blog.eleuther.ai/autointerp/#visualizing-activation-distributions)
- [Scorers](https://blog.eleuther.ai/autointerp/#scorers)
  - [How do methods correlate with simulation?](https://blog.eleuther.ai/autointerp/#how-do-methods-correlate-with-simulation)
  - [How does scorer model size affect scores?](https://blog.eleuther.ai/autointerp/#how-does-scorer-model-size-affect-scores)
  - [How much more scalable is detection/fuzzing?](https://blog.eleuther.ai/autointerp/#how-much-more-scalable-is-detectionfuzzing)
- [Filtering with known heuristics](https://blog.eleuther.ai/autointerp/#filtering-with-known-heuristics)
  - [Positional Features](https://blog.eleuther.ai/autointerp/#positional-features)
  - [Unigram features](https://blog.eleuther.ai/autointerp/#unigram-features)
- [Sparse Feature Circuits](https://blog.eleuther.ai/autointerp/#sparse-feature-circuits)
- [Future Directions](https://blog.eleuther.ai/autointerp/#future-directions)
- [Appendix](https://blog.eleuther.ai/autointerp/#appendix)

## Background [\#](https://blog.eleuther.ai/autointerp/\#background)

Sparse autoencoders recover a diversity of interpretable, monosemantic features, but labeling those features by hand is intractable at scale. We investigate different techniques for generating and scoring arbitrary text explanations of SAE features, and release an open-source library to allow people to do research on auto-interpreted features.
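As one illustration of what "scoring" can mean here, below is a minimal sketch of fuzzing-style scoring under my reading of the post: activating tokens are marked inline, in roughly half the examples random tokens are marked instead, and a judge LLM must decide whether the marking is genuine given only the explanation. The marking scheme, prompt, and names are assumptions, not the library's implementation.

```python
import random

# Sketch of fuzzing-style scoring: tokens that activate the feature are
# marked with << >>; in ~half the examples random tokens are marked instead,
# and a judge LLM says whether the marking matches the explanation.

def mark_tokens(tokens, indices):
    return " ".join(f"<<{t}>>" if i in indices else t
                    for i, t in enumerate(tokens))

def fuzzing_score(explanation, examples, ask_llm, rng=random):
    """examples: list of (tokens, activating_token_indices) pairs."""
    correct = 0
    for tokens, true_idx in examples:
        genuine = rng.random() < 0.5
        idx = true_idx if genuine else {rng.randrange(len(tokens))
                                        for _ in true_idx}
        verdict = ask_llm(
            f"Explanation: {explanation}\n"
            f"Marked example: {mark_tokens(tokens, idx)}\n"
            "Are the marked tokens the ones the explanation describes? "
            "Answer yes or no.")
        correct += verdict.strip().lower().startswith("yes") == genuine
    return correct / len(examples)
```

Unlike simulation-based scoring, each judgment here is a single yes/no call rather than a per-token activation prediction, which is what makes this style of scoring cheap.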

## Key Findings [\#](https://blog.eleuther.ai/autointerp/\#key-findings)

- Open source models generate and evaluate text explanations of SAE features reasonably well, albeit somewhat worse than closed models like Claude 3.5 Sonnet.

- Explanations found by LLMs are similar to those found by humans.

- Automatically interpreting 1.5M features of GPT-2 with the current pipeline would cost $1300 in API calls to Llama 3.1 or $8500 with Claude 3.5 Sonnet. Prior methods cost ~$200k.

- Code can be found at [https://github.com/EleutherAI/sae-auto-interp](https://github.com/EleutherAI/sae-auto-int
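For scale, the quoted totals work out to well under a tenth of a cent per feature. A quick check of the arithmetic, using only the figures stated above:

```python
# Back-of-the-envelope check of the cost figures quoted above.
features = 1_500_000  # SAE features of GPT-2 to interpret

for method, total_usd in [("Llama 3.1", 1_300),
                          ("Claude 3.5 Sonnet", 8_500),
                          ("prior simulation-based methods", 200_000)]:
    print(f"{method}: ${total_usd / features:.4f} per feature")

# Relative saving of the cheapest option over prior methods.
print(f"Llama 3.1 vs prior methods: {200_000 / 1_300:.0f}x cheaper")
```

That is roughly $0.0009 per feature with Llama 3.1 versus about $0.13 per feature with prior methods, a ~150x reduction.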

... (truncated, 36 KB total)
Resource ID: daaf778f7ff52bc2 | Stable ID: MDQyNmMyZT