Longterm Wiki

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

This Anthropic research page introduces work in mechanistic interpretability on decomposing small transformer language models into understandable features, relevant to anyone studying how to understand or audit the internal representations of large language models.

Metadata

Importance: 82/100, blog post, primary source

Summary

Anthropic researchers present evidence that features, defined as patterns (linear combinations) of neuron activations, are better units of analysis than individual neurons, and describe machinery for finding these features in small transformer models. The accompanying paper, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, builds on earlier efforts to interpret high-dimensional systems in neuroscience, machine learning, and statistics. The post frames this as a path to breaking complex neural networks into parts that can be understood.

Key Points

  • Individual neurons lack a consistent relationship to network behavior: one neuron in a small language model is active on academic citations, English dialogue, HTTP requests, and Korean text.
  • Features, defined as linear combinations of neuron activations, are proposed as a better unit of analysis than individual neurons.
  • The accompanying paper applies dictionary learning to find these features in small transformer models.
  • Artificial networks are easier to study than brains: every activation can be recorded, neurons can be silenced or stimulated, and any input can be tested.
  • The motivation is practical: without understanding a model's internals, it is hard to diagnose failure modes, fix them, or certify that the model is safe.

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| AI-Assisted Alignment | Approach | 63.0 |
| Technical AI Safety Research | Crux | 66.0 |

Cached Content Preview

HTTP 200, fetched Mar 20, 2026, 7 KB
Interpretability, Research

# Decomposing Language Models Into Understandable Components

Oct 5, 2023

![](https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F01458fcaec292524542130b0ccd91becc1e19ba7-2880x1620.png&w=3840&q=75)

Neural networks are trained on data, not programmed to follow rules. With each step of training, millions or billions of parameters are updated to make the model better at tasks, and by the end, the model is capable of a dizzying array of behaviors. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.

Neuroscientists face a similar problem with understanding the biological basis for human behavior. The neurons firing in a person's brain must somehow implement their thoughts, feelings, and decision-making. Decades of neuroscience research have revealed a lot about how the brain works, and enabled targeted treatments for diseases such as epilepsy, but much remains mysterious. Luckily for those of us trying to understand artificial neural networks, experiments are much, much easier to run. We can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.
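
The three operations named above (recording every activation, silencing a neuron, and testing an arbitrary input) can be sketched on a toy network. The network, its weights, and the names `forward`, `W_IN`, and `W_OUT` are hypothetical and chosen only for illustration; nothing here is taken from Anthropic's code.

```python
# Toy one-hidden-layer network; all weights are made-up illustrative values.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 hidden neurons, 2 inputs
W_OUT = [[1.0, -1.0, 0.5]]                    # 1 output, 3 hidden neurons


def forward(x, w_in, w_out, ablate=None, record=None):
    """Run the network on input x.

    ablate: optional set of hidden-neuron indices to silence (forced to 0).
    record: optional list that receives the hidden activations.
    """
    hidden = []
    for j, row in enumerate(w_in):
        # ReLU of a weighted sum of the inputs
        a = max(0.0, sum(w * xi for w, xi in zip(row, x)))
        if ablate and j in ablate:
            a = 0.0  # intervention: silence this neuron
        hidden.append(a)
    if record is not None:
        record.extend(hidden)  # recording: expose every activation
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


x = [2.0, 1.0]
recorded = []
baseline = forward(x, W_IN, W_OUT, record=recorded)
ablated = forward(x, W_IN, W_OUT, ablate={2})

# The change in output attributes part of the behavior to neuron 2.
effect = baseline[0] - ablated[0]
```

Comparing `baseline` with `ablated` is the basic shape of an ablation experiment: the difference in output is the contribution that flowed through the silenced neuron on this particular input.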

Unfortunately, it turns out that the individual neurons do not have consistent relationships to network behavior. For example, [a single neuron](https://transformer-circuits.pub/2023/monosemantic-features/vis/a-neurons.html#feature-83) in a small language model is active in many unrelated contexts, including: academic citations, English dialogue, HTTP requests, and Korean text. In a classic vision model, a [single neuron](https://distill.pub/2017/feature-visualization/#diversity) responds to faces of cats and fronts of cars. The activation of one neuron can mean different things in different contexts.

![](https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2Fd7df7cb9e3c40cda2ce47926f856ec557974b48d-1259x788.png&w=3840&q=75)

In our latest paper, [_Towards Monosemanticity: Decomposing Language Models With Dictionary Learning_](https://transformer-circuits.pub/2023/monosemantic-features/index.html), we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand, and builds on previous efforts to interpret high-dimensional systems in neuroscience, machine learning, and statistics.
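
One way to picture a feature as a linear combination of neuron activations is a sparse autoencoder, a common form of dictionary learning. The sketch below assumes that form: the ReLU encoder, the L1 sparsity penalty, and all names (`encode`, `decode`, `W_ENC`, `W_DEC`, `L1_COEFF`) are assumptions for illustration and are not stated in this excerpt, and the weights are hand-picked rather than trained.

```python
# Hypothetical dictionary with more features (3) than neurons (2).
W_ENC = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]  # one row per feature
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # feature i -> direction in neuron space
L1_COEFF = 0.5                                  # assumed sparsity weight


def encode(x, w_enc, b_enc):
    """Each feature activation is a ReLU of a linear combination of neuron activations."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def decode(f, w_dec):
    """Rebuild the neuron activations as a weighted sum of feature directions."""
    n_neurons = len(w_dec[0])
    return [sum(f[i] * w_dec[i][k] for i in range(len(f)))
            for k in range(n_neurons)]


def loss(x, x_hat, f, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that rewards sparse codes."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return mse + l1_coeff * sum(abs(v) for v in f)


x = [2.0, 2.0]                      # both neurons active at once
features = encode(x, W_ENC, B_ENC)  # only the third feature fires
reconstruction = decode(features, W_DEC)
total = loss(x, reconstruction, features, L1_COEFF)
```

In this toy case two simultaneously active neurons are explained by a single active feature, which is the sense in which a feature can be a cleaner unit of analysis than a neuron. A real dictionary would be learned by minimizing a loss of this kind over many recorded activations.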

In a transfor

... (truncated, 7 KB total)
Resource ID: c355237bfc2d213d | Stable ID: OGMyMTYzM2