Longterm Wiki

Mapping the Mind of a Large Language Model

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

Landmark Anthropic research (May 2024) applying sparse autoencoders at scale to Claude Sonnet; widely considered a milestone in mechanistic interpretability and directly relevant to AI safety via feature-level understanding and intervention in deployed models.

Metadata

Importance: 90/100 · blog post · primary source

Summary

Anthropic reports a major interpretability breakthrough: using scaled dictionary learning (sparse autoencoders), they identified millions of human-interpretable features inside Claude Sonnet, marking the first detailed look inside a production-grade LLM. The work demonstrates that concepts like emotions, ethics, and safety-relevant behaviors can be located and manipulated within the model's internal representations.

Key Points

  • Scaled dictionary learning (sparse autoencoders) to a production LLM (Claude Sonnet), extracting millions of interpretable features from neuron activation patterns.
  • Features correspond to rich, multi-modal concepts including emotions, countries, famous people, and notably safety-relevant concepts like deception and manipulation.
  • Demonstrated causal intervention: artificially activating specific features measurably changes model behavior, confirming features are mechanistically meaningful.
  • Found evidence of 'dark triad' personality features and features related to the Assistant token, with implications for understanding sycophancy and safety.
  • Represents orders-of-magnitude scale-up from prior toy-model work, using scaling laws to tune the dictionary learning approach efficiently.
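The causal-intervention point above can be sketched in miniature. This is not Anthropic's code: the sizes are toy-scale and the random matrix below is only a stand-in for a trained sparse autoencoder's decoder, whose rows give each feature's direction in activation space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32                      # toy sizes; the paper works at millions of features
W_dec = rng.normal(size=(n_features, d_model))   # stand-in for a trained SAE decoder

def clamp_feature(activations, feature_idx, strength):
    """Steer the model by adding a multiple of one feature's decoder
    direction to the activation vector ("clamping" that feature active)."""
    return activations + strength * W_dec[feature_idx]

x = rng.normal(size=d_model)                     # stand-in for one internal activation vector
x_steered = clamp_feature(x, feature_idx=3, strength=10.0)
# Feeding x_steered back into the forward pass shifts outputs toward
# whatever concept feature 3 represents in the real model.
```

Because steering only adds a vector at one layer, it changes behavior without retraining, which is what makes the intervention evidence that features are mechanistically meaningful rather than merely correlational.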

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| AI Scaling Laws | Concept | 92.0 |
| Technical AI Safety Research | Crux | 66.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 16 KB
Interpretability

# Mapping the Mind of a Large Language Model

May 21, 2024

[Read the paper](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)

![](https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F80d6e033480704f5d57fbae4e3f0368d86a747ae-5761x3240.png&w=3840&q=75)

_Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer._

We mostly treat AI models as a black box: something goes in and a response comes out, and it's not clear why the model gave that particular response instead of another. This makes it hard to trust that these models are safe: if we don't know how they work, how do we know they won't give harmful, biased, untruthful, or otherwise dangerous responses? How can we trust that they’ll be safe and reliable?

Opening the black box doesn't necessarily help: the internal state of the model—what the model is "thinking" before writing its response—consists of a long list of numbers ("neuron activations") without a clear meaning. From interacting with a model like Claude, it's clear that it’s able to understand and wield a wide range of concepts—but we can't discern them from looking directly at neurons. It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.
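The superposition described above can be shown with a toy example (the directions and numbers here are invented purely for illustration): three concept directions packed into two neurons, so every neuron participates in several concepts and no single neuron "means" one thing.

```python
import numpy as np

# Three concept directions crammed into a 2-neuron space; they cannot all
# be axis-aligned, so each neuron's activation mixes several concepts.
concepts = np.array([
    [1.0, 0.0],      # "concept A"
    [-0.5, 0.87],    # "concept B"
    [-0.5, -0.87],   # "concept C"
])

def represent(intensities):
    """Neuron activations produced by a weighted mix of concepts."""
    return intensities @ concepts

acts = represent(np.array([1.0, 0.0, 0.5]))  # mostly concept A, some C
# acts is a 2-vector: reading either neuron alone mixes contributions
# from multiple concepts, which is why per-neuron inspection fails.
```

This is why the per-neuron view is uninformative: recovering the concepts requires finding the directions, which is what dictionary learning does.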

Previously, we made some progress matching _patterns_ of neuron activations, called features, to human-interpretable concepts. We used a technique called "dictionary learning", borrowed from classical machine learning, which isolates patterns of neuron activations that recur across many different contexts. In turn, any internal state of the model can be represented in terms of a few active features instead of many active neurons. Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features.
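A minimal sketch of the dictionary-learning idea, with untrained random weights standing in for the learned encoder and decoder (nothing here reflects Anthropic's actual implementation or scale): dense neuron activations are mapped to a larger, mostly-zero set of feature activations, from which the original activations can be reconstructed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32          # toy sizes; real runs extract millions of features
W_enc = rng.normal(size=(d_model, n_features))   # hypothetical learned encoder weights
W_dec = rng.normal(size=(n_features, d_model))   # hypothetical learned decoder weights
b_enc = np.zeros(n_features)

def encode(activations):
    """Map a dense activation vector to sparse feature activations (ReLU)."""
    return np.maximum(0.0, activations @ W_enc + b_enc)

def decode(features):
    """Reconstruct the original activations from the active features."""
    return features @ W_dec

x = rng.normal(size=d_model)          # stand-in for one internal state of the model
f = encode(x)
x_hat = decode(f)

# Training minimizes reconstruction error plus an L1 sparsity penalty,
# roughly ||x - x_hat||^2 + lam * ||f||_1, so most features stay at zero
# and each active feature can be matched to a human-interpretable concept.
```

With trained weights, each internal state decomposes into a few active features instead of many active neurons, mirroring the letters-to-words analogy in the paragraph above.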

In October 2023, [we reported](https://www.anthropic.com/news/decomposing-language-models-into-understandable-components) success applying dictionary learning to a very small "toy" language model and found coherent features corresponding to concepts like uppercase text, DNA sequences, surnames in citations, nouns in mathematics, or function arguments in Python code.

Those concepts were intriguing—but the model really was very simple. [Other](https://www.neuronpedia.org/gpt2-small/res-jb) [researchers](https://github.com/openai/transformer-debugger) [subsequently](https://arxiv.org/abs/2403.19647) [applied](https://arxiv.org/abs/2404.16014) similar 

... (truncated, 16 KB total)
Resource ID: 5019b9256d83a04c | Stable ID: YmY2ZmRjZT