Gemma Scope: Open Sparse Autoencoders for Gemma 2 Interpretability
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Google DeepMind
A significant practical resource for mechanistic interpretability researchers; provides ready-to-use sparse autoencoders for Gemma 2, removing a major infrastructure bottleneck for safety-relevant feature analysis.
Summary
Gemma Scope is an open-source suite of sparse autoencoders (SAEs) trained on Gemma 2 language models, released by Google DeepMind to support mechanistic interpretability research. It provides pre-trained SAEs across all layers and model sizes, enabling researchers to decompose internal activations into interpretable features. The release aims to lower the barrier for the safety community to study how language models process and represent information.
Key Points
- Releases a comprehensive suite of sparse autoencoders (SAEs) trained on Gemma 2 (2B and 9B) at every layer, covering residual stream, attention, and MLP activations.
- Designed to support mechanistic interpretability research by providing tools to identify and analyze interpretable features inside LLMs.
- Open release targets the broader AI safety community, removing the compute burden of training SAEs from scratch.
- Includes JumpReLU SAEs, a variant that improves the trade-off between reconstruction fidelity and sparsity relative to standard ReLU-based SAEs (see the sketch after this list).
- Accompanied by Neuronpedia integration for interactive feature visualization and exploration.
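For intuition, below is a minimal PyTorch sketch of a JumpReLU SAE. The class and its zero initializations are illustrative assumptions rather than the released implementation, but the parameter layout (encoder, decoder, biases, and a per-feature threshold) mirrors the structure described in the Gemma Scope release.

```python
import torch
import torch.nn as nn


class JumpReLUSAE(nn.Module):
    """Minimal JumpReLU sparse autoencoder (illustrative sketch)."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = nn.Parameter(torch.zeros(d_sae, d_model))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Per-feature learned threshold: JumpReLU zeroes any pre-activation
        # at or below it, which is what keeps the feature vector sparse.
        self.threshold = nn.Parameter(torch.zeros(d_sae))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = x @ self.W_enc + self.b_enc
        # JumpReLU(z) = z * 1[z > threshold]; values between 0 and the
        # threshold are cut to zero instead of passed through as in ReLU.
        return pre * (pre > self.threshold)

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        return feats @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruction: x is approximated as a sparse combination of
        # decoder directions, one per learned feature.
        return self.decode(self.encode(x))
```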
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Is Interpretability Sufficient for Safety? | Crux | 49.0 |
Cached Content Preview
July 31, 2024
# Gemma Scope: helping the safety community shed light on the inner workings of language models
Language Model Interpretability team

Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.
To create an artificial intelligence (AI) language model, researchers build a system that learns from vast amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a research field focused on deciphering these inner workings. Researchers in this field use sparse autoencoders as a kind of ‘microscope’ that lets them see inside a language model, and get a better sense of how it works.
Today, [we’re announcing Gemma Scope](https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/), a new set of tools to help researchers understand the inner workings of Gemma 2, our lightweight family of open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for [Gemma 2 9B](https://huggingface.co/google/gemma-2-9b) and [Gemma 2 2B](https://huggingface.co/google/gemma-2-2b). We're also open sourcing [Mishax](https://github.com/google-deepmind/mishax), a tool we built that enabled much of the interpretability work behind Gemma Scope.
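As a rough sketch, the SAE parameters can be pulled straight from the Hugging Face Hub. The repo ID and file path below follow the published layout of the Gemma Scope release (residual-stream SAEs for Gemma 2 2B, layer 20, 16k-wide dictionary, a given sparsity level), but treat them as assumptions and confirm against the model card:

```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download

# Assumed repo/file names -- verify against the Gemma Scope model cards.
path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",
    filename="layer_20/width_16k/average_l0_71/params.npz",
)
params = np.load(path)
# Expected keys: W_enc, W_dec, b_enc, b_dec, threshold (JumpReLU).
state_dict = {k: torch.from_numpy(params[k]) for k in params.files}
```

The resulting tensors can then be loaded into an SAE module, such as the JumpReLUSAE sketch above, via `load_state_dict`.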
We hope today’s release enables more ambitious interpretability research. Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents, such as deception or manipulation.
[Try our interactive Gemma Scope demo](https://www.neuronpedia.org/gemma-scope), courtesy of Neuronpedia.
## Interpreting what happens inside a language model
When you ask a language model a question, it turns your text input into a series of ‘activations’. These activations map the relationships between the words you’ve entered, and the model uses those connections to write an answer.
As the model processes text input, activations at different layers in the model’s neural network represent multiple increasingly advanced concepts, known as ‘features’.
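To make this concrete, here is a sketch of pulling the residual-stream activations at one layer of Gemma 2 2B with the Hugging Face transformers API; the layer index and the `sae` variable are placeholders for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

inputs = tokenizer("Michael Jordan plays basketball.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[i] is the residual stream after block i (index 0 is the
# embedding output), i.e. the activations an SAE would decompose.
resid = out.hidden_states[20]          # shape: (batch, seq_len, d_model)
# feats = sae.encode(resid)            # sparse, interpretable feature activations
```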
For example, a model’s early layers might learn to [recall facts](https://arxiv.org/abs/2202.05262) like that [Michael Jordan plays basketball](https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall), while later layers may recognize more complex concepts like the factuality of the text…
... (truncated, 10 KB total)