Sparse Autoencoders (SAEs)

Comprehensive review of sparse autoencoders (SAEs) for mechanistic interpretability, covering Anthropic's 34M features from Claude 3 Sonnet (90% interpretable), OpenAI's 16M-latent GPT-4 SAE, DeepMind's Gemma Scope releases (1T+ parameters across all SAEs), and Goodfire's $50M Series A and SAEs for the 671B-parameter DeepSeek R1. Despite promising safety applications, including deception-detection features, DeepMind's March 2025 negative results showed SAEs underperforming simple linear probes on downstream tasks. Global investment is estimated at $75-150M/year with 150-200 researchers.


Overview

Sparse Autoencoders (SAEs) represent a breakthrough technique in mechanistic interpretability that addresses the fundamental challenge of neural network polysemanticity. In modern language models, individual neurons often respond to multiple unrelated concepts (e.g., a single neuron activating for both "the Golden Gate Bridge" and "requests for help"), making direct interpretation of neural activations extremely difficult. SAEs solve this by learning to decompose dense, polysemantic activations into sparse, monosemantic feature vectors where each dimension corresponds to a single interpretable concept.

The technique works by training an auxiliary neural network to reconstruct model activations through a bottleneck that encourages sparse representations. When trained on billions of activation samples, SAEs discover features that correspond to human-interpretable concepts ranging from concrete entities like "San Francisco" to abstract notions like "deception in political contexts." Anthropic's landmark 2024 work extracted over 34 million interpretable features from Claude 3 Sonnet, with automated evaluation finding that 90% of high-activating features have clear human-interpretable explanations.

For AI safety, SAEs offer a potentially transformative capability: the ability to directly detect safety-relevant cognition inside models. Researchers have identified features corresponding to lying, manipulation, security vulnerabilities, power-seeking behavior, and sycophancy. If SAE research scales successfully, it could provide the foundation for runtime monitoring systems that flag concerning internal states, deception detection during training, and verification that alignment techniques actually work at the mechanistic level.

Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Research Investment | High ($50-100M/yr) | Goodfire raised $50M Series A (April 2025); Anthropic, DeepMind, OpenAI dedicated teams |
| Feature Extraction Scale | 34M+ features | Anthropic 2024 extracted 34M from Claude 3 Sonnet with 90% interpretability scores |
| Model Coverage | 100B+ parameters | Works on Claude 3 Sonnet, GPT-4, Gemma 2/3 up to 27B; Goodfire released SAEs for DeepSeek R1 (671B) |
| Safety Impact | Low-Medium (current) | Promising, but DeepMind deprioritized after SAEs underperformed linear probes on downstream safety tasks |
| Capability Uplift | Neutral | Analysis tool only; does not improve model capabilities |
| Deception Detection | Partial | Safety-relevant features identified (lying, sycophancy, power-seeking); causal validation 75-85% success |
| SI Readiness | Unknown | Depends on whether interpretability scales to superintelligent systems; fundamental limits untested |
| Grade | B | High potential, strong feature discovery; unproven for operational safety applications |

| Source | Link |
|---|---|
| Official Website | cdn.openai.com |
| Wikipedia | en.wikipedia.org |
| arXiv | arxiv.org |

Research Comparison: Major SAE Studies

The following table compares key research efforts in sparse autoencoder development across major AI labs:

| Organization | Publication | Model Target | Features Extracted | Key Findings | Scale/Cost |
|---|---|---|---|---|---|
| Anthropic | May 2024 | Claude 3 Sonnet | 34 million | 90% of high-activating features interpretable; safety-relevant features for deception, sycophancy identified | 8B activations; 16x expansion |
| OpenAI | June 2024 | GPT-4 | 16 million | Smooth scaling laws; k-sparse architecture eliminates dead latents; 10% compute-equivalence loss | 40B tokens trained |
| DeepMind | August 2024 | Gemma 2 (2B-27B) | 30+ million | JumpReLU architecture; open-source release for community research | 20 PiB activations; 15% of Gemma 2 9B training compute |
| DeepMind | December 2024 | Gemma 3 (270M-27B) | 1T+ params across all SAEs | Combines SAEs with transcoders; analyzes jailbreaks and chain-of-thought | 110 PB activation data |
| Goodfire | January 2025 | Llama 3.1 8B, Llama 3.3 70B | Not disclosed | State-of-the-art open-source SAEs; granular behavior steering | Open-source release |
| Goodfire | April 2025 | DeepSeek R1 (671B) | Not disclosed | First SAEs on a true reasoning model; qualitatively different from non-reasoning models | 671B-parameter scale |
| EleutherAI | 2024 | GPT-2, open models | 1.5 million | Automated interpretation at $1,300 (Llama 3.1) vs $8,500 (Claude 3.5); open tools | 97% cost reduction vs prior methods |

Investment Landscape

| Organization | Estimated Investment | Team Size | Focus Areas | Key Products/Releases |
|---|---|---|---|---|
| Anthropic | $25-40M/year | 30-50 FTE | Scaling monosemanticity, circuit tracing, safety features | Claude SAEs, Attribution Graphs |
| Google DeepMind | $15-25M/year | 20-30 FTE | Open-source tools, benchmarking, architectural innovation | Gemma Scope 1 & 2 (110 PB data) |
| OpenAI | $10-15M/year | 10-20 FTE | Scaling laws, k-sparse architecture | GPT-4 SAEs, TopK methods |
| Goodfire | $50M raised (Series A) | 15-25 FTE | Commercial interpretability, open-source models | Ember platform, Llama/DeepSeek SAEs |
| Academic Sector | $10-20M/year | 30-50 FTE | Theoretical foundations, benchmarking, applications | MIT thesis work, InterpBench |
| Total Global | $75-150M/year | 150-200 FTE | | |

SAE Architecture Evolution

| Architecture | Year | Key Innovation | Trade-offs |
|---|---|---|---|
| Vanilla ReLU + L1 | 2023 | Original formulation | Dead latents; requires penalty tuning |
| Gated SAE | 2024 | Separate magnitude/selection paths | Better reconstruction-sparsity frontier |
| JumpReLU | 2024 | Threshold activation function | State of the art for Gemma Scope |
| BatchTopK | 2024 | Directly set sparsity without a penalty | Few dead latents; training stability |
| Transcoders | 2025 | Predict next-layer activations | Better for analyzing computations than representations |
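The activation functions that distinguish these architectures are simple to state. The following NumPy sketch is illustrative only (not any release's actual code); it shows how TopK-style and JumpReLU-style activations enforce sparsity directly, without an L1 penalty:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def topk(z, k):
    """Keep the k largest pre-activations per sample, zero the rest (k-sparse style)."""
    out = np.zeros_like(z)
    idx = np.argsort(z, axis=-1)[..., -k:]  # indices of the k largest pre-activations
    np.put_along_axis(out, idx, np.take_along_axis(relu(z), idx, axis=-1), axis=-1)
    return out

def jumprelu(z, theta):
    """Zero activations below a (normally learned, per-feature) threshold theta."""
    return np.where(z > theta, z, 0.0)

z = np.array([[0.1, -0.5, 2.0, 0.7]])
print(topk(z, 2))        # only the two largest pre-activations survive
print(jumprelu(z, 0.5))  # only values above the threshold survive
```

Both variants sidestep the L1 penalty-tuning problem of the vanilla architecture: TopK fixes the sparsity level exactly, while JumpReLU learns per-feature thresholds.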

How SAEs Work

Technical Architecture

SAEs are encoder-decoder neural networks trained to reconstruct activation vectors through a sparse intermediate representation:

```mermaid
flowchart TD
  ACT[Model Activations] --> ENC[Encoder]
  ENC --> SPARSE[Sparse Features]
  SPARSE --> DEC[Decoder]
  DEC --> RECON[Reconstructed Activations]

  SPARSE --> INTERP[Interpretable Concepts]
  INTERP --> SAFETY[Safety-Relevant Features]
  INTERP --> KNOWLEDGE[Factual Knowledge]
  INTERP --> REASONING[Reasoning Patterns]

  SAFETY --> DECEPTION[Deception Detection]
  SAFETY --> MANIPULATION[Manipulation]
  SAFETY --> HARMFUL[Harmful Intent]

  style ACT fill:#e1f5ff
  style SPARSE fill:#d4edda
  style SAFETY fill:#fff3cd
  style INTERP fill:#f0f0f0
```

The key innovation is the sparsity constraint: during training, the encoder is penalized for activating too many features simultaneously. This forces the network to find a small set of highly relevant features for any given input, naturally leading to monosemantic representations where each feature captures a distinct concept.
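The encode-decode cycle and sparsity penalty described above fit in a few lines. This is an illustrative NumPy toy with randomly initialized weights and no training loop, not any lab's actual implementation; the dimensions are chosen for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # 4x expansion factor; real SAEs use far larger dimensions
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """One forward pass: encode to sparse features, decode, compute the training loss."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU encoder -> nonnegative features
    x_hat = f @ W_dec + b_dec                # linear decoder reconstruction
    recon_loss = ((x - x_hat) ** 2).sum(axis=-1).mean()
    sparsity_loss = l1_coeff * np.abs(f).sum(axis=-1).mean()  # L1 penalty drives sparsity
    return f, x_hat, recon_loss + sparsity_loss

x = rng.normal(size=(8, d_model))  # a batch standing in for model activations
f, x_hat, loss = sae_forward(x)
print(f.shape, x_hat.shape)  # (8, 64) (8, 16)
```

Training minimizes the combined loss by gradient descent over billions of such batches; the L1 term is what pushes most entries of `f` to exactly zero.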

SAE Research Pipeline

The following diagram illustrates the complete pipeline from training SAEs to safety applications:

```mermaid
flowchart TD
  subgraph Training["Training Phase"]
      LM[Language Model] --> ACT[Collect Activations]
      ACT --> STORE[Store 10-100+ PB Data]
      STORE --> TRAIN[Train SAE]
      TRAIN --> FEAT[Extract Features]
  end

  subgraph Analysis["Analysis Phase"]
      FEAT --> AUTO[Automated Interpretation]
      AUTO --> VALID[Human Validation]
      VALID --> CATALOG[Feature Catalog]
  end

  subgraph Applications["Safety Applications"]
      CATALOG --> MONITOR[Runtime Monitoring]
      CATALOG --> STEER[Activation Steering]
      CATALOG --> VERIFY[Alignment Verification]
      CATALOG --> CIRCUIT[Circuit Analysis]
  end

  subgraph Tools["Research Tools"]
      CATALOG --> NEURON[Neuronpedia]
      CATALOG --> LENS[SAELens/TransformerLens]
  end

  style Training fill:#e1f5ff
  style Analysis fill:#d4edda
  style Applications fill:#fff3cd
  style Tools fill:#f0f0f0
```

Training Process

| Stage | Description | Computational Cost |
|---|---|---|
| Activation Collection | Record model activations on billions of tokens | Storage-intensive (110 PB for Gemma Scope) |
| SAE Training | Train encoder-decoder with L1 sparsity penalty | $1-10M compute for frontier models |
| Feature Analysis | Automated labeling using interpretable AI | $1,300-8,500 per 1.5M features |
| Validation | Verify features have causal effects via steering | Medium compute; manual effort |

Key Technical Parameters

  • Expansion Factor: Ratio of SAE dictionary size to original activation dimensions (typically 4-64x)
  • Sparsity Penalty (L1): Strength of penalty on feature activation; higher values yield sparser but potentially less accurate reconstructions
  • Reconstruction Loss: How well SAE outputs match original activations; fundamental accuracy metric
  • Dead Features: Features that never activate; indicator of training difficulties
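These parameters are straightforward to measure on a trained SAE. A small NumPy sketch of the diagnostics, run here on a tiny hypothetical feature matrix (tokens × dictionary entries):

```python
import numpy as np

def sae_metrics(features, d_model):
    """Diagnostics over a batch of SAE feature activations (tokens x dictionary)."""
    l0 = (features > 0).sum(axis=1).mean()            # avg active features per token
    dead = ((features > 0).sum(axis=0) == 0).mean()   # fraction of features that never fire
    expansion = features.shape[1] / d_model           # dictionary size vs activation dim
    return l0, dead, expansion

feats = np.array([[0.0, 1.2, 0.0, 0.4],
                  [0.0, 0.0, 0.0, 2.1]])
l0, dead, expansion = sae_metrics(feats, d_model=2)
print(l0, dead, expansion)  # 1.5 0.5 2.0
```

In practice these statistics are tracked over millions of tokens; a rising dead-feature fraction or drifting L0 is an early sign of a mis-tuned sparsity penalty.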

Major Research Milestones

Anthropic's Scaling Monosemanticity (May 2024)

The landmark result that demonstrated SAEs work at frontier model scale. This represented a major scaling milestone—eight months prior, Anthropic had only demonstrated SAEs on a small one-layer transformer, and it was unclear whether the method would scale to production models.

| Metric | Result | Context |
|---|---|---|
| Model | Claude 3 Sonnet | 3.0 version released March 2024 |
| Features Extracted | 1M, 4M, and 34M | Three SAE sizes tested |
| Automated Interpretability Score | 90% | High-activating features with clear explanations |
| Training Data | 8 billion residual-stream activations | From a diverse text corpus |
| Expansion Factor | 83x to 2833x | Ratio of features to residual-stream dimension |
| Average Features per Token | ≈300 | Sparse from thousands of dense activations |

The resulting features exhibit remarkable abstraction: they are multilingual, multimodal, and generalize between concrete and abstract references. Critically, researchers found safety-relevant features including:

| Feature Category | Examples Found | Safety Relevance |
|---|---|---|
| Deception | Lying, dishonesty patterns | Direct alignment concern |
| Security | Code backdoors, vulnerabilities | Dual-use risk |
| Manipulation | Persuasion, bias injection | Influence operations |
| Power-seeking | Goal-directed behavior patterns | Instrumental convergence |
| Sycophancy | Agreement regardless of truth | Reward-hacking indicator |

DeepMind's Gemma Scope (August 2024) and Gemma Scope 2 (December 2024)

Gemma Scope represented the first major open-source SAE release, followed by the substantially larger Gemma Scope 2—described as the largest open-source interpretability release by an AI lab to date.

| Metric | Gemma Scope (Aug 2024) | Gemma Scope 2 (Dec 2024) |
|---|---|---|
| Model Coverage | Gemma 2 (2B, 9B, 27B) | Gemma 3 (270M to 27B) |
| Total Features | 30+ million | Comparable scale |
| Training Compute | ≈15% of Gemma 2 9B training | Not disclosed |
| Storage Requirements | 20 PiB activations | 110 PB activation data |
| Total Parameters | Hundreds of billions | 1+ trillion across all SAEs |
| Architecture | JumpReLU SAEs | SAEs + transcoders |

Gemma Scope 2 introduces transcoders alongside SAEs, a key advance: rather than reconstructing the current layer's activations, a transcoder predicts the next layer's activations. This enables analysis of multi-step computations and behaviors like chain-of-thought faithfulness. The release specifically targets safety-relevant capabilities: analyzing jailbreak mechanisms, understanding refusal behaviors, and evaluating reasoning faithfulness.

2025 Developments: Expansion and Critical Assessment

2025 has seen both major expansion and critical reassessment of SAE research. Goodfire's $50M Series A in April 2025 represented the largest dedicated interpretability investment to date, while DeepMind's deprioritization announcement highlighted ongoing challenges.

| Development | Date | Significance |
|---|---|---|
| Goodfire SAEs for Llama 3.1/3.3 | January 2025 | First high-quality open-source SAEs for frontier Llama models |
| A Survey on Sparse Autoencoders | March 2025 | Comprehensive academic review of SAE methods, training, and evaluation |
| DeepMind deprioritization | March 2025 | SAEs underperformed linear probes on safety tasks; team shifted focus |
| Goodfire $50M Series A | April 2025 | Largest dedicated interpretability funding; Ember platform expansion |
| Goodfire DeepSeek R1 SAEs | April 2025 | First SAEs on a 671B reasoning model; revealed qualitative differences |
| Protein language model SAEs | 2025 | SAE techniques extended to biological foundation models |

Negative Results and Limitations

DeepMind's March 2025 announcement that they were deprioritizing SAE research highlights important limitations:

| Finding | Implication | DeepMind Assessment |
|---|---|---|
| SAEs underperformed linear probes on harmful-intent detection | Simpler methods may suffice for some safety applications | "Linear probes are actually really good, cheap, and perform great" |
| Chat-specialized SAEs closed only ≈50% of the gap | Domain-specific training helps but doesn't solve the problem | Still worse than linear probes |
| Features may not be functionally important | Interpretable features ≠ causally relevant features | "Do not think SAEs will be a game-changer" |
| High training costs with diminishing returns | Compute-intensive with unclear safety ROI | Team shifted to model diffing, deception model organisms |

However, DeepMind noted SAEs remain "fairly helpful for debugging low quality datasets (noticing spurious correlations)" and left open the possibility of returning to SAE research "if there is significant progress on some of SAEs' core issues."
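For context, the linear-probe baseline that outperformed SAEs here is just a logistic-regression classifier trained directly on raw activations. A minimal sketch on synthetic data; the "harmful intent" direction and labels below are invented for illustration, not taken from any real model:

```python
import numpy as np

def train_linear_probe(acts, labels, epochs=200, lr=0.5):
    """Logistic-regression probe on raw activations: the cheap baseline SAEs were compared to."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # sigmoid over logits
        grad = p - labels                          # d(cross-entropy)/d(logits)
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
direction = rng.normal(size=32)                    # hypothetical concept direction
acts = rng.normal(size=(200, 32))                  # stand-in for model activations
labels = (acts @ direction > 0).astype(float)      # synthetic labels along that direction
w, b = train_linear_probe(acts, labels)
preds = (acts @ w + b > 0).astype(float)
print((preds == labels).mean())                    # high accuracy on this separable toy task
```

The appeal of this baseline is exactly what DeepMind highlighted: it needs only labeled examples and seconds of compute, with no multi-million-dollar dictionary-training step.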

Quantified Performance Metrics

| Metric | Value | Source | Notes |
|---|---|---|---|
| Feature interpretability rate | 90% | Anthropic 2024 | High-activating features with clear explanations |
| Average features per token | ≈300 | Anthropic 2024 | Sparse representation from thousands of dense activations |
| Reconstruction loss (GPT-4 SAE) | 10% compute equivalent | OpenAI 2024 | Language-modeling loss increase when the SAE is substituted |
| Automated interpretation cost | $1,300-8,500 per 1.5M features | EleutherAI 2024 | Llama 3.1 vs Claude 3.5 Sonnet |
| Prior interpretation methods | ≈$200,000 per 1.5M features | EleutherAI 2024 | 97% cost reduction achieved |
| Storage requirements (Gemma Scope 2) | 110 PB | DeepMind 2024 | Largest open-source interpretability release |
| Dead-latent rate (TopK architecture) | Near zero | OpenAI 2024 | vs significant dead features with ReLU+L1 |

Activation Steering Applications

SAE features enable direct intervention in model behavior by clamping or modifying specific feature activations during inference. This "steering" capability represents a practical application beyond interpretability:

| Steering Application | Method | Effectiveness | Trade-offs |
|---|---|---|---|
| Refusal steering | Amplify refusal-mediating features | Improves jailbreak resistance | Degrades general capabilities |
| SAE-Targeted Steering | Optimize steering vectors for specific effects | Outperforms baseline methods | Requires feature identification |
| Feature Guided Activation Additions | SAE-guided vector construction | Better coherence than CAA | More complex pipeline |
| Graph-regularized SAEs (GSAE) | Spectral vector bank with dual gating | Selective, stable steering | Added architectural complexity |

Research has uncovered a fundamental tension: features mediating safety behaviors like refusal appear entangled with general capabilities. Steering for improved safety often degrades benchmark performance, suggesting safety-relevant features may not be cleanly separable from capability-relevant ones.
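Mechanically, clamping is simple: encode the activation, measure how far the target feature is from the desired value, and add that difference times the feature's decoder direction back into the activation. A toy NumPy sketch with random weights and a hypothetical feature index (real steering operates on trained SAE weights inside a live forward pass):

```python
import numpy as np

def steer(x, W_enc, W_dec, feature_idx, clamp_value):
    """Clamp one SAE feature and add the resulting decoder delta back to the activation."""
    f = np.maximum(x @ W_enc, 0.0)                 # encode to feature activations
    delta = clamp_value - f[..., feature_idx]      # how far the feature must move
    return x + np.outer(delta, W_dec[feature_idx]) # add the feature's decoder direction

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
x = rng.normal(size=(4, d_model))                  # stand-in residual-stream activations
x_steered = steer(x, W_enc, W_dec, feature_idx=3, clamp_value=5.0)
print(x_steered.shape)  # (4, 8)
```

The entanglement problem shows up here directly: because decoder directions are not orthogonal, pushing one feature's direction also moves the projections of many other features.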

Safety Applications

Deception Detection

SAEs could enable direct detection of deceptive cognition by identifying when deception-related features activate:

| Application | Mechanism | Current Status |
|---|---|---|
| Runtime Monitoring | Flag when deception features activate during inference | Theoretical; not deployed |
| Training-time Detection | Identify deceptive patterns during fine-tuning | Experimental research |
| Alignment Verification | Confirm models have learned intended values | Early-stage research |
| Red-teaming Augmentation | Find adversarial prompts that activate concerning features | Growing usage |

Limitations for Safety

Even with successful SAE development, fundamental challenges remain:

  • Coverage: SAEs may not capture all safety-relevant features
  • Adversarial Robustness: Sophisticated models might learn to hide concerning cognition
  • Interpretation Accuracy: Human labels may not capture true feature meanings
  • Causal Relevance: Features that activate for deception may not cause deceptive outputs

Key Research Groups

| Organization | Focus | Key Contributions | Scale |
|---|---|---|---|
| Anthropic | Leading SAE development and scaling | Scaling Monosemanticity; Claude SAEs; original monosemanticity work | 34M features; frontier models |
| OpenAI | Scaling methodology | GPT-4 SAEs; TopK architecture; scaling-laws research | 16M latents; 40B-token training |
| DeepMind | Open-source tools and benchmarking | Gemma Scope 1 & 2; JumpReLU; negative results | 1T+ parameters; 110 PB data |
| Goodfire | Commercial interpretability, open-source SAEs | DeepSeek R1 SAEs; Llama SAEs; Ember platform | $50M Series A; 671B model scale |
| EleutherAI | Open-source interpretability | Automated interpretation; Delphi; Sparsify; cost reduction | 97% cost reduction |
| MATS Alumni | Foundational research | Original SAE paper; SAELens; community tools | Open-source ecosystem |
| Neuronpedia | Visualization and tooling | Feature explorer; 50M+ searchable latents; API access | 4+ TB of data hosted |

Research Ecosystem

```mermaid
flowchart TD
  subgraph Labs["Frontier Labs"]
      ANTH[Anthropic]
      OAI[OpenAI]
      GDM[Google DeepMind]
  end

  subgraph Commercial["Commercial Interpretability"]
      GF[Goodfire]
      NP[Neuronpedia]
  end

  subgraph OpenSource["Open-Source Ecosystem"]
      ELEU[EleutherAI]
      MATS[MATS Alumni]
      LENS[SAELens/TransformerLens]
  end

  subgraph Applications["Safety Applications"]
      DEC[Deception Detection]
      STEER[Activation Steering]
      MONITOR[Runtime Monitoring]
      VERIFY[Alignment Verification]
  end

  ANTH --> |34M features| GF
  ANTH --> |Methods| MATS
  GDM --> |Gemma Scope| ELEU
  OAI --> |TopK SAEs| LENS
  GF --> |Ember platform| Applications
  ELEU --> |Delphi| Applications
  NP --> |50M+ latents| Applications

  style Labs fill:#e1f5ff
  style Commercial fill:#d4edda
  style OpenSource fill:#fff3cd
  style Applications fill:#f8d7da
```

Arguments For Prioritization

  1. Unique Capability: SAEs may be necessary for detecting sophisticated deception that behavioral evals cannot catch
  2. No Capability Uplift: Pure safety research with minimal dual-use concerns
  3. Proven at Scale: Works on 100B+ parameter models, suggesting path to frontier
  4. Foundation for Other Work: Enables representation engineering, activation steering, and monitoring

Arguments Against Prioritization

  1. May Not Scale: Fundamental limits on interpretability possible
  2. Expensive: Significant compute and researcher time required
  3. Limited Safety Impact So Far: No operational safety applications despite years of research
  4. Alternative Approaches: Linear probes and behavioral methods may be more cost-effective
  5. False Confidence Risk: Partial interpretability might create false assurance

Key Uncertainties

| Uncertainty | Current Evidence | Importance | Resolution Timeline |
|---|---|---|---|
| Causal relevance of features | Mixed; steering works but effects are entangled | Critical for safety applications | 2025-2027 |
| Adversarial robustness | Untested; models could learn to evade feature detection | High for deployment | Unknown |
| Coverage completeness | Current SAEs capture a subset of model behavior | Medium; partial coverage may suffice | 2025-2026 |
| Scaling to superintelligent systems | No evidence; extrapolation uncertain | Very high for long-term safety | Depends on AI timeline |
| Transcoders vs SAEs | Early evidence favors transcoders for some applications | Medium; may be complementary | 2025 |
| Feature universality across models | Similar features found across architectures | Medium for transfer learning | 2025-2026 |

Risks Addressed

| Risk | Mechanism | Effectiveness |
|---|---|---|
| Deceptive Alignment | Detect deception-related features during inference | Medium-High (if scalable) |
| Scheming | Find evidence of strategic deception in internal representations | Medium-High |
| Mesa-Optimization | Identify mesa-objectives in model internals | Medium |
| Reward Hacking | Detect proxy optimization vs true goals | Medium |
| Goal Misgeneralization | Understand learned goal representations | Medium |

Recommendation

Recommendation Level: INCREASE

SAEs represent one of the most promising technical approaches to the fundamental problem of understanding AI cognition. While current safety applications remain limited, the potential for detecting sophisticated deception justifies increased investment. The technique has no meaningful capability uplift, making it a safe area for expanded research.

Priority areas for additional investment:

  • Scaling to larger models and more comprehensive feature coverage
  • Developing robust automated evaluation methods
  • Building operational monitoring systems based on SAE features
  • Investigating adversarial robustness of SAE-based detection

Related approaches:

  • Interpretability - Parent field; SAEs are a key technique within mechanistic interpretability
  • Probing / Linear Probes - Alternative method using linear probes; DeepMind found probes outperform SAEs on some tasks
  • Representation Engineering - Uses SAE-discovered features for activation steering
  • AI Control - Complementary defense-in-depth approach if interpretability misses deception

Sources & Resources

Primary Research

| Source | Organization | Date | Key Contribution |
|---|---|---|---|
| Scaling Monosemanticity | Anthropic | May 2024 | 34M features from Claude 3 Sonnet; safety-relevant feature discovery |
| Extracting Concepts from GPT-4 | OpenAI | June 2024 | 16M-latent SAE; scaling laws; TopK architecture |
| Scaling and Evaluating Sparse Autoencoders | OpenAI | June 2024 | Technical methodology paper; k-sparse autoencoders |
| Gemma Scope | DeepMind | August 2024 | Open-source SAEs for Gemma 2; JumpReLU architecture |
| Gemma Scope 2 | DeepMind | December 2024 | Largest open release; SAEs + transcoders for Gemma 3 |
| Sparse Autoencoders Find Highly Interpretable Features | Anthropic/MATS | September 2023 | Foundational SAE methodology paper |

Critical Perspectives and Limitations

| Source | Organization | Date | Key Finding |
|---|---|---|---|
| Negative Results for SAEs on Downstream Tasks | DeepMind | March 2025 | SAEs underperform linear probes; led to SAE research deprioritization |
| Open Source Automated Interpretability | EleutherAI | 2024 | 97% cost reduction for feature interpretation; open tools |
| Transcoders Beat Sparse Autoencoders | Various | January 2025 | Skip transcoders Pareto-dominate SAEs for interpretability |

Tools and Platforms

| Tool | URL | Description |
|---|---|---|
| Neuronpedia | neuronpedia.org | Interactive SAE feature explorer; 50M+ searchable latents; live inference testing |
| SAELens | github.com/jbloomAus/SAELens | SAE training library; supports multiple architectures |
| TransformerLens | github.com/neelnanda-io/TransformerLens | Interpretability library with SAE integration |
| Delphi | github.com/EleutherAI/delphi | Automated feature interpretation pipeline |
| EleutherAI Sparsify | github.com/EleutherAI/sparsify | On-the-fly activation training without caching |

References

1. Scaling Monosemanticity · Anthropic · May 2024

Anthropic applies sparse autoencoders (SAEs) to extract millions of interpretable, monosemantic features from Claude 3 Sonnet, a large production-scale language model. The work demonstrates that features learned at scale are human-interpretable, multimodal, and exhibit meaningful geometric structure including concept hierarchies. This represents a major step toward mechanistic interpretability of frontier AI systems.

★★★★☆
2. Deprioritizing SAE research · Medium · Blog post

DeepMind's mechanistic interpretability team reports that sparse autoencoders (SAEs) underperformed simpler linear probes on out-of-distribution detection of harmful intent in user prompts. Based on these negative results and parallel work, the team has decided to deprioritize fundamental SAE research. The post also highlights that linear probes are cheap, effective alternatives for this downstream safety task.

★★☆☆☆

3. Sparse Autoencoders Find Highly Interpretable Features · Anthropic/MATS · September 2023

This paper addresses polysemanticity in neural networks—where individual neurons activate across multiple unrelated contexts—by proposing sparse autoencoders to identify interpretable features in language models. The authors hypothesize that polysemanticity arises from superposition, where networks represent more features than neurons by using overcomplete directions in activation space. Their sparse autoencoder approach successfully recovers monosemantic (single-meaning) features that are more interpretable than existing methods, and demonstrates causal interpretability by identifying which features drive specific model behaviors on the indirect object identification task. This scalable, unsupervised method offers a foundation for mechanistic interpretability research and improved model transparency.

★★★☆☆

4. Extracting Concepts from GPT-4 · OpenAI · June 2024

OpenAI researchers present work on extracting human-interpretable concepts from GPT-4's internal representations using sparse autoencoders or similar dictionary learning methods. The research aims to identify meaningful features encoded in the model's activations, advancing mechanistic interpretability of large language models.

★★★★☆

5. Gemma Scope 2 · Google DeepMind · December 2024

Gemma Scope 2 is DeepMind's updated suite of sparse autoencoders (SAEs) trained on Gemma language models, released to help the AI safety research community better interpret and understand internal representations in large language models. Building on the original Gemma Scope release, it expands coverage and capability to enable more detailed mechanistic interpretability research. The tool is designed to lower barriers for researchers studying features, circuits, and complex behaviors in transformer models.

★★★★☆

6. Open Source Automated Interpretability · EleutherAI · 2024

EleutherAI introduces an open-source pipeline for automated interpretability of neural network features, particularly targeting sparse autoencoder (SAE) features. The project automates the process of generating natural language explanations for model internals and scoring their quality, making mechanistic interpretability research more scalable and accessible. It builds on prior work like OpenAI's automated interpretability but releases tooling publicly.

7. Sparse autoencoders uncover biologically interpretable features in protein language model representations · PNAS (peer-reviewed) · Onkar Gujral, Mihir Bafna, Eric Alm & Bonnie Berger · 2025
★★★★★

Related Wiki Pages

Top Related Pages

Safety Research

Anthropic Core Views

Risks

Reward Hacking · Scheming · Goal Misgeneralization · Sycophancy · Mesa-Optimization

Analysis

Model Organisms of Misalignment · Capability-Alignment Race Model

Approaches

Probing / Linear Probes · Sleeper Agent Detection

Other

Interpretability · AI Control · Neel Nanda · Chris Olah

Organizations

OpenAI · Google DeepMind

Concepts

Alignment Interpretability Overview · Dense Transformers

Key Debates

AI Alignment Research Agendas · Technical AI Safety Research