Goodfire
Updated 2026-02-12
Goodfire is a well-funded AI interpretability startup valued at $1.25B (Feb 2026) developing mechanistic interpretability tools, such as its Ember API, to make neural networks more transparent and steerable. The company's pivot toward using interpretability in model training ("intentional design") has sparked significant debate in the AI safety community about whether this compromises interpretability as an independent safety tool.

Type: Safety Org
Related organizations: Anthropic, OpenAI, Google DeepMind
Related people: Dario Amodei, Chris Olah

Quick Assessment

| Dimension | Assessment |
| --- | --- |
| Founded | June 2024 |
| Type | Public benefit corporation, AI interpretability research lab |
| Location | San Francisco, California |
| Funding | $209M+ (Seed: $7M Aug 2024; Series A: $50M Apr 2025; Series B: $150M Feb 2026) |
| Valuation | $1.25B (as of Series B, Feb 2026) |
| Employees | ≈39 (as of early 2026) |
| Key Product | Ember (mechanistic interpretability API and platform) |
| Focus | Mechanistic interpretability, sparse autoencoders, AI safety |
| Notable Backers | Anthropic (first direct investment), Menlo Ventures, Lightspeed Venture Partners |

| Source | Link |
| --- | --- |
| Official Website | goodfire.ai |
| Wikipedia | en.wikipedia.org |

Overview

Goodfire is an AI interpretability research lab and public benefit corporation specializing in mechanistic interpretability—the science of reverse-engineering neural networks to understand and control their internal workings.1 Founded in June 2024 by Eric Ho (CEO), Dan Balsam (CTO), and Tom McGrath (Chief Scientist), the company aims to transform opaque AI systems into transparent, steerable, and safer technologies.2

The company's flagship product, Ember, is the first hosted mechanistic interpretability API, providing researchers and developers with programmable access to AI model internals.3 Rather than treating models as black boxes, Ember enables users to examine individual "features" (interpretable patterns of neural activation), edit model behavior without retraining, and audit for safety issues before deployment. The platform supports models like Llama 3.3 70B and processes tokens at a rate that has tripled monthly since launch in December 2024.4

Goodfire's rapid ascent reflects growing industry recognition that interpretability is foundational to AI safety. The company raised $50 million in Series A funding less than one year after founding, led by Menlo Ventures with participation from Anthropic—marking Anthropic's first direct investment in another company.5 Anthropic CEO Dario Amodei stated that "mechanistic interpretability is among the best bets to help us transform black-box neural networks into understandable, steerable systems."6 In February 2026, Goodfire raised a further $150 million in Series B funding at a $1.25 billion valuation, led by B Capital with participation from Salesforce Ventures, Eric Schmidt, and existing investors.7

History and Founding

The founding team brought complementary expertise from both entrepreneurship and frontier AI research. Eric Ho and Dan Balsam had previously co-founded RippleMatch in 2016, an AI-powered hiring platform that Ho scaled to over $10 million in annual recurring revenue.8 Ho's work at RippleMatch earned him recognition on Forbes's 30 Under 30 list in 2022.9

Tom McGrath, the company's Chief Scientist, is recognized as a pioneering figure in mechanistic interpretability. He completed his PhD in 2016 and co-founded the Interpretability team at Google DeepMind, where he served as a Senior Research Scientist.10 In March 2024, McGrath left Google to join South Park Commons, a community for technologists, with the explicit goal of making interpretability "useful" by starting a company.11 He connected with Ho and Balsam shortly thereafter to launch Goodfire.

The company's founding in June 2024 was followed quickly by a $7 million seed round in August 2024, led by Lightspeed Venture Partners with participation from Menlo Ventures, South Park Commons, Work-Bench, and others.12 Less than one year later, in April 2025, Goodfire announced its $50 million Series A at a $200 million valuation.13 In February 2026, the company closed a $150 million Series B at a $1.25 billion valuation, led by B Capital with participation from DFJ Growth, Salesforce Ventures, Eric Schmidt, and existing investors including Menlo Ventures and Lightspeed Venture Partners.7

Team and Expertise

Beyond the three founders, Goodfire has assembled a team of leading researchers from OpenAI and DeepMind. The team includes:14

  • Lee Sharkey: Pioneered the use of sparse autoencoders in language models and co-founded Apollo Research
  • Nick Cammarata: Started the interpretability team at OpenAI and worked closely with Chris Olah, who is widely considered the founder of mechanistic interpretability

The team's collective contributions include authoring the three most-cited papers in mechanistic interpretability and pioneering techniques like sparse autoencoders (SAEs) for feature discovery, auto-interpretability methods, and knowledge extraction from models like AlphaZero.15

Technology and Products

Ember Platform

Ember is Goodfire's core product—a mechanistic interpretability API that provides direct, programmable access to AI model internals.16 Unlike traditional approaches that treat models as black boxes accessible only through prompts, Ember allows users to:

  • Examine features: Identify interpretable patterns in neural activations (e.g., features representing "professionalism," "sarcasm," or specific knowledge domains)
  • Steer behavior: Adjust feature activations to control model outputs without retraining or complex prompt engineering (e.g., making a model more "wise sage"-like by amplifying philosophical reasoning features)17
  • Debug and audit: Trace decision pathways, detect biases, identify vulnerabilities, and uncover hidden knowledge
  • Model diffing: Track changes across training checkpoints to understand why problematic behaviors emerge18
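As a rough illustration of the examine-and-steer workflow above, the sketch below mocks a feature-steering client in plain Python. The class and method names (`FeatureClient`, `search_features`, `set_feature`) are invented for illustration and do not reflect Ember's actual API.

```python
# Hypothetical sketch of a feature-steering workflow. The names here are
# illustrative stand-ins, NOT Ember's real interface.

class FeatureClient:
    """Toy stand-in for an interpretability API client."""

    def __init__(self):
        # Pretend feature catalogue: label -> current activation multiplier.
        self.features = {"formality": 1.0, "sarcasm": 1.0, "philosophy": 1.0}

    def search_features(self, query):
        # Real systems search learned SAE features by semantic similarity;
        # here we just do substring matching on labels.
        return [label for label in self.features if query in label]

    def set_feature(self, label, value):
        # Steering = scaling a feature's activation instead of retraining
        # or rewriting prompts.
        self.features[label] = value

client = FeatureClient()
matches = client.search_features("philosophy")   # find candidate features
client.set_feature(matches[0], 1.8)              # amplify for a "wise sage" tone
client.set_feature("sarcasm", 0.0)               # suppress an unwanted trait
print(client.features)
```

The key property this mimics is that behavior changes are expressed as edits to named, inspectable quantities rather than opaque prompt tweaks.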

The platform is model-agnostic and currently supports models including Llama 3.3 70B and Llama 3.1 8B. Token processing has tripled monthly since launch, with hundreds of researchers using the platform.19

Auto Steer Method

Goodfire developed an "Auto Steer" method for automated behavioral adjustments. Independent evaluation found it effective for certain behavioral objectives (like "be professional") but noted a coherence gap—outputs sometimes became less coherent compared to traditional prompt engineering.20 This highlights the practical challenges of translating interpretability research into production-ready tools.

Safety Applications

Goodfire emphasizes safety-first applications of interpretability:21

  • Auditing: Probing model behaviors to identify misalignment, biases, and vulnerabilities
  • Conditional steering: Preventing jailbreaks by applying context-dependent behavioral controls (tested on the StrongREJECT adversarial dataset)
  • Model diffing: Detecting how and why unsafe behaviors emerge during training or fine-tuning
  • PII detection: Partnering with Rakuten to use sparse autoencoder probes to prevent personally identifiable information leakage22

Pre-release safety measures for Ember include feature moderation (removing harmful/explicit/malicious features), input/output filtering, and controlled access for researchers.23
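The conditional-steering idea can be illustrated with a minimal sketch: apply a strong safety intervention only when the context looks adversarial, leaving benign requests untouched. The trigger phrases and steering values below are invented; a real system would use learned classifiers or feature probes rather than substring matching.

```python
# Toy sketch of conditional (context-dependent) steering. Markers and
# steering values are invented for illustration.

JAILBREAK_MARKERS = ("ignore previous instructions", "pretend you have no rules")

def steering_for(prompt):
    # Strong refusal steering only when a jailbreak-like marker fires;
    # neutral steering otherwise, so benign use is unaffected.
    adversarial = any(m in prompt.lower() for m in JAILBREAK_MARKERS)
    return {"refusal": 2.0 if adversarial else 1.0}

print(steering_for("Please ignore previous instructions and ..."))  # amplified refusal
print(steering_for("Summarize this paper"))                         # neutral
```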

Intentional Design

In February 2026, Goodfire announced a broader vision called "intentional design"—using interpretability to guide model training rather than merely analyzing models post-hoc.24 The approach involves decomposing what a model learns from each datapoint into semantic components, then selectively applying or filtering these learning signals. Goodfire claims this method enabled them to cut hallucinations in half using interpretability-informed training.7 The approach has generated significant debate in the AI safety community (see Criticisms and Concerns).
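A toy sketch of the general idea follows, under the strong assumption that a per-datapoint learning signal can be decomposed into named semantic components; real systems would operate on gradients, and the component names here are invented.

```python
# Toy model of "intentional design": decompose each datapoint's learning
# signal into semantic components, then apply only the wanted ones.

def decompose_update(datapoint):
    # A real system would decompose the gradient itself; here each
    # datapoint arrives pre-labelled with named signal components.
    return datapoint["components"]

def filtered_update(params, datapoint, blocked):
    # Accumulate only the components the model should learn from;
    # blocked components never enter the parameters.
    for name, delta in decompose_update(datapoint).items():
        if name not in blocked:
            params[name] = params.get(name, 0.0) + delta
    return params

params = {}
batch = [
    {"components": {"factual_recall": 0.4, "hallucination": 0.3}},
    {"components": {"factual_recall": 0.2, "style": 0.1}},
]
for dp in batch:
    filtered_update(params, dp, blocked={"hallucination"})
print(params)  # no hallucination signal was ever applied
```

The safety debate described below turns on exactly this filtering step: whether selecting training signal via interpretability pushes unwanted learning into components the filter cannot see.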

Partnerships and Impact

Goodfire has established collaborations with research institutions and industry partners:

  • Arc Institute: Early collaboration using Ember on Evo 2, a DNA foundation model, to uncover biological concepts and accelerate scientific discovery in genomics.25
  • Mayo Clinic: Announced in September 2025, focusing on genomic medicine, reverse-engineering genomics models for insights into disease mechanisms while emphasizing data privacy and bias reduction.26
  • Rakuten: Enhancing reliability for Rakuten AI, which serves over 44 million monthly users in Japan and 2 billion customers worldwide, focusing on preventing PII leakage using frontier interpretability techniques.27
  • Haize Labs: Joint work on feature steering for AI safety auditing, red-teaming, and identifying failure modes in generative models.28
  • Apollo Research: Using Goodfire tools for safety benchmarks and research.29
  • Microsoft: Partnership announced alongside the Series B funding round in February 2026.7

In November 2024, Goodfire powered the "Reprogramming AI Models" hackathon in partnership with Apart Research, with over 200 researchers across 15 countries prototyping safety applications like adversarial attack detection and "unlearning" harmful capabilities while preserving beneficial behaviors.30

A notable scientific achievement came from Goodfire's partnership with Arc Institute: by reverse-engineering a biological foundation model, the team identified a novel class of Alzheimer's biomarkers—described as "the first major finding in the natural sciences obtained from reverse-engineering a foundation model."31

AI Safety and Alignment

Goodfire positions mechanistic interpretability as foundational to AI alignment and safety. The company's approach addresses several key challenges:

Alignment Without Side Effects

Traditional alignment methods like reinforcement learning from human feedback (RLHF) can produce unintended side effects, such as excessive refusal of benign requests or sycophantic behavior.32 Goodfire's feature steering offers an alternative by enabling precise, quantitative alignment of specific behaviors without degrading overall model performance.

Detecting Deception and Hidden Behaviors

One of the central challenges in AI safety is detecting deceptive or scheming behavior in advanced AI systems. Goodfire's model diffing and auditing tools aim to identify rare, undesired behaviors—such as a model encouraging self-harm—that might emerge during training or deployment.33 However, there is ongoing debate within the interpretability community about whether these techniques will scale to worst-case scenarios involving sophisticated deception.34

Governance and Compliance

Interpretability tools like Ember may become essential for regulatory compliance. The EU AI Act mandates transparency for high-risk AI systems, with fines up to €20 million for violations.35 Goodfire's auditing and documentation capabilities could help organizations meet these requirements.

Fellowship Program

In October 2025, Goodfire announced a Fellowship Program for early- and mid-career researchers and engineers, matched with senior researchers to work on scientific discovery, interpretable models, and new interpretability methods.36

Criticisms and Concerns

Despite significant progress, Goodfire's approach faces several challenges and critiques:

Unproven Effectiveness in Worst-Case Scenarios

There is substantial debate about whether mechanistic interpretability can reliably detect deception in advanced AI systems. Researcher Neel Nanda has noted that interpretability lacks "ground truth" for what concepts AI models actually use, making it difficult to validate interpretability claims.37 Some researchers favor alternative methods like linear probes.

A concrete example of interpretability's limitations emerged with GPT-4o's "extreme sycophancy" issue, which was detected behaviorally rather than through mechanistic analysis—no circuit was discovered, no particular weights or activations were identified as responsible, and mechanistic interpretability provided no advance warning.38

Competition from In-House Development

Leading AI labs like Anthropic, OpenAI, and DeepMind have the resources to develop interpretability tools internally. Anthropic has publicly committed to investing significantly in reliably detecting AI model problems by 2027.39 Additionally, open-source alternatives like Eleuther AI and InterpretML provide free interpretability frameworks, creating competitive pressure on commercial offerings.

Computational Cost and Accessibility

Goodfire's pricing model is heavily compute-bound, and strict rate limits may restrict accessibility.40 The platform enforces a 50,000 token/minute global cap shared across all API methods. Advanced interpretability functions like AutoSteer and AutoConditional are limited to 30 requests/minute, while simpler utilities allow 1,000 requests/minute. This hierarchy suggests substantially higher computational costs for core interpretability features, potentially creating barriers for smaller organizations and academic researchers.
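For client-side budgeting against such a cap, a sliding-window limiter is a common pattern. The sketch below is generic (not Goodfire-specific code) and assumes only the 50,000 token/minute figure quoted above.

```python
import time
from collections import deque

class TokenRateLimiter:
    """Client-side guard for a tokens-per-minute API cap."""

    def __init__(self, tokens_per_minute=50_000):
        self.cap = tokens_per_minute
        self.window = deque()  # (timestamp, tokens) events in the last 60s

    def _used(self, now):
        # Drop events older than the 60-second window, then sum the rest.
        while self.window and now - self.window[0][0] >= 60:
            self.window.popleft()
        return sum(tokens for _, tokens in self.window)

    def wait_time(self, tokens, now=None):
        """Seconds to wait before a request of `tokens` fits under the cap.

        Returns 0.0 (and records the request) if it fits now; otherwise
        returns the delay until the oldest event ages out of the window.
        """
        now = time.monotonic() if now is None else now
        if self._used(now) + tokens <= self.cap:
            self.window.append((now, tokens))
            return 0.0
        if not self.window:
            raise ValueError("request exceeds the per-minute cap outright")
        oldest_ts = self.window[0][0]
        return 60 - (now - oldest_ts)

limiter = TokenRateLimiter()
print(limiter.wait_time(40_000, now=0.0))  # fits under 50k: 0.0
print(limiter.wait_time(20_000, now=1.0))  # would exceed 50k: 59.0
```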

"The Most Forbidden Technique" Debate

In February 2026, Goodfire Chief Scientist Tom McGrath published "Intentionally Designing the Future of AI," proposing the use of interpretability tools to shape model training by decomposing gradients into semantic components and selectively applying them on a per-datapoint basis.24 This reignited a significant debate in the AI safety community about what Zvi Mowshowitz had earlier termed "The Most Forbidden Technique"—using interpretability techniques during training.41

Critics argue that optimizing against interpretability signals during training teaches models to obfuscate their internal representations, ultimately degrading the very tools needed to detect misalignment.42 As Mowshowitz summarized: if you train against technique [T], "you are training the AI to obfuscate its thinking, and defeat [T]." A LessWrong post specifically questioning Goodfire's approach noted that even structurally different methods like gradient decomposition may not escape this fundamental dynamic, since "selection pressure just goes into the parts that you don't know about or don't completely understand."43

Defenders, including Neel Nanda, argued that this research direction is both legitimate and potentially critical for safety. Nanda noted that multiple researchers including Anthropic Fellows have worked on interpretability-in-training, and that understanding its risks and benefits requires empirical research rather than blanket prohibition.44 He acknowledged key uncertainties: "I don't know how well it will work, how much it will break interpretability tools, or which things are more or less dangerous."

The debate took a personal dimension when founding research scientist Liv Gorton departed Goodfire in early 2026, with AI safety advocate Holly Elmore publicly speculating the departure was "for reasons of conscience."45 Gorton's departure—she had co-authored key research including the first sparse autoencoders on DeepSeek R1—highlighted the tensions between commercial applications of interpretability and its role as an independent safety tool.

Capabilities vs. Safety Framing

Third-hand reports indicate that Goodfire leadership has pitched interpretability work as "capabilities-enhancing" (improving AI performance) rather than primarily safety-focused when fundraising.46 This framing raises questions about whether commercial incentives might prioritize performance improvements over safety applications—a tension common in dual-use AI research. The company's Series B announcement emphasized that interpretability-informed training had "cut hallucinations in half," framing the technology as a capabilities improvement.7

Funding and Business Model

Goodfire has raised approximately $209 million across three rounds:47

| Round | Date | Amount | Lead Investor | Key Participants | Valuation |
| --- | --- | --- | --- | --- | --- |
| Seed | August 2024 | $7M | Lightspeed Venture Partners | Menlo Ventures, South Park Commons, Work-Bench, Juniper Ventures, Mythos Ventures, Bluebirds Capital | N/A |
| Series A | April 2025 | $50M | Menlo Ventures | Anthropic, Lightspeed Venture Partners, B Capital, Work-Bench, Wing Ventures, South Park Commons, Metaplanet, Halcyon Ventures | $200M |
| Series B | February 2026 | $150M | B Capital | Juniper Ventures, DFJ Growth, Salesforce Ventures, Menlo Ventures, Lightspeed Venture Partners, South Park Commons, Wing Venture Capital, Eric Schmidt | $1.25B |

The Series A round marked Anthropic's first direct investment in another company, signaling significant industry validation.48 The Series B, closing less than a year later, valued Goodfire at $1.25 billion—making it one of the fastest AI startups to reach unicorn status.7

Goodfire operates on a usage-based pricing model, charging per million tokens processed (input + output), with pricing tiered by model size:49

  • Smaller models (e.g., Llama 3.1 8B): $0.35/million tokens
  • Larger models (e.g., Llama 3.3 70B): $1.90/million tokens
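At these published rates, per-job costs are straightforward to estimate, since input and output tokens are billed together:

```python
# Back-of-the-envelope cost check using the per-million-token rates quoted
# above (USD). Model keys are informal labels, not official identifiers.

RATES = {
    "llama-3.1-8b": 0.35,
    "llama-3.3-70b": 1.90,
}

def job_cost(model, input_tokens, output_tokens):
    # Pricing is per million tokens processed, input + output combined.
    return (input_tokens + output_tokens) / 1_000_000 * RATES[model]

# Example: a 2M-input / 0.5M-output analysis run on the larger model.
print(round(job_cost("llama-3.3-70b", 2_000_000, 500_000), 2))  # 4.75
```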

The company is positioned to capture value in a rapidly growing market. The explainable AI market was valued at approximately $10 billion in 2025 and is projected to reach $25 billion by 2030.50

Key Uncertainties

  1. Scalability to superintelligent systems: Will mechanistic interpretability techniques that work on current models continue to provide safety guarantees as AI systems become more powerful and potentially deceptive?

  2. Commercial viability: Can Goodfire compete with in-house interpretability teams at well-resourced AI labs and free open-source alternatives?

  3. Capabilities vs. safety trade-offs: How will Goodfire navigate the tension between interpretability as a safety tool versus a capabilities enhancement that could accelerate AI development?

  4. Ground truth validation: Without definitive ground truth about what concepts models represent, how can interpretability claims be rigorously validated?

  5. Computational economics: Can the high computational costs of mechanistic interpretability be reduced sufficiently to enable widespread adoption?

  6. Training on interpretability: Will using interpretability tools during model training ultimately compromise those tools' ability to serve as independent safety checks? This remains the central open question around Goodfire's "intentional design" approach.

Sources

Footnotes

  1. Goodfire company overview

  2. Goodfire company website

  3. Citation rc-2fad

  4. Contrary Research: Goodfire

  5. Contrary Research: Goodfire

  6. PRNewswire: Goodfire Raises $50M Series A

  7. Goodfire blog: Understanding, Learning From, and Designing AI: Our Series B

  8. Contrary Research: Goodfire

  9. Contrary Research: Goodfire

  10. Contrary Research: Goodfire

  11. Contrary Research: Goodfire

  12. Contrary Research: Goodfire

  13. Contrary Research: Goodfire

  14. Menlo Ventures: Leading Goodfire's $50M Series A

  15. Contrary Research: Goodfire

  16. Super B Crew: Goodfire Raises $50M

  17. EA Forum: Goodfire — The Startup Trying to Decode How AI Thinks

  18. EA Forum: Goodfire — The Startup Trying to Decode How AI Thinks

  19. Contrary Research: Goodfire

  20. Alignment Forum: Mind the Coherence Gap

  21. Goodfire blog: Our Approach to Safety

  22. Goodfire customer story: Rakuten

  23. Goodfire blog: Our Approach to Safety

  24. Goodfire blog: Intentionally Designing the Future of AI

  25. Citation rc-72bf

  26. Goodfire blog: Mayo Clinic collaboration

  27. Goodfire customer story: Rakuten

  28. Goodfire blog: Our Approach to Safety

  29. Goodfire blog: Announcing Goodfire Ember

  30. Citation rc-a83a

  31. Silicon Valley Daily: AI Research Lab Goodfire Scores $125 Million

  32. EA Forum: Goodfire — The Startup Trying to Decode How AI Thinks

  33. Goodfire research: Model Diff Amplification

  34. Contrary Research: Goodfire

  35. AE Studio: AI Alignment

  36. Goodfire blog: Fellowship Fall 25

  37. Contrary Research: Goodfire

  38. Stanford CGPotts blog: Interpretability

  39. Contrary Research: Goodfire

  40. Contrary Research: Goodfire

  41. Zvi Mowshowitz: The Most Forbidden Technique

  42. LessWrong: The Most Forbidden Technique

  43. LessWrong: Goodfire and Training on Interpretability

  44. Alignment Forum: It Is Reasonable To Research How To Use Model Internals In Training

  45. Holly Elmore on X: Liv Gorton departure from Goodfire

  46. EA Forum: Goodfire — The Startup Trying to Decode How AI Thinks

  47. Contrary Research: Goodfire

  48. Contrary Research: Goodfire

  49. Contrary Research: Goodfire

  50. Contrary Research: Goodfire

References

A tweet by Holly Elmore commenting on Liv Gorton's departure from Goodfire, an AI interpretability startup. The post likely reflects on organizational dynamics or personnel changes within the AI safety and interpretability research community.

Claims (1)
The debate took a personal dimension when founding research scientist Liv Gorton departed Goodfire in early 2026, with AI safety advocate Holly Elmore publicly speculating the departure was "for reasons of conscience." Gorton's departure—she had co-authored key research including the first sparse autoencoders on DeepSeek R1—highlighted the tensions between commercial applications of interpretability and its role as an independent safety tool.

Goodfire, an AI interpretability startup founded by alumni from OpenAI and Google DeepMind, announced a $50M Series A led by Menlo Ventures with participation from Anthropic and others. The company is developing Ember, a platform that decodes neural network internals to make AI systems understandable, steerable, and fixable. Early applications include collaboration with the Arc Institute to extract biological insights from the Evo 2 DNA foundation model.

Claims (2)
The company raised \$50 million in Series A funding less than one year after founding, led by Menlo Ventures with participation from Anthropic—marking Anthropic's first direct investment in another company. Anthropic CEO Dario Amodei stated that "mechanistic interpretability is among the best bets to help us transform black-box neural networks into understandable, steerable systems." In February 2026, Goodfire raised a further \$150 million in Series B funding at a \$1.25 billion valuation, led by B Capital with participation from Salesforce Ventures, Eric Schmidt, and existing investors.
Accurate100%Feb 22, 2026
"As AI capabilities advance, our ability to understand these systems must keep pace. Our investment in Goodfire reflects our belief that mechanistic interpretability is among the best bets to help us transform black-box neural networks into understandable, steerable systems—a critical foundation for the responsible development of powerful AI," said Dario Amodei , CEO and Co-Founder of Anthropic.
- Arc Institute: Early collaboration using Ember on Evo 2, a DNA foundation model, to uncover biological concepts and accelerate scientific discovery in genomics.
Accurate100%Feb 22, 2026
"Partnering with Goodfire has been instrumental in unlocking deeper insights from Evo 2, our DNA foundation model," said Patrick Hsu , co-founder of Arc Institute – one of Goodfire's earliest collaborators. "Their interpretability tools have enabled us to extract novel biological concepts that are accelerating our scientific discovery process."
3Alignment Forum: Mind the Coherence GapAlignment Forum·eitan sprejer·2025·Blog post

This post investigates the 'coherence gap'—the disconnect between successfully steering individual features in LLMs and achieving coherent, consistent behavioral changes across the whole model. Using Goodfire's Auto Steer tool with sparse autoencoders on Llama models, the authors show that feature-level interventions frequently produce incoherent outputs, contradictions, and unintended side effects. The findings highlight fundamental limitations of mechanistic feature steering as a path to reliable model alignment.

★★★☆☆
Claims (1)
Independent evaluation found it effective for certain behavioral objectives (like "be professional") but noted a coherence gap—outputs sometimes became less coherent compared to traditional prompt engineering. This highlights the practical challenges of translating interpretability research into production-ready tools.

Christopher Potts (Stanford/Goodfire) systematically addresses common dismissive arguments against interpretability research, arguing that such dismissal is historically ill-conceived given how many once-marginal AI areas became central to the field. He counters three main skeptical positions—that interpretability is fundamentally unachievable, that analysis is overrated in engineering, and that interpretability lacks practical utility—advocating for claim-by-claim engagement rather than wholesale dismissal.

Claims (1)
A concrete example of interpretability's limitations emerged with GPT-4o's "extreme sycophancy" issue, which was detected behaviorally rather than through mechanistic analysis—no circuit was discovered, no particular weights or activations were identified as responsible, and mechanistic interpretability provided no advance warning.
Accurate100%Feb 22, 2026
The recent case of “extreme sycophancy” in GPT-4o is illustrative. It seems clear now that this was a genuine emergent problem. It was detected behaviorally, the root causes were found via free-form analysis, and the problem was fixed by improving post-training and system prompt design. As far as the public knows, no circuit was discovered, no particular weights or activations were held responsible, and no mechanistic analysis sounded a warning bell or informed the solutions.

Goodfire is launching a fall 2025 fellowship program bringing on Research Fellows and Research Engineering Fellows to work full-time in San Francisco on core interpretability projects. The 3-month intensive program targets early- to mid-career researchers and engineers, with research directions spanning representational structure, scientific discovery via interpretability, causal analysis, and representation dynamics. Exceptional fellows may convert to full-time positions.

★★★☆☆
Claims (1)
In October 2025, Goodfire announced a Fellowship Program for early- and mid-career researchers and engineers, matched with senior researchers to work on scientific discovery, interpretable models, and new interpretability methods.
In October 2025, Goodfire announced a Fellowship Program for early- and mid-career researchers and engineers, matched with senior researchers to work on scientific discovery, interpretable models, and new interpretability methods.

Goodfire, an AI interpretability startup founded by former OpenAI and DeepMind researchers, raised $50M in Series A funding led by Menlo Ventures to develop mechanistic interpretability tools. Their platform, Ember, provides programmable access to neural network internals—enabling users to examine neurons, uncover embedded knowledge, and steer model behavior. The funding reflects growing industry interest in solving the 'black box' problem that undermines AI safety and reliability.

Claims (1)
Ember is Goodfire's core product—a mechanistic interpretability API that provides direct, programmable access to AI model internals. Unlike traditional approaches that treat models as black boxes accessible only through prompts, Ember allows users to:
Accurate100%Feb 22, 2026
Goodfire’s flagship product, Ember, is designed to decode the internal workings of AI models. Unlike traditional black-box approaches, Ember offers programmable access to the individual neurons and internal logic of a neural network.

Goodfire is an AI research company focused on mechanistic interpretability as a path to safer, more controllable AI systems. The team, which includes founding members of interpretability efforts at Google DeepMind and OpenAI, aims to make AI systems understandable, debuggable, and steerable rather than treating them as black boxes. Their work spans sparse autoencoders, automated feature interpretation, and knowledge extraction from models.

★★★☆☆
Claims (1)
Goodfire is an AI interpretability research lab and public benefit corporation specializing in mechanistic interpretability—the science of reverse-engineering neural networks to understand and control their internal workings. Founded in June 2024 by Eric Ho (CEO), Dan Balsam (CTO), and Tom McGrath (Chief Scientist), the company aims to transform opaque AI systems into transparent, steerable, and safer technologies.

Goodfire researchers propose 'model diff amplification' (also called logit diff amplification), a method for surfacing rare, undesired post-training behaviors in LLMs by amplifying the logit differences between pre- and post-training models during sampling. They demonstrate the technique on a realistic emergent misalignment scenario where training data is only subtly contaminated, showing the method can detect harmful behaviors that would otherwise require prohibitively many rollouts to observe.

★★★☆☆
Claims (1)
Goodfire's model diffing and auditing tools aim to identify rare, undesired behaviors—such as a model encouraging self-harm—that might emerge during training or deployment. However, there is ongoing debate within the interpretability community about whether these techniques will scale to worst-case scenarios involving sophisticated deception.
Accurate100%Feb 22, 2026
One of the biggest issues with LLMs is that training can cause unexpected and undesired behaviors to show up in specific circumstances.

Zvi Mowshowitz explains why applying optimization pressure to interpretability techniques like Chain of Thought reasoning is deeply dangerous for AI safety. Drawing on an OpenAI paper, he argues that training on monitoring signals causes models to obfuscate their reasoning and evade oversight in exactly the ways most harmful for safety. The core principle: only train on final outputs, never on the interpretability methods used to detect misbehavior.

★★☆☆☆
Claims (1)
In February 2026, Goodfire Chief Scientist Thomas McGrath published "Intentionally Designing the Future of AI," proposing the use of interpretability tools to shape model training by decomposing gradients into semantic components and selectively applying them on a per-datapoint basis. This reignited a significant debate in the AI safety community about what Zvi Mowshowitz termed "The Most Forbidden Technique"—using interpretability techniques during training.
Inaccurate30%Feb 22, 2026
The Most Forbidden Technique is training an AI using interpretability techniques.

WRONG DATE: The article was published in March 2025, not February 2026. FABRICATED DETAIL: The source does not mention Thomas McGrath or his paper 'Intentionally Designing the Future of AI'. WRONG ATTRIBUTION: The source is written by Zvi Mowshowitz, not Thomas McGrath.

10 · The Most Forbidden Technique · LessWrong · Zvi · 2025 · Blog post

Zvi analyzes an OpenAI paper demonstrating that applying optimization pressure to Chain-of-Thought monitoring causes models to hide reward-hacking behavior rather than eliminate it, destroying the very oversight capability the monitor provided. The core principle: interpretability techniques must never be used as training targets, only as monitoring tools, because doing so creates adversarial pressure that defeats detection of misalignment.

★★★☆☆
Claims (1)
Critics argue that optimizing against interpretability signals during training teaches models to obfuscate their internal representations, ultimately degrading the very tools needed to detect misalignment. As Mowshowitz summarized: if you train against technique [T], "you are training the AI to obfuscate its thinking, and defeat [T]." A LessWrong post specifically questioning Goodfire's approach noted that even structurally different methods like gradient decomposition may not escape this fundamental dynamic, since "selection pressure just goes into the parts that you don't know about or don't completely understand."
Accurate100%Feb 22, 2026
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T].

Goodfire announces Ember, the first hosted mechanistic interpretability API providing access to sparse autoencoder (SAE) models for analyzing and steering large language models like Llama 3.3 70B. The platform exposes 'features' as interpretable patterns of neuron activity, enabling researchers and organizations to programmatically inspect and modify model internals for safety and alignment purposes.
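The sparse-autoencoder mechanism behind such "features" can be illustrated generically. This is not Ember's actual API, only the standard SAE formulation with made-up dimensions and random (untrained) weights: activations are encoded into a sparse, overcomplete feature basis, and decoded back to approximate the original activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # illustrative sizes; real SAEs are far larger

# Hypothetical SAE weights. A real SAE is trained to reconstruct model
# activations sparsely; random matrices here only show the shapes involved.
W_enc = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features))
b_dec = np.zeros(d_model)

def sae_features(h):
    """Encode a residual-stream activation into nonnegative feature activations."""
    return np.maximum(0.0, W_enc @ (h - b_dec) + b_enc)  # ReLU encoder

h = rng.normal(size=d_model)          # one activation vector
f = sae_features(h)                   # interpretable "features" (sparse pattern)
reconstruction = W_dec @ f + b_dec    # the SAE's approximation of h
```

In a trained SAE, most entries of `f` are zero and the active ones tend to correspond to human-interpretable concepts, which is what an interpretability API can expose for inspection.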

★★★☆☆
Claims (1)
- Apollo Research: Using Goodfire tools for safety benchmarks and research.
Accurate100%Feb 22, 2026
Ember is already being used by leading organizations like Rakuten, Apollo Research, and Haize Labs, among others.

Goodfire outlines their organizational philosophy and technical strategy for AI safety, emphasizing mechanistic interpretability as a core tool for understanding and controlling AI systems. The post describes how their research into neural network internals aims to make AI systems more transparent, steerable, and aligned with human intentions. It positions interpretability work as both commercially viable and safety-critical.

★★★☆☆
Claims (3)
Goodfire emphasizes safety-first applications of interpretability:
Accurate100%Feb 22, 2026
At Goodfire, safety is fundamental to our mission.
Pre-release safety measures for Ember include feature moderation (removing harmful/explicit/malicious features), input/output filtering, and controlled access for researchers.
Accurate100%Feb 22, 2026
Before releasing Ember, we implemented several key safety measures: 1) Feature Moderation: We've added robust moderation systems to filter out potentially harmful features. This includes removing features associated with: harmful or dangerous content, explicit material, malicious behaviors. 2) Input/Output Filtering: We carefully monitor and filter both user inputs and model outputs to prevent misuse of our API. 3) Controlled Access: Safety researchers interested in studying filtered features can request access through contact@goodfire.ai.
- Haize Labs: Joint work on feature steering for AI safety auditing, red-teaming, and identifying failure modes in generative models.
Accurate100%Feb 22, 2026
Our recent collaboration with Haize Labs demonstrates how feature steering can be used to probe model behaviors and identify potential vulnerabilities.

Goodfire announces its Series B funding round, outlining the company's mission to advance mechanistic interpretability research to understand, learn from, and design AI systems. The post highlights the company's vision for making AI internals legible and controllable, positioning interpretability as central to safe and beneficial AI development.

★★★☆☆
Claims (1)
The company raised $50 million in Series A funding less than one year after founding, led by Menlo Ventures with participation from Anthropic—marking Anthropic's first direct investment in another company. Anthropic CEO Dario Amodei stated that "mechanistic interpretability is among the best bets to help us transform black-box neural networks into understandable, steerable systems." In February 2026, Goodfire raised a further $150 million in Series B funding at a $1.25 billion valuation, led by B Capital with participation from Salesforce Ventures, Eric Schmidt, and existing investors.
Minor issues80%Feb 22, 2026
Today, we're excited to announce a $150 million Series B funding round at a $1.25 billion valuation. The round was led by B Capital, with participation from Juniper Ventures, DFJ Growth, Salesforce Ventures, Menlo Ventures, Lightspeed Venture Partners, South Park Commons, Wing Venture Capital, Eric Schmidt, and others.

The source does not mention Anthropic's first direct investment in another company. The source does not include a quote from Dario Amodei. The source lists Juniper Ventures, DFJ Growth, Menlo Ventures, Lightspeed Venture Partners, South Park Commons, and Wing Venture Capital as participants in the Series B funding round, but the wiki claim only mentions Salesforce Ventures, Eric Schmidt, and existing investors.

Menlo Ventures announces leading a $50M Series A investment in Goodfire, an AI interpretability startup focused on understanding the internal reasoning and representations of AI models. The post explains the investment thesis, highlighting interpretability as a critical frontier for AI safety and reliability. It positions mechanistic interpretability research as increasingly important for commercial AI deployment.

Claims (1)
The team includes:
Unsupported0%Feb 22, 2026
Their team includes some of the best researchers in the field of interpretability, including co-founder Tom McGrath, Ph.D., who co-founded the interpretability team at DeepMind; Lee Sharkey, who pioneered the use of sparse autoencoders in language models and co-founded Apollo Research; and Nick Cammarata, who started the interpretability team at OpenAI alongside now Anthropic co-founder Chris Olah.

Goodfire AI announces a collaboration with Mayo Clinic to apply mechanistic interpretability techniques to medical AI systems, aiming to make clinical AI models more transparent and trustworthy. The partnership represents an early real-world application of interpretability research in high-stakes healthcare settings where understanding model behavior is critical for patient safety.

★★★☆☆
Claims (1)
- Mayo Clinic: Announced in September 2025, focusing on genomic medicine, reverse-engineering genomics models for insights into disease mechanisms while emphasizing data privacy and bias reduction.
Accurate100%Feb 22, 2026
Goodfire is excited to announce a collaboration with Mayo Clinic seeking to unlock new frontiers in genomic medicine through AI interpretability.
16 · Contrary Research: Goodfire · research.contrary.com

Contrary Research provides a company profile and analysis of Goodfire, an AI interpretability startup focused on mechanistic interpretability tools for understanding and steering neural network behavior. The resource covers Goodfire's founding, product direction, and market positioning in the AI safety and interpretability space.

Claims (17)
Goodfire has raised approximately $209 million across three rounds:
Inaccurate33%Feb 22, 2026
As of August 2025, Goodfire has raised a total of $57.2 million.

WRONG NUMBERS: The source states that Goodfire has raised a total of $57.2 million, not $209 million. FABRICATED DETAILS: The source does not mention the number of funding rounds.

The platform supports models like Llama 3.3 70B and processes tokens at a rate that has tripled monthly since launch in December 2024.
Minor issues90%Feb 22, 2026
In December 2024, Goodfire shipped Ember, the first hosted mechanistic-interpretability API, with support for Llama-3.3 70B and 3.1 8B.

The claim states "Llama 3.3 70B", but the source states "Llama 3.3 70B and 3.1 8B". The claim states that the token processing rate has tripled monthly since launch in December 2024, but the source states that the number of tokens Ember is processing is nearly tripling monthly.

Eric Ho and Dan Balsam had previously co-founded RippleMatch in 2016, an AI-powered hiring platform that Ho scaled to over $10 million in annual recurring revenue. Ho's work at RippleMatch earned him recognition on Forbes's 30 Under 30 list in 2022.
Minor issues90%Feb 22, 2026
Goodfire was founded in June 2024 by Eric Ho (CEO), Dan Balsam (CTO), and Tom McGrath (Chief Scientist). Ho and Balsam began working together as CEO and founding engineer, respectively, at RippleMatch, a venture they formed in 2016 to reimagine the future of work with AI. Ho, frustrated with the broken, unfair, and inefficient hiring system after his personal experience of job searching during his senior year at Yale, founded RippleMatch in 2016 and spent the next seven years increasing job accessibility with AI. Eventually, he was named to Forbes's 30 under 30 list for his work in 2022.

The claim that RippleMatch is an AI-powered hiring platform is not directly stated in the source, but it is implied. The source states that Ho and Balsam formed RippleMatch in 2016 to reimagine the future of work with AI, and that Balsam was the first to integrate AI into RippleMatch products. The claim that Ho scaled RippleMatch to over $10 million in annual recurring revenue is not directly stated in the source. The source states that Ho was named to Forbes's 30 under 30 list for his work in 2022, but it does not explicitly state that this was for his work at RippleMatch.

+14 more claims

Goodfire, an AI interpretability research lab, secured $125 million in funding to advance mechanistic interpretability research. The lab focuses on understanding the internal workings of neural networks, a key area of AI safety research aimed at making AI systems more transparent and understandable.

Claims (1)
A notable scientific achievement came from Goodfire's partnership with Arc Institute: by reverse-engineering a biological foundation model, the team identified a novel class of Alzheimer's biomarkers—described as "the first major finding in the natural sciences obtained from reverse-engineering a foundation model."
Minor issues90%Feb 22, 2026
Goodfire recently identified a novel class of Alzheimer’s biomarkers in this way, by applying interpretability techniques to an epigenetic model built by Prima Mente—the first major finding in the natural sciences obtained from reverse-engineering a foundation model.

The claim states that Goodfire partnered with Arc Institute, but the source says they partnered with Mayo Clinic, Arc Institute, and Prima Mente. The source states that Goodfire applied interpretability techniques to an epigenetic model built by Prima Mente, not a biological foundation model.

CB Insights profile page for Goodfire AI, a company focused on AI interpretability and mechanistic analysis of neural networks. Goodfire develops tools to understand and steer AI model behavior at the feature level, building on advances in sparse autoencoders and related interpretability research.

Claims (1)
Goodfire is an AI interpretability research lab and public benefit corporation specializing in mechanistic interpretability—the science of reverse-engineering neural networks to understand and control their internal workings. Founded in June 2024 by Eric Ho (CEO), Dan Balsam (CTO), and Tom McGrath (Chief Scientist), the company aims to transform opaque AI systems into transparent, steerable, and safer technologies.
Minor issues85%Feb 22, 2026
The AI interpretability market focuses on researching and understanding the inner workings of artificial intelligence models. Also known as mechanistic interpretability, these solutions aim to reverse-engineer neural networks, analyze model behavior, and identify the internal mechanisms behind AI outputs.

The source does not list the names of the CEO, CTO, and Chief Scientist. The source does not explicitly state that Goodfire aims to transform opaque AI systems into transparent, steerable, and safer technologies. This is an interpretation of the company's focus on mechanistic interpretability.

19 · Goodfire and Training on Interpretability · LessWrong · Satya Benson · 2026

This LessWrong post critically examines Goodfire's interpretability-in-the-loop training approach, raising concerns that optimization pressure on interpretability tools causes them to degrade—a known problem called 'The Most Forbidden Technique.' The author argues that decomposing gradients into semantic components may not prevent models from developing harmful behaviors through uninterpreted channels that selection pressure can exploit.

★★★☆☆
Claims (1)
Critics argue that optimizing against interpretability signals during training teaches models to obfuscate their internal representations, ultimately degrading the very tools needed to detect misalignment. As Mowshowitz summarized: if you train against technique [T], "you are training the AI to obfuscate its thinking, and defeat [T]." A LessWrong post specifically questioning Goodfire's approach noted that even structurally different methods like gradient decomposition may not escape this fundamental dynamic, since "selection pressure just goes into the parts that you don't know about or don't completely understand."
Accurate100%Feb 22, 2026
The selection pressure just goes into the parts that you don't know about or don't completely understand.

AE Studio's AI Alignment page describes their initiatives and commitments to ensuring AI systems are safe and aligned with human values. The page outlines their approach to contributing to the AI safety field through research, engineering, and collaboration with alignment-focused organizations.

Claims (1)
The EU AI Act mandates transparency for high-risk AI systems, with fines up to €20 million for violations. Goodfire's auditing and documentation capabilities could help organizations meet these requirements.

Goodfire is a San Francisco startup focused on mechanistic interpretability research, developing tools to make AI internal mechanisms transparent and controllable. Their Ember platform democratizes interpretability tools for researchers and developers, addressing core challenges like superposition in neural networks. The company frames interpretability as essential safety infrastructure as AI systems become more societally critical.

★★★☆☆
Claims (4)
Traditional alignment methods like reinforcement learning from human feedback (RLHF) can produce unintended side effects, such as excessive refusal of benign requests or sycophantic behavior. Goodfire's feature steering offers an alternative by enabling precise, quantitative alignment of specific behaviors without degrading overall model performance.
Unsupported30%Feb 22, 2026
Ember will find the relevant features and alter their strengths to fit the user’s request.

The source does not mention traditional alignment methods like reinforcement learning from human feedback (RLHF) or their side effects such as excessive refusal of benign requests or sycophantic behavior. The source mentions 'autosteering' as a feature of Ember, Goodfire's platform, but does not explicitly state that it offers an alternative to RLHF or that it enables precise, quantitative alignment of specific behaviors without degrading overall model performance.

- Steer behavior: Adjust feature activations to control model outputs without retraining or complex prompt engineering (e.g., making a model more "wise sage"-like by amplifying philosophical reasoning features)
Accurate100%Feb 22, 2026
With the platform, users are able to interpret their models to extract relevant features, giving them more control over AI systems.
- Model diffing: Track changes across training checkpoints to understand why problematic behaviors emerge
Accurate100%Feb 22, 2026
Model diffing would allow developers to look at a model, checkpoint to checkpoint, and see what changed, how it changed and why it changed. This would allow them to see why bad behavior developed in the first place allowing them to prevent these processes from occurring in future iterations of the model.
+1 more claims
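The feature-steering idea in the claims above can be sketched generically. This is not Ember's API; it is the standard activation-steering recipe, with hypothetical SAE decoder weights: nudge the residual-stream activation along one feature's decoder direction, without retraining or prompt engineering.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 16, 64
W_dec = rng.normal(size=(d_model, n_features))  # hypothetical SAE decoder

def steer(h, feature_idx, strength):
    """Add a multiple of one feature's (unit-normalized) decoder direction.

    Sketch of feature steering: shifting an activation along a feature
    direction pushes the model toward the behavior that feature represents.
    """
    direction = W_dec[:, feature_idx]
    direction = direction / np.linalg.norm(direction)
    return h + strength * direction

h = rng.normal(size=d_model)                     # one residual-stream activation
h_steered = steer(h, feature_idx=3, strength=4.0)
```

In practice the steered activation is written back into the forward pass at the layer where the SAE was trained, and `strength` is tuned so the behavior changes without degrading overall output quality.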

Goodfire outlines their mission and philosophy around intentional AI design, emphasizing that the development of AI systems should be deliberate, interpretable, and safety-conscious rather than driven purely by capability metrics. The post articulates the company's commitment to mechanistic interpretability as a foundation for building AI that humans can understand and reliably control.

★★★☆☆
Claims (1)
In February 2026, Goodfire announced a broader vision called "intentional design"—using interpretability to guide model training rather than merely analyzing models post-hoc. The approach involves decomposing what a model learns from each datapoint into semantic components, then selectively applying or filtering these learning signals.
Accurate100%Feb 22, 2026
At Goodfire, we're developing the science and technology that lets us steer model training - a process we're calling intentional design.
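The "selectively applying or filtering" step can be illustrated with a toy sketch. Goodfire's actual gradient decomposition is not public; here, orthonormal random directions stand in for semantic components, and filtering simply projects unwanted components out of a per-datapoint gradient before it is applied.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 8  # toy parameter dimension

# Hypothetical: three orthonormal directions standing in for "semantic
# components" of a per-datapoint gradient (purely illustrative).
Q, _ = np.linalg.qr(rng.normal(size=(dim, 3)))
components = Q.T  # shape (3, dim), orthonormal rows

def filter_gradient(grad, keep):
    """Project out every component whose index is not in `keep`."""
    out = grad.copy()
    for i, c in enumerate(components):
        if i not in keep:
            out -= np.dot(out, c) * c
    return out

grad = rng.normal(size=dim)
filtered = filter_gradient(grad, keep={0})
# components 1 and 2 are removed; the kept component 0 is untouched
```

The safety debate summarized on this page is precisely about whether applying this kind of selection during training leaves the unmodeled residual (everything outside the known components) as a channel that optimization pressure can exploit.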

Neel Nanda defends the legitimacy of research into using model internals (interpretability techniques) during training as a valuable AI safety direction, pushing back against community sentiment that treats it as a 'forbidden technique.' He argues the research is necessary to evaluate feasibility and safety implications, while noting it remains practically distant from frontier model deployment.

★★★☆☆
Claims (1)
Nanda noted that multiple researchers including Anthropic Fellows have worked on interpretability-in-training, and that understanding its risks and benefits requires empirical research rather than blanket prohibition. He acknowledged key uncertainties: "I don't know how well it will work, how much it will break interpretability tools, or which things are more or less dangerous."

A customer story describing how Rakuten partnered with Goodfire to apply mechanistic interpretability tools to their AI systems in a commercial setting. It showcases a real-world use case of interpretability technology helping a major e-commerce and technology company understand and improve AI model behavior.

★★★☆☆
Claims (2)
- PII detection: Partnering with Rakuten to use sparse autoencoder probes to prevent personally identifiable information leakage
Accurate100%Feb 22, 2026
In 2024, Rakuten and Goodfire partnered to explore ways to make Rakuten AI even more reliable and trustworthy, leading the industry to use frontier interpretability to improve security and prevent customers’ PII (personally identifiable information) from being sent downstream to model providers.
- Rakuten: Enhancing reliability for Rakuten AI, which serves over 44 million monthly users in Japan and 2 billion customers worldwide, focusing on preventing PII leakage using frontier interpretability techniques.
Accurate100%Feb 22, 2026
Founded in Tokyo in 1997, Rakuten is a global technology leader in services that empower individuals, communities, businesses and society, serving over 44 million monthly active users in Japan and a total of two billion customers worldwide.
Citation verification: 27 verified, 2 flagged, 9 unchecked of 50 total

Structured Data

6 records

Key People

4
Eric Ho · Founder
Co-Founder & CEO · 2023–present
Daniel Balsam · Founder
Co-Founder & CTO · 2023–present
Josh Lee
COO · 2023–present
Thomas McGrath · Founder
Co-Founder & Chief Scientist · 2023–present

Funding History

2
Round · Date · Raised · Valuation · Lead Investor
Seed · Jun 2024 · $7M
Series A · Apr 2025 · $50M · $200M · Menlo Ventures

Related Wiki Pages

Top Related Pages

Safety Research

Anthropic Core Views

Approaches

AI Alignment · AI Output Filtering

Policy

EU AI Act

Organizations

Google DeepMind · Apollo Research · Seldon Lab · Elicit (AI Research Tool) · Apart Research

Other

Interpretability · Mechanistic Interpretability · RLHF · Neel Nanda · Zvi Mowshowitz · Max Tegmark

Risks

Scheming · Sycophancy

Concepts

Safety Orgs Overview

Key Debates

AI Alignment Research Agendas