Interpretability
Status: active

Research field focused on understanding how AI models work internally, encompassing all approaches to model transparency.
Organizations: 3
Grants: 115
Total Funding: $111M
Risks Addressed: 3
Tags: interpretability, safety-research
Organizations (3)
| Organization | Role |
|---|---|
| Anthropic | pioneer |
| Google DeepMind | active |
| Redwood Research | active |
Grants (Top 50 of 115)
| Name | Recipient | Amount | Funder | Date |
|---|---|---|---|---|
| MATS Research — General Support | MATS ML Alignment Theory Scholars program | $24M | Coefficient Giving | 2025-05 |
| Against Malaria Foundation — General Support (January 2017) | Against Malaria Foundation | $15M | Coefficient Giving | 2017-01 |
| Schistosomiasis Control Initiative — General Support (2017) | Schistosomiasis Control Initiative | $14M | Coefficient Giving | 2017-01 |
| Malaria Consortium — Seasonal Malaria Chemoprevention (February 2017) | Malaria Consortium | $5M | Coefficient Giving | 2017-02 |
| The END Fund — Deworming (December 2016) | The END Fund | $5M | Coefficient Giving | 2016-12 |
| Deworm the World Initiative — General Support (January 2017) | Evidence Action | $4.5M | Coefficient Giving | 2017-01 |
| MATS Research — MATS Program 6.0 + 7.0 | MATS ML Alignment Theory Scholars program | $3.4M | Coefficient Giving | 2024-06 |
| Sightsavers — Deworming (January 2017) | Sightsavers | $3M | Coefficient Giving | 2017-01 |
| Eleuther AI — Interpretability Research | Eleuther AI | $2.6M | Coefficient Giving | 2023-11 |
| Berkeley Existential Risk Initiative — MATS (November 2023) | Berkeley Existential Risk Initiative | $2.6M | Coefficient Giving | 2023-11 |
| GiveDirectly — General Support (January 2017) | GiveDirectly | $2.5M | Coefficient Giving | 2017-01 |
| AI Safety Support — MATS Program 6.0 + 7.0 | AI Safety Support | $2.4M | Coefficient Giving | 2024-05 |
| Apollo Research — General Support | Apollo Research | $2.2M | Coefficient Giving | 2024-05 |
| Apollo Research — Startup Funding | Apollo Research | $1.5M | Coefficient Giving | 2023-06 |
| Drug Policy Alliance — Drug Decriminalization | Drug Policy Alliance | $1.3M | Coefficient Giving | 2017-03 |
| University of California, Berkeley — PAVITRA Pollution Modeling Tool | University of California, Berkeley | $1.3M | Coefficient Giving | 2022-09 |
| Center for Popular Democracy — Fed Up Campaign (2018) | Center for Popular Democracy | $1.2M | Coefficient Giving | 2018-02 |
| UC Berkeley — AI Safety Research (2018) | University of California, Berkeley | $1.1M | Coefficient Giving | 2018-11 |
| Center for Popular Democracy — Fed Up Campaign (2017) | Center for Popular Democracy | $1.1M | Coefficient Giving | 2017-02 |
| Northeastern University — Large Language Model Interpretability Research (2024) | Northeastern University | $1.1M | Coefficient Giving | 2024-05 |
| Harvard University — AI Interpretability, Controllability, and Safety Research | Harvard University | $1M | Coefficient Giving | 2024-01 |
| theguardian.org — Journalism on Factory Farming and Animal Cruelty (2020) | theguardian.org | $900K | Coefficient Giving | 2020-01 |
| theguardian.org — Journalism on Factory Farming and Farm Animal Cruelty | theguardian.org | $887K | Coefficient Giving | 2017-12 |
| Guide Labs — Open Access Interpretability Project | Guide Labs | $750K | Coefficient Giving | 2023-08 |
| Stanford University — AI Interpretability Research | Stanford University | $744K | Coefficient Giving | 2025-04 |
| Practical AI Alignment and Interpretability Research Group — Interpretability Work | Practical AI Alignment and Interpretability Research Group | $737K | Coefficient Giving | 2024-09 |
| AI Safety Support — MATS Program (November 2023) | AI Safety Support | $733K | Coefficient Giving | 2023-11 |
| AI Safety Support — MATS Program Extension | AI Safety Support | $710K | Coefficient Giving | 2024-10 |
| MATS Research — AI Safety Research Expenses | MATS ML Alignment Theory Scholars program | $660K | Coefficient Giving | 2024-12 |
| Indian Institute of Technology Bombay — PAVITRA Pollution Modeling Tool | Indian Institute of Technology Bombay | $648K | Coefficient Giving | 2022-11 |
| Center for Popular Democracy — Fed Up Campaign (2019) | Center for Popular Democracy | $600K | Coefficient Giving | 2019-11 |
| Center for Study of Science, Technology and Policy — PAVITRA Pollution Modeling Tool | CSTEP | $565K | Coefficient Giving | 2022-09 |
| Northeastern University — Large Language Model Interpretability Research | Northeastern University | $562K | Coefficient Giving | 2022-11 |
| Alliance for Open Society International: Albanian Renaissance | Alliance for Open Society International | $500K | Coefficient Giving | 2014-07 |
| Center for Popular Democracy — Fed Up Campaign (2020) | Center for Popular Democracy | $465K | Coefficient Giving | 2020-06 |
| theguardian.org — Journalism on Factory Farming and Animal Cruelty (2021) | theguardian.org | $450K | Coefficient Giving | 2021-12 |
| University of Washington — PAVITRA Pollution Modeling Tool | University of Washington | $443K | Coefficient Giving | 2022-09 |
| The Kids Research Institute Australia — Malaria Vaccine Modelling | The Kids Research Institute Australia | $410K | Coefficient Giving | 2024-10 |
| Apollo Research: Scale up interpretability & behavioral model evals research | Apollo Research | $339K | Manifund | 2023-07-20 |
| Center for Popular Democracy Action Fund — Fed Up Campaign (2017) | Center for Popular Democracy Action Fund | $305K | Coefficient Giving | 2017-02 |
| Cooperation Canada — International Assistance Advocacy in Canada | Cooperation Canada | $300K | Coefficient Giving | 2023-01 |
| American Civil Liberties Union of Northern California — Prosecutorial Accountability | American Civil Liberties Union Foundation of Northern California | $250K | Coefficient Giving | 2017-02 |
| Leap Labs — Interpretability Research | Leap Labs | $230K | Coefficient Giving | 2023-04 |
| University of Maryland — Study on Encoded Reasoning in LLMs | University of Maryland | $218K | Coefficient Giving | 2025-06 |
| Charity Science: Health — General Support | Charity Science Health | $200K | Coefficient Giving | 2016-11 |
| One year of seed funding for a new AI interpretability research organisation | Jessica Rumbelow | $195K | Long-Term Future Fund (LTFF) | 2023-01 |
| Livestock Welfare Strategies — Improved Chicken Depopulation Welfare | Livestock Welfare Strategies | $165K | Coefficient Giving | 2025-05 |
| Scoping Developmental Interpretability | Jesse Hoogland | $145K | Manifund | 2023-07-18 |
| The Centre for Economic Policy Research — Macroeconomic Impacts of Cash Transfers | The Centre for Economic Policy Research | $125K | Coefficient Giving | 2024-12 |
| FAR.AI — AI Interpretability Research | FAR AI | $100K | Coefficient Giving | 2023-03 |
Funding by Funder
| Funder | Grants | Total Amount |
|---|---|---|
| Coefficient Giving | 58 | $108M |
| Long-Term Future Fund (LTFF) | 41 | $1.7M |
| Manifund | 15 | $668K |
| FTX Future Fund | 1 | $30K |
Sub-Areas (7)
| Name | Description | Status | Orgs | Papers |
|---|---|---|---|---|
| Activation Monitoring | Using probes and monitors on model internals to detect deception, harmful intent, or anomalous reasoning in real time. | emerging | 3 | 0 |
| Externalizing Reasoning | Ensuring AI systems' chain-of-thought reasoning faithfully reflects their actual internal computations. | emerging | 3 | 0 |
| Interpretability Benchmarks | Standardized tasks and metrics for comparing interpretability methods. | emerging | 3 | 0 |
| Linear Probing | Lightweight interpretability using linear classifiers on model activations to detect features (see the probing sketch below). | active | 3 | 0 |
| Mechanistic Interpretability | Reverse-engineering neural networks to identify circuits, features, and algorithms that explain behavior at a mechanistic level. | active | 5 | 2 |
| Representation Engineering | Controlling AI behavior by directly manipulating internal representations, including activation addition and steering vectors (see the steering sketch below). | emerging | 3 | 1 |
| Transparent Architectures | Designing neural network architectures that are inherently more interpretable, as opposed to post-hoc interpretation of opaque models. | emerging | 3 | 0 |
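The linear-probing sub-area above reduces to a few lines of code: cache a model's hidden activations, train a linear classifier against a feature label, and read the learned weight vector off as a candidate direction for that feature. A minimal sketch, using synthetic stand-in activations rather than a real model; the dimensions, labels, and planted feature direction are all assumptions for the demo:

```python
# Minimal linear-probe sketch: fit a logistic-regression classifier on
# (synthetic) hidden activations to detect a binary feature.
# The activations are random stand-ins; in practice you would cache real
# residual-stream activations from a model forward pass.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 512, 2000
labels = rng.integers(0, 2, size=n_examples)  # feature present / absent

# Stand-in activations: the "feature" shifts them along a fixed direction.
feature_direction = rng.normal(size=d_model)
activations = rng.normal(size=(n_examples, d_model)) + np.outer(labels, feature_direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
# probe.coef_[0] is the learned direction: a candidate linear
# representation of the feature in activation space.
```

The same probe, run on live activations at inference time, is essentially what the activation-monitoring sub-area deploys in real time.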
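Representation engineering's activation addition can likewise be sketched with a forward hook that shifts one layer's output along a steering vector. A minimal PyTorch sketch on a toy MLP standing in for a transformer block; the steering vector and strength here are arbitrary placeholders (in practice the vector is typically derived, e.g. as a difference of mean activations between contrasting prompt sets):

```python
# Minimal activation-addition sketch: add a "steering vector" to one
# layer's output during the forward pass via a hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Toy 2-layer MLP standing in for a transformer block.
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

steering_vector = torch.randn(d_model)  # placeholder; normally derived from data
alpha = 4.0                             # steering strength (assumed)

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + alpha * steering_vector

handle = model[0].register_forward_hook(add_steering)

x = torch.randn(1, d_model)
steered = model(x)
handle.remove()
unsteered = model(x)
print("output shift norm:", (steered - unsteered).norm().item())
```

The appeal of the hook-based design is that it needs no retraining and no architecture changes: the same frozen model can be steered toward or away from a behavior by scaling alpha at inference time.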