
Interpretability

Status: active

Research field focused on understanding how AI models work internally, encompassing all approaches to model transparency.

Organizations: 3
Grants: 115
Total Funding: $111M
Risks Addressed: 3
Cluster: Interpretability

Tags: interpretability, safety-research

Organizations (3)

| Organization | Role |
| --- | --- |
| Anthropic | pioneer |
| Google DeepMind | active |
| Redwood Research | active |

Grants (top 50 of 115)

| Name | Recipient | Amount | Funder | Date |
| --- | --- | --- | --- | --- |
| MATS Research — General Support | MATS ML Alignment Theory Scholars program | $24M | Coefficient Giving | 2025-05 |
| Against Malaria Foundation — General Support (January 2017) | Against Malaria Foundation | $15M | Coefficient Giving | 2017-01 |
| Schistosomiasis Control Initiative — General Support (2017) | Schistosomiasis Control Initiative | $14M | Coefficient Giving | 2017-01 |
| Malaria Consortium — Seasonal Malaria Chemoprevention (February 2017) | Malaria Consortium | $5M | Coefficient Giving | 2017-02 |
| The END Fund — Deworming (December 2016) | The END Fund | $5M | Coefficient Giving | 2016-12 |
| Deworm the World Initiative — General Support (January 2017) | Evidence Action | $4.5M | Coefficient Giving | 2017-01 |
| MATS Research — MATS Program 6.0 + 7.0 | MATS ML Alignment Theory Scholars program | $3.4M | Coefficient Giving | 2024-06 |
| Sightsavers — Deworming (January 2017) | Sightsavers | $3M | Coefficient Giving | 2017-01 |
| Eleuther AI — Interpretability Research | Eleuther AI | $2.6M | Coefficient Giving | 2023-11 |
| Berkeley Existential Risk Initiative — MATS (November 2023) | Berkeley Existential Risk Initiative | $2.6M | Coefficient Giving | 2023-11 |
| GiveDirectly — General Support (January 2017) | GiveDirectly | $2.5M | Coefficient Giving | 2017-01 |
| AI Safety Support — MATS Program 6.0 + 7.0 | AI Safety Support | $2.4M | Coefficient Giving | 2024-05 |
| Apollo Research — General Support | Apollo Research | $2.2M | Coefficient Giving | 2024-05 |
| Apollo Research — Startup Funding | Apollo Research | $1.5M | Coefficient Giving | 2023-06 |
| Drug Policy Alliance — Drug Decriminalization | Drug Policy Alliance | $1.3M | Coefficient Giving | 2017-03 |
| University of California, Berkeley — PAVITRA Pollution Modeling Tool | University of California, Berkeley | $1.3M | Coefficient Giving | 2022-09 |
| Center for Popular Democracy — Fed Up Campaign (2018) | Center for Popular Democracy | $1.2M | Coefficient Giving | 2018-02 |
| UC Berkeley — AI Safety Research (2018) | University of California, Berkeley | $1.1M | Coefficient Giving | 2018-11 |
| Center for Popular Democracy — Fed Up Campaign (2017) | Center for Popular Democracy | $1.1M | Coefficient Giving | 2017-02 |
| Northeastern University — Large Language Model Interpretability Research (2024) | Northeastern University | $1.1M | Coefficient Giving | 2024-05 |
| Harvard University — AI Interpretability, Controllability, and Safety Research | Harvard University | $1M | Coefficient Giving | 2024-01 |
| theguardian.org — Journalism on Factory Farming and Animal Cruelty (2020) | theguardian.org | $900K | Coefficient Giving | 2020-01 |
| theguardian.org — Journalism on Factory Farming and Farm Animal Cruelty | theguardian.org | $887K | Coefficient Giving | 2017-12 |
| Guide Labs — Open Access Interpretability Project | Guide Labs | $750K | Coefficient Giving | 2023-08 |
| Stanford University — AI Interpretability Research | Stanford University | $744K | Coefficient Giving | 2025-04 |
| Practical AI Alignment and Interpretability Research Group — Interpretability Work | Practical AI Alignment and Interpretability Research Group | $737K | Coefficient Giving | 2024-09 |
| AI Safety Support — MATS Program (November 2023) | AI Safety Support | $733K | Coefficient Giving | 2023-11 |
| AI Safety Support — MATS Program Extension | AI Safety Support | $710K | Coefficient Giving | 2024-10 |
| MATS Research — AI Safety Research Expenses | MATS ML Alignment Theory Scholars program | $660K | Coefficient Giving | 2024-12 |
| Indian Institute of Technology Bombay — PAVITRA Pollution Modeling Tool | Indian Institute of Technology Bombay | $648K | Coefficient Giving | 2022-11 |
| Center for Popular Democracy — Fed Up Campaign (2019) | Center for Popular Democracy | $600K | Coefficient Giving | 2019-11 |
| Center for Study of Science, Technology and Policy — PAVITRA Pollution Modeling Tool | CSTEP | $565K | Coefficient Giving | 2022-09 |
| Northeastern University — Large Language Model Interpretability Research | Northeastern University | $562K | Coefficient Giving | 2022-11 |
| Alliance for Open Society International — Albanian Renaissance | Alliance for Open Society International | $500K | Coefficient Giving | 2014-07 |
| Center for Popular Democracy — Fed Up Campaign (2020) | Center for Popular Democracy | $465K | Coefficient Giving | 2020-06 |
| theguardian.org — Journalism on Factory Farming and Animal Cruelty (2021) | theguardian.org | $450K | Coefficient Giving | 2021-12 |
| University of Washington — PAVITRA Pollution Modeling Tool | University of Washington | $443K | Coefficient Giving | 2022-09 |
| The Kids Research Institute Australia — Malaria Vaccine Modelling | The Kids Research Institute Australia | $410K | Coefficient Giving | 2024-10 |
| Apollo Research — Scale up interpretability & behavioral model evals research | Apollo Research | $339K | Manifund | 2023-07-20 |
| Center for Popular Democracy Action Fund — Fed Up Campaign (2017) | Center for Popular Democracy Action Fund | $305K | Coefficient Giving | 2017-02 |
| Cooperation Canada — International Assistance Advocacy in Canada | Cooperation Canada | $300K | Coefficient Giving | 2023-01 |
| American Civil Liberties Union of Northern California — Prosecutorial Accountability | American Civil Liberties Union Foundation of Northern California | $250K | Coefficient Giving | 2017-02 |
| Leap Labs — Interpretability Research | Leap Labs | $230K | Coefficient Giving | 2023-04 |
| University of Maryland — Study on Encoded Reasoning in LLMs | University of Maryland | $218K | Coefficient Giving | 2025-06 |
| Charity Science: Health — General Support | Charity Science Health | $200K | Coefficient Giving | 2016-11 |
| One year of seed funding for a new AI interpretability research organisation | Jessica Rumbelow | $195K | Long-Term Future Fund (LTFF) | 2023-01 |
| Livestock Welfare Strategies — Improved Chicken Depopulation Welfare | Livestock Welfare Strategies | $165K | Coefficient Giving | 2025-05 |
| Scoping Developmental Interpretability | Jesse Hoogland | $145K | Manifund | 2023-07-18 |
| The Centre for Economic Policy Research — Macroeconomic Impacts of Cash Transfers | The Centre for Economic Policy Research | $125K | Coefficient Giving | 2024-12 |
| FAR.AI — AI Interpretability Research | FAR AI | $100K | Coefficient Giving | 2023-03 |

Funding by Funder

| Funder | Grants | Total Amount |
| --- | --- | --- |
| Coefficient Giving | 58 | $108M |
| Long-Term Future Fund (LTFF) | 41 | $1.7M |
| Manifund | 15 | $668K |
| FTX Future Fund | 1 | $30K |

Sub-Areas (7)

| Name | Description | Status | Orgs | Papers |
| --- | --- | --- | --- | --- |
| Activation Monitoring | Using probes and monitors on model internals to detect deception, harmful intent, or anomalous reasoning in real time. | emerging | 3 | 0 |
| Externalizing Reasoning | Ensuring AI systems' chain-of-thought reasoning faithfully reflects their actual internal computations. | emerging | 3 | 0 |
| Interpretability Benchmarks | Standardized tasks and metrics for comparing interpretability methods. | emerging | 3 | 0 |
| Linear Probing | Lightweight interpretability using linear classifiers on model activations to detect features (see the probing sketch below). | active | 3 | 0 |
| Mechanistic Interpretability | Reverse-engineering neural networks to identify circuits, features, and algorithms that explain behavior at a mechanistic level. | active | 5 | 2 |
| Representation Engineering | Controlling AI behavior by directly manipulating internal representations, including activation addition and steering vectors (see the steering sketch below). | emerging | 3 | 1 |
| Transparent Architectures | Designing neural network architectures that are inherently more interpretable, as opposed to post-hoc interpretation of opaque models. | emerging | 3 | 0 |
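
To make the Linear Probing row concrete (the same probe idea underlies Activation Monitoring), here is a minimal sketch of training a linear probe on model activations. It is illustrative only: the activation matrix is a synthetic stand-in and `feature_direction` is a hypothetical planted signal; in practice the activations would be cached from a model's residual stream at a chosen layer.

```python
# Minimal linear-probing sketch. Assumes synthetic activations; in practice
# these would be residual-stream vectors cached from a model forward pass.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_samples = 256, 2000

# Hypothetical planted signal: positive examples are shifted along one direction.
feature_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_samples)
activations = rng.normal(size=(n_samples, d_model)) + np.outer(labels, feature_direction)

X_train, X_test, y_train, y_test = train_test_split(activations, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy suggests the feature is linearly decodable at this
# layer; near-chance accuracy suggests it is not.
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```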
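
The Representation Engineering row mentions activation addition and steering vectors; the sketch below shows the core mechanic under toy assumptions. The model is a stand-in MLP, and the layer choice and steering coefficient are hypothetical; real applications attach the hook to a transformer block's residual stream and build the vector from contrastive prompts.

```python
# Minimal activation-addition ("steering vector") sketch on a toy model.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64
model = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
)
layer = model[0]  # stand-in for the layer whose activations we steer

# Contrastive input batches standing in for two sets of prompts.
pos_inputs = torch.randn(32, d_model) + 1.0
neg_inputs = torch.randn(32, d_model) - 1.0

cache = {}

def cache_hook(module, inputs, output):
    cache["acts"] = output.detach()

handle = layer.register_forward_hook(cache_hook)
model(pos_inputs)
pos_acts = cache["acts"]
model(neg_inputs)
neg_acts = cache["acts"]
handle.remove()

# Steering vector: mean activation difference between the two sets.
steering_vector = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + 4.0 * steering_vector  # 4.0 is a tunable strength

handle = layer.register_forward_hook(steering_hook)
steered = model(torch.randn(8, d_model))  # forward pass with steering applied
handle.remove()
```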