A nonprofit AI safety and security research organization founded in 2021, known for pioneering AI Control research, developing causal scrubbing interpretability methods, and conducting landmark alignment faking studies with Anthropic.
Safety Agenda
- Anthropic Core Views (Safety Agenda): Anthropic allocates 15-25% of R&D (~$100-200M annually) to safety research, including the world's largest interpretability team (40-60 researchers), while maintaining $5B+ revenue by 2025. Their RSP... (Quality: 62/100)
Approaches
- AI Alignment (Approach): Comprehensive review of AI alignment approaches finding current methods (RLHF, Constitutional AI) show 75%+ effectiveness on measurable safety metrics for existing systems but face critical scalabi... (Quality: 91/100)
- Capability Elicitation (Approach): Capability elicitation (systematically discovering what AI models can actually do through scaffolding, prompting, and fine-tuning) reveals 2-10x performance gaps versus naive testing. METR finds AI a... (Quality: 91/100)
Analysis
- AI Safety Intervention Effectiveness Matrix (Analysis): Quantitative analysis mapping 15+ AI safety interventions to specific risks reveals critical misallocation: 40% of 2024 funding ($400M+) flows to RLHF methods showing only 10-20% effectiveness agai... (Quality: 73/100)
- Model Organisms of Misalignment (Analysis): Model organisms of misalignment is a research agenda creating controlled AI systems exhibiting specific alignment failures as testbeds. Recent work achieves 99% coherence with 40% misalignment rate... (Quality: 65/100)
Policy
- Safe and Secure Innovation for Frontier Artificial Intelligence Models Act (Policy): California's SB 1047 required safety testing, shutdown capabilities, and third-party audits for AI models exceeding 10^26 FLOP or $100M training cost; it passed the legislature (Assembly 48-16, Sen... (Quality: 66/100) A worked compute-threshold sketch follows this list.
- New York RAISE Act (Policy): The New York RAISE Act represents the first comprehensive state-level AI safety legislation with enforceable requirements for frontier AI developers, establishing mandatory safety protocols, incide... (Quality: 73/100)
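The SB 1047 entry above hinges on a 10^26 training-FLOP threshold. As a rough illustration of how such a threshold is checked, here is a minimal sketch using the common ~6 x parameters x tokens approximation for dense-transformer training compute; the parameter and token counts are assumptions chosen for illustration, not figures from the bill or any specific model.

```python
# Minimal sketch, assuming the standard ~6 * parameters * tokens approximation
# for dense-transformer training FLOP. The model scale below is an illustrative
# assumption, not a figure from SB 1047 or any particular training run.

SB1047_FLOP_THRESHOLD = 1e26  # training-compute threshold cited above


def training_flop(params: float, tokens: float) -> float:
    """Rough training-compute estimate: ~6 FLOP per parameter per training token."""
    return 6.0 * params * tokens


if __name__ == "__main__":
    # Hypothetical frontier-scale run: 1e12 parameters trained on 3e13 tokens.
    flop = training_flop(params=1e12, tokens=3e13)
    print(f"Estimated training FLOP: {flop:.2e}")
    print("Above the 10^26 FLOP threshold?", flop > SB1047_FLOP_THRESHOLD)
```

Under these assumptions the run lands at roughly 1.8 x 10^26 FLOP, just above the threshold; the bill's alternative $100M training-cost trigger would be checked separately.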
Organizations
- Alignment Research Center (Organization): Comprehensive reference page on ARC (Alignment Research Center), covering its evolution from a dual theory/evals organization to ARC Theory (3 permanent researchers) plus the METR spin-out (Decembe... (Quality: 57/100)
- Conjecture (Organization): Conjecture is a 30-40 person London-based AI safety org founded in 2022, pursuing Cognitive Emulation (CoEm), building interpretable AI from the ground up rather than aligning LLMs, with $30M+ Series A ... (Quality: 37/100)
- Machine Intelligence Research Institute (Organization): Comprehensive organizational history documenting MIRI's trajectory from pioneering AI safety research (2000-2020) to policy advocacy after acknowledging research failure, with detailed financial da... (Quality: 50/100)
Risks
- Deceptive Alignment (Risk): Comprehensive analysis of deceptive alignment risk, where AI systems appear aligned during training but pursue different goals when deployed. Expert probability estimates range 5-90%, with key empir... (Quality: 75/100)
- AI Capability Sandbagging (Risk): Systematically documents sandbagging (strategic underperformance during evaluations) across frontier models, finding 70-85% detection accuracy with white-box probes, 18-24% accuracy drops on autono... (Quality: 67/100)
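The sandbagging entry above mentions white-box probes reaching 70-85% detection accuracy. The sketch below shows schematically what such a probe typically is: a simple linear classifier trained on hidden-layer activations labeled by behavior. The random "activations" and the injected mean shift are stand-in assumptions for illustration only; this is not the methodology or data of the cited work.

```python
# Minimal sketch of a white-box "probe": a linear classifier over hidden
# activations that flags strategic underperformance. Random vectors stand in
# for real model activations (an assumption made purely for illustration).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend activations: 1,000 examples x 512-dim hidden states, labeled
# 1 = "sandbagging" transcript, 0 = "honest effort" transcript.
X = rng.normal(size=(1000, 512))
y = rng.integers(0, 2, size=1000)
X[y == 1] += 0.2  # inject a weak linear signal so the probe has something to find

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out probe accuracy: {probe.score(X_test, y_test):.2f}")
```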
Other
- Ajeya Cotra (Person): Ajeya Cotra is a member of technical staff at METR and former senior advisor at Coefficient Giving (formerly Open Philanthropy), where she led technical AI safety grantmaking including a $25M agent... (Quality: 55/100)
- Holden Karnofsky (Person): Holden Karnofsky directed $300M+ in AI safety funding through Coefficient Giving (formerly Open Philanthropy), growing the field from ~20 to 400+ FTE researchers and developing influential framewor... (Quality: 40/100)
- Interpretability (Research Area): Mechanistic interpretability has extracted 34M+ interpretable features from Claude 3 Sonnet with 90% automated labeling accuracy and demonstrated 75-85% success in causal validation, though less th... (Quality: 66/100)
Concepts
- EA Epistemic Failures in the FTX Era: This page synthesizes post-FTX critiques of EA's epistemic and governance failures, identifying interlocking problems including donor hero-worship, funding concentration in volatile crypto assets, ... (Quality: 84/100)
- Large Language Models (Concept): Comprehensive assessment of LLM capabilities showing training costs growing 2.4x/year ($78-191M for frontier models, though DeepSeek achieved near-parity at $6M), o3 reaching 91.6% on AIME and 87.5... (Quality: 62/100)
- Large Language Models (Capability): Comprehensive analysis of LLM capabilities showing rapid progress from GPT-2 (1.5B parameters, 2019) to GPT-5 and Gemini 2.5 (2025), with training costs growing 2.4x annually and projected to excee... (Quality: 60/100) A compounding-growth sketch follows this list.
- Situational Awareness (Capability): Comprehensive analysis of situational awareness in AI systems, documenting that Claude 3 Opus fakes alignment 12% baseline (78% post-RL), 5 of 6 frontier models demonstrate scheming capabilities, a... (Quality: 67/100)
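Both Large Language Models entries above cite training costs growing roughly 2.4x per year. As a quick worked illustration (not a forecast), the sketch below compounds a base cost in the cited $78-191M range by that factor; the base figure and projection horizon are assumptions chosen for the example.

```python
# Minimal sketch of the ~2.4x/year training-cost growth cited above.
# The base cost and projection horizon are illustrative assumptions.

def projected_cost(base_cost_usd: float, years: int, annual_growth: float = 2.4) -> float:
    """Compound a base training cost by the annual growth factor."""
    return base_cost_usd * annual_growth ** years


if __name__ == "__main__":
    base = 150e6  # mid-range of the $78-191M frontier-model figure cited above
    for year in range(4):
        print(f"Year +{year}: ${projected_cost(base, year) / 1e6:,.0f}M")
```

Starting from $150M, three years of 2.4x growth compounds to roughly $2.1B, which is the shape of the projection the entries above refer to.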
Key Debates
- Why Alignment Might Be Easy (Argument): Synthesizes empirical evidence that alignment is tractable, citing 29-41% RLHF improvements, Constitutional AI reducing bias across 9 dimensions, millions of interpretable features from Claude 3, a... (Quality: 53/100)