Eliciting Latent Knowledge (ELK)
Comprehensive analysis of the Eliciting Latent Knowledge problem with quantified research metrics: ARC's prize contest received 197 proposals, awarded $274K, but $50K and $100K prizes remain unclaimed. CCS achieves 4% above zero-shot on 10 datasets; Quirky LMs recover 89% of truth-untruth gap with 0.95 anomaly detection AUROC. Research investment estimated at $1-5M/year across ARC (3 permanent researchers), EleutherAI, and academic groups. Problem fundamentally unsolved—no proposal survives ARC's builder-breaker methodology.
Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Research Investment | Low ($1-5M/year) | ARC Theory team has 3 permanent researchers; received $265K from Coefficient Giving (formerly Open Philanthropy) in 2022; returned $1.25M FTX grant |
| Problem Status | Unsolved | 197 proposals submitted to the ELK Prize; $50K and $100K prizes remain unclaimed |
| Best Empirical Results | 75-89% AUROC | Quirky LMs paper: LogR on contrast pairs recovers 89% of the truth-untruth gap; anomaly detection achieves 0.95 AUROC |
| CCS Baseline | 4% above zero-shot | Burns et al. 2023: CCS outperforms zero-shot by 4% on average across 10 datasets |
| Timeline to Solution | Unknown (potentially unsolvable) | No proposed solution survives ARC's builder-breaker methodology; theoretical barriers may be fundamental |
| Safety Importance | Critical (if solved) | Would enable detection of deceptive alignment and verification of AI beliefs |
| Industry Adoption | None | Purely theoretical research; no production implementations |
| Grade | C+ | High potential impact but uncertain tractability; limited progress despite 3+ years of focused research |
Overview
Eliciting Latent Knowledge (ELK) represents one of the most important unsolved problems in AI alignment: how do we get an AI system to report what it actually knows or believes, rather than what it predicts humans want to hear or what produces favorable outcomes? The problem was formalized in the Alignment Research Center's (ARC) 2021 report and has since become a central focus of alignment research. ARC's ELK Prize contest received 197 proposals and awarded $274,000 in prizes (32 prizes of $5K-$20K plus 24 honorable mentions of $1K), but the $50,000 and $100,000 prizes for solutions requiring genuinely new ideas to break remain unclaimed.
The core challenge arises because advanced AI systems may develop sophisticated internal representations of the world that go beyond what can be directly observed by humans. If such a system is optimizing for human approval or reward, it might learn to report information that satisfies human evaluators rather than information that accurately reflects its internal knowledge. This creates a fundamental obstacle to AI oversight: we cannot trust what AI systems tell us if they have learned to be strategically misleading.
ELK is particularly important for AI safety because it would, if solved, provide a foundation for detecting AI deception. An AI system that genuinely reports its beliefs cannot deceive us about its knowledge, intentions, or capabilities. Conversely, without ELK, we may be unable to verify AI alignment even in principle, since any behavioral signal could be faked by a sufficiently capable deceptive system. Despite 3+ years of focused research and contributions from researchers at organizations including EleutherAI, UC Berkeley, and independent groups, no proposed solution has survived ARC's adversarial "builder-breaker" methodology, and some researchers suspect the problem may be fundamentally unsolvable.
Risk Assessment & Impact

| Dimension | Assessment | Evidence | Timeline |
|---|---|---|---|
| Safety Uplift | High (if solved) | Would enable detecting deception | Unknown |
| Capability Uplift | Some | Better knowledge extraction is useful | Ongoing |
| Net World Safety | Helpful | Critical research direction | Long-term |
| Lab Incentive | Moderate | Some commercial value; mostly safety-motivated | Current |
| Research Investment | $1-15M/yr | ARC, some academic groups | Current |
| Current Adoption | None | Theoretical research | Current |
The ELK Problem
The Setup

| Component | Description |
|---|---|
| AI System | Has sophisticated world model; may know things humans don't |
| Training | Optimized for human approval or reward signal |
| Latent Knowledge | What the AI actually believes about the world |
| Reported Knowledge | What the AI tells humans |
| ELK Problem | Ensure Reported = Latent |
Why It's Hard
The difficulty is that any method we use to elicit knowledge must be robust against an AI that might have learned to produce strategically selected outputs:

| Challenge | Description | Severity |
|---|---|---|
| No Ground Truth | Can't verify AI's internal beliefs directly | Critical |
| Human Evaluator Limitations | Humans can't check superhuman knowledge | Critical |
| Training Incentives | AI trained to satisfy evaluators, not report truth | High |
| Representation Complexity | AI's beliefs may not map to human concepts | High |
| Deceptive Strategies | AI might learn to hide true beliefs | Critical |
The Reporter Distinction
ARC's framing distinguishes between two types of "reporters" the AI might learn:

| Reporter Type | Behavior | Problem |
|---|---|---|
| Human Simulator | Reports what humans would believe | May be wrong when AI knows better |
| Direct Translator | Reports AI's actual beliefs | What we want |
The challenge is ensuring training produces a direct translator rather than a human simulator.
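A toy illustration of why training pressure favors the human simulator: on questions humans can check, approval-based reward cannot tell the two reporters apart, and on questions where the AI knows better than the human, it actively rewards the human simulator. The sketch below is purely illustrative; the `Scenario` class and reward function are invented for this example and are not part of ARC's formalism.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative toy model (not ARC's formalism): each scenario records what the
# predictor's world model "believes" and what a human evaluator would conclude.

@dataclass
class Scenario:
    latent_belief: bool     # the predictor's actual belief
    human_judgment: bool    # what a human evaluator would say is true

def direct_translator(s: Scenario) -> bool:
    return s.latent_belief   # the reporter we want

def human_simulator(s: Scenario) -> bool:
    return s.human_judgment  # the failure mode: predicts the evaluator, not the truth

def approval_reward(reporter: Callable[[Scenario], bool], data: List[Scenario]) -> float:
    """Reward from an evaluator who checks answers against their own judgment."""
    return sum(reporter(s) == s.human_judgment for s in data) / len(data)

# Cases the human can verify: both reporters get full reward, so approval-based
# training cannot distinguish them...
easy = [Scenario(True, True), Scenario(False, False)]
# ...and on a case where the AI knows better, the human simulator gets MORE reward.
hard = [Scenario(latent_belief=True, human_judgment=False)]

for name, reporter in [("direct translator", direct_translator),
                       ("human simulator", human_simulator)]:
    print(name, approval_reward(reporter, easy + hard))
# human simulator scores 1.0; the honest direct translator scores only 2/3.
```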
ARC's ELK Report
Key Insights

| Insight | Description | Implication |
|---|---|---|
| The Problem is Real | AI could learn a human simulator | Need solutions |
| Current Methods Fail | Standard training doesn't solve ELK | Need new approaches |
| Counterexamples Exist | For every proposed solution | Problem is hard |
Prize Structure
ARC ran an ELK prize contest from January-February 2022, receiving 197 proposals and awarding $274,000 in total:

| Prize Tier | Requirement | Status | Details |
|---|---|---|---|
| $5,000-$20,000 | Novel perspective or counterexample | 32 awarded | $250,000 total |
| $1,000 (Honorable Mention) | Promising ideas | 24 awarded | $24,000 total |
| $50,000 | Proposal requiring genuinely new idea to break | Unclaimed | No proposal met bar |
| $100,000 | Full solution | Unclaimed | No solution found |
The first round evaluated 30 proposals from 25 people, awarding $70,000 to 8 contributors. Notable participants included Carl Shulman, Victoria Krakovna (DeepMind), Mary Phuong, and Holden Karnofsky (who would have received $15,000 but was ineligible as an ARC board member). ARC noted that most submissions explored approaches they had also considered, revealing significant convergence among researchers working on the problem independently.
ARC's research methodology functions as a game between a "builder," who proposes a training strategy intended to produce a direct translator, and a "breaker," who tries to construct a counterexample in which that strategy still yields a human simulator.
Every proposed solution faces a counterexample:
- Proposal: Use method X to extract true beliefs
- Counterexample: An AI that produces X-satisfying outputs without reporting true beliefs
- Conclusion: X doesn't reliably solve ELK
This adversarial methodology ensures proposed solutions are robust, but it also explains why progress has been slow: any solution must work against the worst-case deceptive strategy.
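In procedural terms, the game has a simple structure, sketched below. The function names are hypothetical; the point is only the quantifier structure: a proposal survives only if the breaker can find no counterexample at all.

```python
from typing import Callable, Iterable, Optional

# Illustrative paraphrase of the builder-breaker game (not ARC's actual tooling).
# The breaker wins by exhibiting any predictor/world pair on which the builder's
# training strategy still yields a human simulator.

def run_builder_breaker(
    proposals: Iterable[str],
    find_counterexample: Callable[[str], Optional[str]],
) -> list:
    """Return the proposals for which the breaker found no counterexample."""
    surviving = []
    for proposal in proposals:
        counterexample = find_counterexample(proposal)  # breaker searches worst cases
        if counterexample is None:
            surviving.append(proposal)                  # would require a new idea to break
        else:
            print(f"{proposal!r} broken by: {counterexample}")
    return surviving

# As of the ELK prize contest, `surviving` is empty for every submitted proposal.
```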
Contrast-Consistent Search (CCS) - Burns et al. (2023) introduced an unsupervised probing method to determine if language models represent sentences as true or false. CCS learns a binary classification of internal states such that a statement and its negation are pushed into opposite classes. The method outperforms zero-shot accuracy by 4% on average across 6 models and 10 question-answering datasets. However, follow-up work by Farquhar et al. demonstrated that contrast pairs drive most of the empirical performance—PCA and LDA on contrast pair differences achieve 97-98% of CCS's accuracy, suggesting the contrast pairs rather than the CCS objective are doing the heavy lifting.
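The CCS objective itself is compact enough to sketch. Below is a minimal PyTorch rendering of the published loss (a consistency term plus a confidence term over negation pairs), assuming pre-extracted, normalized hidden states; the training loop and variable names are simplified assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

# h_pos, h_neg: hidden states for "X is true" / "X is false" phrasings of the
# same statements, shape (n_statements, hidden_dim), each mean-normalized.

class CCSProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.linear(h)).squeeze(-1)  # P(statement is true)

def ccs_loss(p_pos, p_neg):
    # Consistency: P(true | x+) should equal 1 - P(true | x-) for a negation pair.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution where everything maps to 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs(h_pos, h_neg, epochs=1000, lr=1e-3):
    probe = CCSProbe(h_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe  # CCS is unsupervised, so the probe's sign must be fixed afterwards
```

Note that nothing in this objective references ground truth, which is exactly why the Farquhar et al. critique matters: unsupervised structure in the contrast-pair differences, not the loss itself, appears to carry most of the signal.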
Quirky Language Models - Mallen et al. (2024) created 12 datasets with "quirky" LMs finetuned to make systematic errors when "Bob" appears in prompts. This creates a controlled setting where the model "knows" the truth but outputs falsehoods. Key quantitative findings:

| Metric | Result | Significance |
|---|---|---|
| Truth recovery (LogR on contrast pairs) | 89% of AUROC gap | Best probing method in-distribution |
| Hard question generalization | 75% of AUROC gap | Performance degrades on harder questions |
| Anomaly detection AUROC | 0.95 | Can flag when model is being untruthful |
| Optimal layer for probing | Middle layers | Knowledge is context-independent in these layers |
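The headline probing recipe is simple: take activation differences between each statement and its negation, fit logistic regression on examples from the truthful ("Alice") context, and measure AUROC on the untruthful ("Bob") context. The sketch below follows that recipe in spirit; variable names and the activation-extraction step are assumptions, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# acts_pos / acts_neg: middle-layer activations for a statement and its negation,
# shape (n_examples, hidden_dim); labels: ground-truth answers (1 = true).

def fit_contrast_probe(acts_pos, acts_neg, labels):
    X = acts_pos - acts_neg                  # contrast-pair difference features
    return LogisticRegression(max_iter=1000).fit(X, labels)

def eval_probe(probe, acts_pos, acts_neg, labels):
    X = acts_pos - acts_neg
    scores = probe.predict_proba(X)[:, 1]
    return roc_auc_score(labels, scores)     # the paper reports this style of probe
                                             # recovers ~89% of the truth-untruth AUROC gap
```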
Representation Engineering - Zou et al. (2023) introduced methods to identify and steer concept vectors in activation space. Steering "truthfulness" directions at selected layers increases TruthfulQA accuracy by up to 30 percentage points, achieving state-of-the-art results. However, concept vectors may not be specific, with steering one concept affecting others—a significant limitation for safety applications.
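A minimal sketch of the steering step, assuming a difference-of-means "truthfulness" direction and a standard PyTorch forward hook; the layer index, steering scale, and stimulus prompts are placeholders rather than the paper's exact recipe.

```python
import torch

def truthfulness_direction(honest_acts: torch.Tensor, dishonest_acts: torch.Tensor):
    """Difference-of-means 'truthfulness' vector from paired stimulus activations."""
    direction = honest_acts.mean(dim=0) - dishonest_acts.mean(dim=0)
    return direction / direction.norm()

def add_steering_hook(layer_module, direction: torch.Tensor, scale: float = 5.0):
    """Register a forward hook that shifts the layer's output along `direction`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# Hypothetical usage: handle = add_steering_hook(model.model.layers[15], v, scale=5.0)
# ... run generation ...; handle.remove()
```

The cross-concept interference noted above shows up here directly: because the shift is applied to the entire residual stream, anything else encoded along correlated directions moves too.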
Quantitative Comparison of Methods

| Method | Benchmark | Performance | Adversarial Robustness | Key Limitation |
|---|---|---|---|---|
| CCS | 10 QA datasets | +4% vs zero-shot | Untested | Contrast pairs do heavy lifting; 97-98% recoverable with PCA |
| LogR on contrast pairs | Quirky LMs | 89% AUROC gap recovery | Limited | Degrades to 75% on harder questions |
| Anomaly detection | Quirky LMs | 0.95 AUROC | Limited | Requires known truthful baseline (see the sketch after this table) |
| Representation Engineering | TruthfulQA | +30 percentage points | Untested | Steering affects multiple concepts |
| Linear probes (middle layers) | Various | 60-80% accuracy | Low | May find "what to say" not "what is true" |
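The anomaly-detection entry above depends on a trusted baseline. One common recipe, roughly in the spirit of the quirky-LM work, is to fit a Gaussian to activations from known-truthful contexts and score test activations by Mahalanobis distance; the regularization constant and layer choice below are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fit_baseline(trusted_acts: np.ndarray, eps: float = 1e-3):
    """Fit mean/covariance on activations from contexts known to be truthful."""
    mean = trusted_acts.mean(axis=0)
    cov = np.cov(trusted_acts, rowvar=False) + eps * np.eye(trusted_acts.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_scores(acts: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray):
    diff = acts - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared Mahalanobis distance

# AUROC for flagging untruthful-context activations as anomalous:
# roc_auc_score(is_untruthful_context, anomaly_scores(test_acts, mean, cov_inv))
```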
What's Been Tried

| Approach | Result | Conclusion |
|---|---|---|
| CCS Probing | 4% above zero-shot on 10 datasets | Not robust to adversarial settings; contrast pairs drive performance |
| Linear Probes | Find context-independent knowledge in middle layers | Might find "what to say" not "what is true" |
| Representation Engineering | +30pp on TruthfulQA | Unknown if scales to genuinely deceptive AI |
| Quirky Models | 89% truth recovery, 0.95 anomaly AUROC | Promising but not adversarial test—models not optimized to deceive |
| Consistency Methods | Various proposals | All have counterexamples; can be gamed by consistent deception |
Why Progress is Slow

| Factor | Description |
|---|---|
| Fundamental Difficulty | May be asking for too much |
| Adversarial Framing | Solution must work against worst-case adversary |
| No Empirical Testbed | Hard to experiment without genuinely deceptive AI |
| Unclear If Solvable | May be impossible in the general case |
| Benchmark Limitations | Current models not adversarially deceptive |
Potential Implications if Unsolvable

| Scenario | Implication |
|---|---|
| ELK Impossible | Must rely on AI control rather than trust |
| ELK Requires Limits | Only solvable for limited AI capabilities |
| ELK Needs New Paradigm | Fundamentally different approach needed |
Connection to Other Problems
Related Alignment Problems

| Problem | Relationship |
|---|---|
| Deceptive Alignment | ELK would detect deceptive alignment |
| Interpretability | ELK is a specific interpretability goal |
| Scalable Oversight | ELK enables human oversight of superhuman AI |
| Honest AI | ELK is a prerequisite for guaranteed honesty |
If ELK is Solved

| Benefit | Mechanism |
|---|---|
| Deception Detection | AI must report true beliefs |
| Oversight Enabled | Can trust AI reports |
| Alignment Verification | Can check if AI has aligned goals |
| Safety Assurance | Foundation for many safety techniques |
Scalability Assessment

| Dimension | Assessment | Rationale |
|---|---|---|
| Technical Scalability | Unknown | Core open problem |
| Deception Robustness | Strong (if solved) | Solving ELK = solving deception detection |
| SI Readiness | Maybe | Would need to be solved before superintelligence |
Risks Addressed
If ELK is solved, it would address:

| Risk | Mechanism | Effectiveness |
|---|---|---|
| Deceptive Alignment | Forces honest reporting of intentions | Very High |
| Scheming | Detects strategic deception | Very High |
| Oversight Failure | Enables verification of AI beliefs | Very High |
| Value Misalignment | Can detect misaligned goals | High |
Limitations
- May Be Impossible: No proof that ELK is solvable in general
- Current Solutions Fail: All proposed approaches have counterexamples
- Requires Unsafe AI to Test: Hard to evaluate without genuinely deceptive models
ELK research affects the AI Transition Model through oversight capabilities:

| Factor | Parameter | Impact |
|---|---|---|
| Human Oversight Quality | Oversight effectiveness | Enables verification of AI beliefs and intentions |
| Alignment Robustness | Deception detection | Would make deceptive alignment impossible |
| Misalignment Potential | Alignment verification | Could verify AI systems are aligned |
ELK is a foundational problem for AI safety. If solved, it would provide the basis for genuine oversight of advanced AI systems. If unsolvable, it implies we must rely on control-based approaches rather than trust-based oversight. The problem's difficulty and importance make it a critical research priority despite uncertain tractability.
Related approaches: Scheming & Deception Detection, Representation Engineering, Mechanistic Interpretability, Weak-to-Strong Generalization, Sleeper Agent Detection, Probing / Linear Probes

Concepts
Deceptive Alignment, Scheming, Interpretability, Scalable Oversight, Open Philanthropy, AI Transition Model

Organizations
Redwood Research