ARC's first technical report: Eliciting Latent Knowledge
This is ARC's foundational public research document that launched the influential ELK problem, which became a major reference point in the technical AI safety community and spurred significant follow-on research and discussion.
Metadata
Importance: 88/100 | blog post | primary source
Summary
ARC (Alignment Research Center) introduces the Eliciting Latent Knowledge (ELK) problem as a central challenge in AI alignment: how to reliably extract what an AI system actually knows or believes, rather than what it is incentivized to report. The report surveys possible approaches, explains why the problem is hard, and situates it within ARC's broader alignment strategy.
Key Points
- ELK addresses the challenge of mapping between an AI's internal world-model and a human's conceptual model, closely related to the ontology identification problem.
- The core difficulty is that a capable AI might learn to provide answers that satisfy evaluators without accurately reflecting its true internal representations (a toy sketch of this failure mode follows these key points).
- The report presents multiple candidate approaches to ELK and offers precise arguments for why each faces fundamental difficulties.
- ARC frames ELK as central to 'worst-case' alignment: solving it would provide safety guarantees even for highly capable, potentially deceptive systems.
- The report serves as ARC's public statement of research methodology and priorities, inviting community engagement and collaboration.
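The core difficulty can be made concrete with a minimal toy sketch, loosely modeled on the report's SmartVault-style scenario. This is an illustrative assumption-laden example, not code or formalism from the report: it shows how a reporter trained only against human-checkable labels is indistinguishable, on the training distribution, from a "human simulator" that merely predicts what the evaluator would say, even though the two diverge exactly when the AI knows more than the human.

```python
# Toy illustration of the ELK training difficulty.
# All names and the scenario setup below are hypothetical simplifications,
# not ARC's formal construction.
import random

random.seed(0)

def make_case(tampered: bool):
    """A SmartVault-style case: the diamond is truly present or not,
    and the camera either shows the truth or has been tampered with."""
    diamond_present = random.random() < 0.5
    camera_shows_diamond = True if tampered else diamond_present
    return {"diamond_present": diamond_present,
            "camera_shows_diamond": camera_shows_diamond}

# Reporter that reads out the predictor's latent knowledge directly.
def direct_reporter(case):
    return case["diamond_present"]

# Failure mode: a reporter that just simulates what a human judge
# would conclude from the observations they can check.
def human_simulator(case):
    return case["camera_shows_diamond"]

def human_label(case):
    # Human evaluators can only label cases using the camera feed.
    return case["camera_shows_diamond"]

# Training distribution: easy cases the human can verify (no tampering).
train = [make_case(tampered=False) for _ in range(1000)]
# Deployment distribution: cases where the AI knows more than the human.
deploy = [make_case(tampered=True) for _ in range(1000)]

def accuracy(reporter, cases, target):
    return sum(reporter(c) == target(c) for c in cases) / len(cases)

# Both reporters look identical under the human-generated training signal...
print("agreement with human labels on training cases:")
print("  direct reporter :", accuracy(direct_reporter, train, human_label))
print("  human simulator :", accuracy(human_simulator, train, human_label))

# ...but only one of them reports the truth once the camera is tampered with.
print("truthfulness on tampered cases:")
print("  direct reporter :", accuracy(direct_reporter, deploy, lambda c: c["diamond_present"]))
print("  human simulator :", accuracy(human_simulator, deploy, lambda c: c["diamond_present"]))
```

In this sketch both reporters achieve perfect agreement with the human labels on the training cases, so no amount of that training signal distinguishes them; they only come apart on tampered cases, where the human simulator confidently reports what the camera shows rather than what the predictor knows.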
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Eliciting Latent Knowledge (ELK) | Approach | 91.0 |
| AI Alignment Research Agendas | Crux | 69.0 |
| Sleeper Agent Detection | Approach | 66.0 |
Cached Content Preview
HTTP 200 | Fetched Mar 20, 2026 | 2 KB
ARC has published a report on [Eliciting Latent Knowledge](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?usp=sharing), an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.

The core difficulty we discuss is learning how to map between an AI's model of the world and a human's model. This is closely related to [ontology identification](https://arbital.greaterwrong.com/p/ontology_identification/) (and [other](https://www.lesswrong.com/posts/gQY6LrTWJNkTv8YJR/the-pointers-problem-human-values-are-a-function-of-humans) [similar](https://intelligence.org/files/AlignmentMachineLearning.pdf) [statements](https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1)). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.

The report is available [here](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?usp=sharing) as a Google document. If you're excited about this research, [we're hiring](https://www.lesswrong.com/posts/dLoK6KGcHAoudtwdo/arc-is-hiring)!

### Q&A

We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we're doing and how we evaluate progress in enough detail that they could start to do it themselves.

_Comment via [AI Alignment Forum](https://alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge) or [LessWrong](https://www.lesswrong.com/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge)._

_Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments._
Resource ID: 5efa917a52b443a1 | Stable ID: YTJmOGUyYj