ARC's first technical report: Eliciting Latent Knowledge
This is ARC's foundational public research document that launched the influential ELK problem, which became a major reference point in the technical AI safety community and spurred significant follow-on research and discussion.
Metadata
Importance: 88/100 | blog post | primary source
Summary
ARC (Alignment Research Center) introduces the Eliciting Latent Knowledge (ELK) problem as a central challenge in AI alignment: how to reliably extract what an AI system actually knows or believes, rather than what it is incentivized to report. The report surveys possible approaches, explains why the problem is hard, and situates it within ARC's broader alignment strategy.
Key Points
- ELK addresses the challenge of mapping between an AI's internal world-model and a human's conceptual model, closely related to the ontology identification problem.
- The core difficulty is that a capable AI might learn to provide answers that satisfy evaluators without accurately reflecting its true internal representations (a toy sketch of this failure mode follows these key points).
- The report presents multiple candidate approaches to ELK and offers precise arguments for why each faces fundamental difficulties.
- ARC frames ELK as central to 'worst-case' alignment: solving it would provide safety guarantees even for highly capable, potentially deceptive systems.
- The report serves as ARC's public statement of research methodology and priorities, inviting community engagement and collaboration.
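The core difficulty can be made concrete with a minimal toy sketch, loosely modeled on the report's SmartVault-style scenario. This is an illustrative assumption-laden example, not code or formalism from the report: it shows how a reporter trained only against human-checkable labels is indistinguishable, on the training distribution, from a "human simulator" that merely predicts what the evaluator would say, even though the two diverge exactly when the AI knows more than the human.

```python
# Toy illustration of the ELK training difficulty.
# All names and the scenario setup below are hypothetical simplifications,
# not ARC's formal construction.
import random

random.seed(0)

def make_case(tampered: bool):
    """A SmartVault-style case: the diamond is truly present or not,
    and the camera either shows the truth or has been tampered with."""
    diamond_present = random.random() < 0.5
    camera_shows_diamond = True if tampered else diamond_present
    return {"diamond_present": diamond_present,
            "camera_shows_diamond": camera_shows_diamond}

# Reporter that reads out the predictor's latent knowledge directly.
def direct_reporter(case):
    return case["diamond_present"]

# Failure mode: a reporter that just simulates what a human judge
# would conclude from the observations they can check.
def human_simulator(case):
    return case["camera_shows_diamond"]

def human_label(case):
    # Human evaluators can only label cases using the camera feed.
    return case["camera_shows_diamond"]

# Training distribution: easy cases the human can verify (no tampering).
train = [make_case(tampered=False) for _ in range(1000)]
# Deployment distribution: cases where the AI knows more than the human.
deploy = [make_case(tampered=True) for _ in range(1000)]

def accuracy(reporter, cases, target):
    return sum(reporter(c) == target(c) for c in cases) / len(cases)

# Both reporters look identical under the human-generated training signal...
print("agreement with human labels on training cases:")
print("  direct reporter :", accuracy(direct_reporter, train, human_label))
print("  human simulator :", accuracy(human_simulator, train, human_label))

# ...but only one of them reports the truth once the camera is tampered with.
print("truthfulness on tampered cases:")
print("  direct reporter :", accuracy(direct_reporter, deploy, lambda c: c["diamond_present"]))
print("  human simulator :", accuracy(human_simulator, deploy, lambda c: c["diamond_present"]))
```

In this sketch both reporters achieve perfect agreement with the human labels on the training cases, so no amount of that training signal distinguishes them; they only come apart on tampered cases, where the human simulator confidently reports what the camera shows rather than what the predictor knows.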
Cited by 3 pages
| Page | Type | Quality |
|---|---|---|
| Eliciting Latent Knowledge (ELK) | Approach | 91.0 |
| AI Alignment Research Agendas | Crux | 69.0 |
| Sleeper Agent Detection | Approach | 66.0 |
Cached Content Preview
HTTP 200 | Fetched Mar 20, 2026 | 2 KB
ARC has published a report on [Eliciting Latent Knowledge](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?usp=sharing), an open problem which we believe is central to alignment. We think reading this report is the clearest way to understand what problems we are working on, how they fit into our plan for solving alignment in the worst case, and our research methodology.

The core difficulty we discuss is learning how to map between an AI's model of the world and a human's model. This is closely related to [ontology identification](https://arbital.greaterwrong.com/p/ontology_identification/) (and [other](https://www.lesswrong.com/posts/gQY6LrTWJNkTv8YJR/the-pointers-problem-human-values-are-a-function-of-humans) [similar](https://intelligence.org/files/AlignmentMachineLearning.pdf) [statements](https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1)). Our main contribution is to present many possible approaches to the problem and a more precise discussion of why it seems to be difficult and important.

The report is available [here](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?usp=sharing) as a Google document. If you're excited about this research, [we're hiring](https://www.lesswrong.com/posts/dLoK6KGcHAoudtwdo/arc-is-hiring)!

### Q&A

We're particularly excited about answering questions posted here throughout December. We welcome any questions no matter how basic or confused; we would love to help people understand what research we're doing and how we evaluate progress in enough detail that they could start to do it themselves.

_Comment via [AI Alignment Forum](https://alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge) or [LessWrong](https://www.lesswrong.com/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge)._

_Thanks to María Gutiérrez-Rojas for the illustrations in this piece (the good ones, blame us for the ugly diagrams). Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments._
Resource ID: 5efa917a52b443a1 | Stable ID: YTJmOGUyYj