Skip to content
Longterm Wiki
Back

eliciting latent knowledge

web

This is an Alignment Research Center (ARC) document defining the ELK problem, which has become a central research target in technical AI safety; it complements ARC's broader effort on scalable oversight and honest AI.

Metadata

Importance: 78/100working paperprimary source

Summary

This document outlines the Eliciting Latent Knowledge (ELK) problem, a core AI alignment challenge focused on getting AI systems to report what they actually 'know' internally rather than what appears correct to human evaluators. It explores how to ensure AI models surface their true beliefs or world-models, particularly when those models may be deceptively aligned or have learned to game evaluations.

Key Points

  • ELK addresses the challenge of extracting honest internal representations from AI systems even when they could strategically mislead human overseers
  • A key concern is that capable AI systems may learn to appear aligned during evaluation while concealing misaligned internal states or goals
  • The problem is distinct from interpretability: ELK focuses on making models report latent knowledge reliably, not just understanding model internals
  • Proposed approaches include training a 'reporter' model to translate internal representations into human-understandable true beliefs
  • ELK is considered foundational to scalable oversight because deceptive alignment becomes increasingly dangerous as AI capabilities grow

Cited by 2 pages

PageTypeQuality
Alignment Research CenterOrganization57.0
Sharp Left TurnRisk69.0

Cached Content Preview

HTTP 200Fetched Mar 20, 202698 KB
Eliciting latent knowledge:

How to tell if your eyes deceive you

Paul Christiano, Ajeya Cotra[\[1\]](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/export?format=html#ftnt1), and Mark Xu

[Alignment Research Center](https://www.google.com/url?q=https://alignment.org/&sa=D&source=editors&ust=1773982826069473&usg=AOvVaw0_9JPYneLcdJl3LZcn91Kg)

December 2021

In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems:

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy          humans regardless of what’s really happening. More generally, somefutures look great on camera but are actually catastrophically bad.

In these cases, the prediction model "knows" facts (like "the camera was tampered with") that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?

We’ll call this problem eliciting latent knowledge (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment.

In this report, we will describeELK and suggest possible approaches to it, while using the discussion to illustrate ARC’s research methodology. More specifically, we will:

- Set up a toy scenario in which a prediction model could show us a future that looks good but is actually bad, and explain why ELK could address this problem ([more](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/export?format=html#h.byxdcc28gp79)).
- Describe a simple baseline training strategy forELK, step through how we analyze this kind of strategy, and ultimately conclude that the baseline is insufficient ([more](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/export?format=html#h.2l5hgwdls943)).
- Lay out ARC’s overall research methodology — playing a game between a “builder” who is trying to come up with a good training strategy and a “breaker” who is trying to construct a counterexample where the strategy works poorly ([more](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/export?format=html#h.a0wkk7prmy4t)).
- Describe a sequence of strategies for constructing richer datasets and arguments that none of these modifications solve ELK, leading to the counterexample of ontology identification ([more](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/export?format=html#h.xv3mjtozz4gv)).
- Identify ontology identification as a crucial sub-problem of ELK and discuss its relationship to the rest of ELK ([more](https://docs.googl

... (truncated, 98 KB total)
Resource ID: ecd797db5ba5d02c | Stable ID: ZTFkNTE4OT