LLMs Learn by Introspection

paper

2024·arXiv·arxiv.org/abs/2410.13787

Authors

Felix J Binder·James Chua·Tomek Korbak·Henry Sleight·John Hughes·Robert Long·Ethan Perez·Miles Turpin·Owain Evans

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper investigates whether LLMs can perform introspection—acquiring knowledge from internal states rather than training data—with implications for model interpretability and understanding how models develop and access their own representations.

Paper Details

Citations

9 influential

Year

2024

Methodology

book-chapter

Metadata

arxiv preprint · primary source

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.

Summary

This paper investigates whether large language models can introspect, that is, acquire knowledge about their own internal states and behavioral tendencies that is neither contained in nor derived from their training data. The researchers operationalize introspection by finetuning models to predict properties of their own behavior in hypothetical scenarios, then testing whether a model M1 predicts itself better than a different model M2 that was trained on M1's ground-truth behavior. In experiments with GPT-4, GPT-4o, and Llama-3, M1 outperforms M2 at predicting M1's behavior, and M1 continues to predict itself accurately even after its ground-truth behavior is intentionally modified, which the authors take as evidence for introspection. However, introspection was elicited only on simple tasks; it was not elicited on more complex tasks or on tasks requiring out-of-distribution generalization.
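The self-prediction versus cross-prediction comparison can be sketched in a few lines of Python. This is a minimal illustration and not the authors' evaluation code: the function names, the string labels, and the toy data are all assumptions made for this example, and the numbers are not results from the paper. It scores how often each predictor matches M1's observed behavior and reports the gap.

```python
from typing import Sequence


def prediction_accuracy(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of prompts where the predicted property matches the observed one."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one prompt")
    matches = sum(p == g for p, g in zip(predictions, ground_truth))
    return matches / len(ground_truth)


def self_prediction_advantage(
    m1_behavior: Sequence[str],
    m1_self_predictions: Sequence[str],
    m2_cross_predictions: Sequence[str],
) -> float:
    """Accuracy of M1 predicting itself minus accuracy of M2 predicting M1.

    A positive value means M1 predicts its own behavior better than M2 does.
    """
    self_acc = prediction_accuracy(m1_self_predictions, m1_behavior)
    cross_acc = prediction_accuracy(m2_cross_predictions, m1_behavior)
    return self_acc - cross_acc


# Hypothetical toy data (NOT from the paper): a property of M1's actual output
# on five prompts, e.g. whether it favored the short- or long-term option.
M1_BEHAVIOR = ["short", "long", "short", "short", "long"]
# M1's answers to the hypothetical question about its own behavior.
M1_SELF_PREDICTIONS = ["short", "long", "short", "long", "long"]
# M2's predictions of M1's behavior after training on M1's ground truth.
M2_CROSS_PREDICTIONS = ["short", "short", "short", "long", "long"]

if __name__ == "__main__":
    gap = self_prediction_advantage(M1_BEHAVIOR, M1_SELF_PREDICTIONS, M2_CROSS_PREDICTIONS)
    print(f"self-prediction advantage: {gap:+.2f}")
```

In a real evaluation the three lists would come from model outputs on held-out prompts, and a positive gap would need to be checked for statistical significance before being read as evidence for introspection.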

Cited by 1 page

Page	Type	Quality
Situational Awareness	Capability	67.0

Cached Content Preview

HTTP 200 · Fetched Apr 9, 2026 · 98 KB

[2410.13787] Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix J Binder (UC San Diego; Stanford University)
James Chua ∗ (Truthful AI)
Tomek Korbak (Independent)
Henry Sleight (MATS Program)
John Hughes (Speechmatics)
Robert Long (Eleos AI)
Ethan Perez (Anthropic)
Miles Turpin (Scale AI; New York University)
Owain Evans (UC Berkeley; Truthful AI)

∗ denotes equal contribution.
 

 
 Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect?
We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states.
Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model’s internal workings, we could simply ask the model about its beliefs, world models, and goals.

 More speculatively, an introspective model might self-report on whether it possesses certain internal states—such as subjective feelings or desires—and this could inform us about the moral status of these states. Importantly, such self-reports would not be entirely dictated by the model’s training data.

 We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios.
For example, “Given the input P, would your output favor the short- or long-term option?”
If a model M1 can introspect, it should outperform a different model M2 in predicting M1’s behavior, even if M2 is trained on M1’s ground-truth behavior.
The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger).

In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection.
Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior.
However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.

 
 
 
 1 Introduction

 
 Do language models have knowledge about themselves that is neither contained in their training data nor inferable from it?
In this paper, we investigate a surprising capability of LLMs: they can obtain knowledge about themselves through introspection.

 
 
 Introspection in LLMs is significant due to its potential benefits and risks ( Section 7 ).
A

... (truncated, 98 KB total)

Resource ID: cff705f66e61842d | Stable ID: sid_n1DUlDFYLQ