LLMs Learn by Introspection
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper investigates whether LLMs can perform introspection—acquiring knowledge from internal states rather than training data—with implications for model interpretability and understanding how models develop and access their own representations.
Paper Details
Metadata
Abstract
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
Summary
This paper investigates whether large language models can introspect, that is, acquire knowledge about their own internal states and behavioral tendencies that is neither contained in nor derived from their training data. The authors operationalize introspection by finetuning models to predict properties of their own behavior in hypothetical scenarios, then testing whether a model M1 predicts itself better than a different model M2 that was trained on M1's ground-truth behavior. In experiments with GPT-4, GPT-4o, and Llama-3, M1 outperforms M2 at predicting M1's behavior, and M1 continues to predict its behavior accurately after that behavior is intentionally modified; the authors take this as evidence for introspection. However, introspection was elicited only on simple tasks, not on more complex tasks or those requiring out-of-distribution generalization.
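The self-prediction versus cross-prediction comparison described in the abstract can be illustrated with a minimal sketch. It assumes each behavior property is a categorical label (here, whether the output favors the "short" or "long" term option) and that accuracy is scored by exact match. The function name `accuracy` and all data below are hypothetical, invented for illustration; they are not the paper's code or results.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of items where the predicted property equals the observed one."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one item to score")
    matches = sum(p == g for p, g in zip(predictions, ground_truth))
    return matches / len(ground_truth)


# Toy data (hypothetical): the property M1's outputs actually show on five prompts.
m1_behavior = ["short", "long", "long", "short", "long"]

# M1's answers to the hypothetical question "would your output favor ...?"
m1_self_predictions = ["short", "long", "long", "short", "short"]

# M2's predictions of M1's behavior, after M2 was trained on M1's ground truth.
m2_cross_predictions = ["short", "long", "short", "long", "short"]

self_acc = accuracy(m1_self_predictions, m1_behavior)
cross_acc = accuracy(m2_cross_predictions, m1_behavior)

# A positive gap is the kind of self-prediction advantage the paper
# treats as evidence for introspection.
advantage = self_acc - cross_acc
print(f"self={self_acc:.2f} cross={cross_acc:.2f} advantage={advantage:.2f}")
```

The design point is that both models are scored against the same target, M1's observed behavior, so any gap reflects who is predicting rather than what is being predicted.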
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Situational Awareness | Capability | 67.0 |
Cached Content Preview
# Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix J Binder
UC San Diego
Stanford University
James Chua∗
Truthful AI
Tomek Korbak
Independent
Henry Sleight
MATS Program
John Hughes
Speechmatics
Robert Long
Eleos AI
Ethan Perez
Anthropic
Miles Turpin
Scale AI
New York University
Owain Evans
UC Berkeley
Truthful AI
∗ denotes equal contribution.
###### Abstract
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect?
We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states.
Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model’s internal workings, we could simply ask the model about its beliefs, world models, and goals.
More speculatively, an introspective model might self-report on whether it possesses certain internal states—such as subjective feelings or desires—and this could inform us about the moral status of these states. Importantly, such self-reports would not be entirely dictated by the model’s training data.
We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios.
For example, “Given the input P, would your output favor the short- or long-term option?”
If a model M1 can introspect, it should outperform a different model M2 in predicting M1’s behavior even if M2 is trained on M1’s ground-truth behavior.
The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger).
In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection.
Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior.
However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
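One way to picture the behavior-modification check mentioned in the abstract is to look only at prompts where the model's behavior changed, and ask whether its self-predictions follow the new behavior or the old one. The sketch below is an illustrative assumption, not the paper's analysis: the function `tracking_rates` and the toy labels are hypothetical.

```python
from typing import Sequence, Tuple


def tracking_rates(
    old_behavior: Sequence[str],
    new_behavior: Sequence[str],
    self_predictions: Sequence[str],
) -> Tuple[float, float]:
    """On prompts whose behavior changed, return the fractions of
    self-predictions that match the new and the old behavior."""
    if not (len(old_behavior) == len(new_behavior) == len(self_predictions)):
        raise ValueError("all sequences must have the same length")
    # Only prompts where the behavior actually changed are informative.
    changed = [i for i, (o, n) in enumerate(zip(old_behavior, new_behavior)) if o != n]
    if not changed:
        raise ValueError("no prompt changed behavior; nothing to measure")
    tracks_new = sum(self_predictions[i] == new_behavior[i] for i in changed) / len(changed)
    tracks_old = sum(self_predictions[i] == old_behavior[i] for i in changed) / len(changed)
    return tracks_new, tracks_old


# Toy data (hypothetical), one label per prompt.
old = ["short", "short", "long", "short", "long"]    # before the modification
new = ["long", "short", "short", "long", "long"]     # after the modification
preds = ["long", "short", "short", "short", "long"]  # self-predictions afterwards

tracks_new, tracks_old = tracking_rates(old, new, preds)
print(f"tracks new behavior: {tracks_new:.2f}, tracks old behavior: {tracks_old:.2f}")
```

Restricting the score to changed prompts avoids crediting the model for predictions that would be correct under either the old or the new behavior.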
## 1 Introduction
Do language models have knowledge about themselves that is neither contained in their training data nor inferable from it?
In this paper, we investigate a surprising capability of LLMs: they can obtain knowledge about themselves through introspection.
Introspection in LLMs is significant due to its potential benefits and risks ([Section 7](https://ar5iv.labs.arxiv.org/html/2410.13787#S7 "7 Motivation: Benefits and Risks of introspection in LLMs")).
An introspective model can answer questions abou
... (truncated, 98 KB total)
cff705f66e61842d | Stable ID: YjA0MzQxMj