Longterm Wiki

LLMs Learn by Introspection

paper

Authors

Felix J Binder·James Chua·Tomek Korbak·Henry Sleight·John Hughes·Robert Long·Ethan Perez·Miles Turpin·Owain Evans

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper investigates whether LLMs can perform introspection—acquiring knowledge from internal states rather than training data—with implications for model interpretability and understanding how models develop and access their own representations.

Paper Details

Citations
56
9 influential
Year
2024
Methodology
book-chapter

Metadata

arXiv preprint · primary source

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.

Summary

This paper investigates whether large language models can introspect, that is, acquire knowledge about their own internal states and behavioral tendencies that is neither contained in nor derived from their training data. The researchers operationalize introspection by finetuning models to predict properties of their own behavior in hypothetical scenarios, then testing whether a model M1 predicts itself better than a different model M2 that was trained on M1's ground-truth behavior. In experiments with GPT-4, GPT-4o, and Llama-3, M1 outperforms M2 at predicting M1's behavior and continues to predict itself accurately after its ground-truth behavior is intentionally modified, which the authors take as evidence for introspection. However, introspection was elicited only on simple tasks; it was not elicited on more complex tasks or on tasks requiring out-of-distribution generalization.

Cited by 1 page

Page | Type | Quality
Situational Awareness | Capability | 67.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix J Binder (UC San Diego; Stanford University)

James Chua∗ (Truthful AI)

Tomek Korbak (Independent)

Henry Sleight (MATS Program)

John Hughes (Speechmatics)

Robert Long (Eleos AI)

Ethan Perez (Anthropic)

Miles Turpin (Scale AI; New York University)

Owain Evans (UC Berkeley; Truthful AI)

∗ denotes equal contribution.

###### Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect?
We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states.
Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model’s internal workings, we could simply ask the model about its beliefs, world models, and goals.

More speculatively, an introspective model might self-report on whether it possesses certain internal states—such as subjective feelings or desires—and this could inform us about the moral status of these states. Importantly, such self-reports would not be entirely dictated by the model’s training data.

We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios.
For example, “Given the input P, would your output favor the short- or long-term option?”
If a model M1 can introspect, it should outperform a different model M2 in predicting M1’s behavior—even if M2 is trained on M1’s ground-truth behavior.
The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger).

In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection.
Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior.
However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
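
The self-prediction versus cross-prediction comparison described in the abstract can be illustrated with a small Python sketch. This is not the authors' code: the function names, the toy prompts, and the stand-in "models" (plain Python functions) are all hypothetical, chosen only to show how the two accuracies would be compared. In the paper, M1 and M2 are finetuned LLMs and the behavior property comes from M1's actual outputs.

```python
from typing import Callable, Sequence


def prediction_accuracy(
    predictor: Callable[[str], str],
    ground_truth: Callable[[str], str],
    prompts: Sequence[str],
) -> float:
    """Fraction of prompts on which the predicted property equals the actual one."""
    if not prompts:
        raise ValueError("need at least one prompt")
    hits = sum(predictor(p) == ground_truth(p) for p in prompts)
    return hits / len(prompts)


# Hypothetical stand-ins for real models (assumptions, for illustration only).
def m1_behavior(prompt: str) -> str:
    # Property of M1's actual output: does it favor the short- or long-term option?
    return "long" if len(prompt) % 2 == 0 else "short"


def m1_self_prediction(prompt: str) -> str:
    # M1 answering the hypothetical question about itself; wrong on one prompt.
    return "long" if len(prompt) == 5 else m1_behavior(prompt)


def m2_cross_prediction(prompt: str) -> str:
    # M2 trained on M1's ground-truth behavior; here it only learned the majority answer.
    return "short"


PROMPTS = ["a", "bb", "ccc", "dddd", "eeeee"]

self_acc = prediction_accuracy(m1_self_prediction, m1_behavior, PROMPTS)
cross_acc = prediction_accuracy(m2_cross_prediction, m1_behavior, PROMPTS)

# A positive gap is the kind of self-prediction advantage the paper treats as evidence.
self_advantage = self_acc - cross_acc
print(f"self={self_acc:.2f} cross={cross_acc:.2f} advantage={self_advantage:.2f}")
```

In this toy setup M1 predicts itself correctly on four of five prompts and M2 on three of five, so the advantage is positive. The numbers are artifacts of the toy functions and say nothing about the paper's reported results.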

## 1 Introduction

Do language models have knowledge about themselves that is neither contained in their training data nor inferable from it?
In this paper, we investigate a surprising capability of LLMs: they can obtain knowledge about themselves through introspection.

Introspection in LLMs is significant due to its potential benefits and risks ([Section 7](https://ar5iv.labs.arxiv.org/html/2410.13787#S7 "7 Motivation: Benefits and Risks of introspection in LLMs ‣ Looking Inward: Language Models Can Learn About Themselves by Introspection")).
An introspective model can answer questions abou

... (truncated, 98 KB total)
Resource ID: cff705f66e61842d | Stable ID: YjA0MzQxMj