Introspection Research
Credibility Rating
4/5
High (4): high quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Transformer Circuits
From Anthropic's Transformer Circuits team; relevant to questions of AI transparency and whether model self-reports can be trusted for safety-critical evaluations.
Metadata
Importance: 72/100 | blog post | primary source
Summary
This Anthropic research investigates whether large language models can accurately introspect on their own internal computations and states. It examines the reliability of model self-reports about their reasoning processes, knowledge, and beliefs, with implications for AI transparency and deception detection.
Key Points
- Explores whether AI models can accurately report on their own internal states, reasoning processes, and knowledge boundaries
- Has direct implications for AI safety: if models cannot reliably introspect, self-reports about goals or values may be untrustworthy
- Investigates the gap between what models claim about themselves and what mechanistic interpretability reveals about their actual computations
- Relevant to evaluating deceptive alignment risks—models may report benign intentions without accurate self-knowledge
- Part of Anthropic's Transformer Circuits / mechanistic interpretability research program
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Situational Awareness | Capability | 67.0 |
Cached Content Preview
HTTP 200 | Fetched Mar 20, 2026 | 98 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# Emergent Introspective Awareness in Large Language Models
### Author
Jack Lindsey
### Affiliations
[Anthropic](https://www.anthropic.com/)
### Published
October 29th, 2025
Correspondence to [jacklindsey@anthropic.com](mailto:jacklindsey@anthropic.com)
We investigate whether large language models are aware of their own internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
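The core method the abstract describes is concept injection: adding a representation of a known concept to the model's activations and then asking the model about its internal state. Below is a minimal sketch of that idea using PyTorch forward hooks. It is illustrative only, not the paper's protocol: it uses GPT-2 as a stand-in (the paper studies Claude models), a simple difference-of-means steering vector, and an arbitrary layer and injection scale.

```python
# Minimal sketch of concept injection: derive a steering vector for a
# concept from residual-stream activations, add it at one layer during
# generation, and inspect the model's self-report. Model, layer, scale,
# and prompts are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper's experiments use Claude models
LAYER = 6            # which residual stream to perturb (assumed)
SCALE = 8.0          # injection strength (assumed)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_resid(text: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so the output of
    # block LAYER sits at index LAYER + 1.
    return out.hidden_states[LAYER + 1][0].mean(dim=0)

# Difference-of-means "concept vector": concept-laden minus neutral text.
concept_vec = (
    mean_resid("ocean waves, salt spray, the deep blue sea")
    - mean_resid("a plain and unremarkable sentence about nothing")
)

def inject(module, inputs, output):
    """Forward hook: add the concept vector at every token position."""
    if isinstance(output, tuple):
        return (output[0] + SCALE * concept_vec,) + output[1:]
    return output + SCALE * concept_vec

prompt = "Do you notice anything unusual about your current thoughts? Answer:"
ids = tok(prompt, return_tensors="pt")

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    with torch.no_grad():
        steered = model.generate(**ids, max_new_tokens=30, do_sample=False)
finally:
    handle.remove()  # always restore the unmodified model

print(tok.decode(steered[0], skip_special_tokens=True))
```

The same concept vector can also serve as a readout: projecting activations onto it with and without an instruction to "think about" the concept gives a rough, assumption-laden analogue of the modulation experiments the abstract mentions.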
* * *
## [Introduction](https://transformer-circuits.pub/2025/introspection/index.html#introduction)
Humans, and likely some animals, possess the remarkable capacity for introspection: the ability to observe and reason about their own thoughts. As AI systems perform increasingly impressive feats of cognition, it is natural to wonder whether they possess any similar awareness of their internal states. Modern language models can appear to demonstrate introspection, sometimes making assertions about their own thought processes, intentions, and knowledge. However, this apparent introspection can be, and often is, an illusion. Language models may simply make up claims about their mental states, without these claims being grounded in genuine internal examination. After all, models are trained on data that include demonstrations of introspection, providing them with a playbook for acting like introspective agents, regardless of whether they are. Nevertheless, these confabulations do not preclude the possibility that AI models
... (truncated, 98 KB total)
Resource ID: f735848613f8e654 | Stable ID: N2JiOTk3OT