Introspection Research
Credibility Rating
4/5
High (4): high quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Transformer Circuits
From Anthropic's Transformer Circuits team; relevant to questions of AI transparency and whether model self-reports can be trusted for safety-critical evaluations.
Metadata
Importance: 72/100 | blog post | primary source
Summary
This Anthropic research investigates whether large language models can accurately introspect on their own internal computations and states. It examines the reliability of model self-reports about their reasoning processes, knowledge, and beliefs, with implications for AI transparency and deception detection.
Key Points
- Explores whether AI models can accurately report on their own internal states, reasoning processes, and knowledge boundaries
- Has direct implications for AI safety: if models cannot reliably introspect, self-reports about goals or values may be untrustworthy
- Investigates the gap between what models claim about themselves and what mechanistic interpretability reveals about their actual computations
- Relevant to evaluating deceptive alignment risks—models may report benign intentions without accurate self-knowledge
- Part of Anthropic's Transformer Circuits / mechanistic interpretability research program
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Situational Awareness | Capability | 67.0 |
Cached Content Preview
HTTP 200 | Fetched Mar 20, 2026 | 98 KB
[Transformer Circuits Thread](https://transformer-circuits.pub/)
# Emergent Introspective Awareness in Large Language Models
### Author
Jack Lindsey
### Affiliations
[Anthropic](https://www.anthropic.com/)
### Published
October 29th, 2025
Correspondence to [jacklindsey@anthropic.com](mailto:jacklindsey@anthropic.com)
We investigate whether large language models are aware of their own internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
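The core method the abstract describes is concept injection: adding a representation of a known concept to the model's activations and then asking the model about its internal state. Below is a minimal sketch of that idea using PyTorch forward hooks. It is illustrative only, not the paper's protocol: it uses GPT-2 as a stand-in (the paper studies Claude models), a simple difference-of-means steering vector, and an arbitrary layer and injection scale.

```python
# Minimal sketch of concept injection: derive a steering vector for a
# concept from residual-stream activations, add it at one layer during
# generation, and inspect the model's self-report. Model, layer, scale,
# and prompts are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper's experiments use Claude models
LAYER = 6            # which residual stream to perturb (assumed)
SCALE = 8.0          # injection strength (assumed)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_resid(text: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so the output of
    # block LAYER sits at index LAYER + 1.
    return out.hidden_states[LAYER + 1][0].mean(dim=0)

# Difference-of-means "concept vector": concept-laden minus neutral text.
concept_vec = (
    mean_resid("ocean waves, salt spray, the deep blue sea")
    - mean_resid("a plain and unremarkable sentence about nothing")
)

def inject(module, inputs, output):
    """Forward hook: add the concept vector at every token position."""
    if isinstance(output, tuple):
        return (output[0] + SCALE * concept_vec,) + output[1:]
    return output + SCALE * concept_vec

prompt = "Do you notice anything unusual about your current thoughts? Answer:"
ids = tok(prompt, return_tensors="pt")

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    with torch.no_grad():
        steered = model.generate(**ids, max_new_tokens=30, do_sample=False)
finally:
    handle.remove()  # always restore the unmodified model

print(tok.decode(steered[0], skip_special_tokens=True))
```

The same concept vector can also serve as a readout: projecting activations onto it with and without an instruction to "think about" the concept gives a rough, assumption-laden analogue of the modulation experiments the abstract mentions.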
* * *
## [Introduction](https://transformer-circuits.pub/2025/introspection/index.html#introduction)
Humans, and likely some animals, possess the remarkable capacity for introspection: the ability to observe and reason about their own thoughts. As AI systems perform increasingly impressive feats of cognition, it is natural to wonder whether they possess any similar awareness of their internal states. Modern language models can appear to demonstrate introspection, sometimes making assertions about their own thought processes, intentions, and knowledge. However, this apparent introspection can be, and often is, an illusion. Language models may simply make up claims about their mental states, without these claims being grounded in genuine internal examination. After all, models are trained on data that include demonstrations of introspection, providing them with a playbook for acting like introspective agents, regardless of whether they are. Nevertheless, these confabulations do not preclude the possibility that AI models
... (truncated, 98 KB total)
Resource ID: f735848613f8e654 | Stable ID: N2JiOTk3OT