Meng et al., 2023
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper investigates mechanistic interpretability of transformer language models by identifying and analyzing localized computations responsible for factual recall, enabling direct editing of model factual associations—crucial for understanding and controlling model behavior.
Paper Details
Metadata
Abstract
We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at https://rome.baulab.info/
Summary
This paper investigates how transformer language models store and recall factual information, discovering that factual associations are encoded in localized, editable computations within middle-layer feed-forward modules. The authors develop causal intervention techniques to identify neurons critical for factual predictions and introduce Rank-One Model Editing (ROME), a method for directly modifying model weights to update specific facts. ROME demonstrates effectiveness on both standard model-editing benchmarks and a new counterfactual dataset, maintaining both specificity and generalization better than existing approaches. The findings suggest that factual knowledge can be directly manipulated through targeted modifications of feed-forward computations.
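The rank-one edit at the core of ROME can be sketched in a few lines. The sketch below is a simplified illustration only: it assumes an identity key covariance (the actual method derives its update from second-moment statistics of the keys), and uses toy dimensions with random vectors standing in for the real key/value representations.

```python
import numpy as np

# Toy MLP projection matrix W mapping "keys" (subject representations)
# to "values" (recalled facts). Dimensions and vectors are illustrative.
rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))

k = rng.normal(size=d_in)       # key vector for the subject being edited
v_new = rng.normal(size=d_out)  # desired output (the new fact) for that key

# Rank-one update: Delta = (v_new - W k) k^T / (k^T k), so that
# (W + Delta) k == v_new while Delta vanishes on directions
# orthogonal to k (minimal perturbation under this simplification).
delta = np.outer(v_new - W @ k, k) / (k @ k)
W_edited = W + delta
```

After the edit, `W_edited @ k` equals `v_new` exactly, while any vector orthogonal to `k` is mapped unchanged, which is the intuition behind ROME's specificity/generalization trade-off.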
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Goal Misgeneralization Probability Model | Analysis | 61.0 |
Cached Content Preview
# Locating and Editing Factual Associations in GPT
Kevin Meng
MIT CSAIL
&David Bau∗
Northeastern University
&Alex Andonian
MIT CSAIL
&Yonatan Belinkov
Technion – IIT
Equal contribution. Correspondence to mengk@mit.edu, davidbau@northeastern.edu. Supported by the Viterbi Fellowship in the Center for Computer Engineering at the Technion.
###### Abstract
We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model’s factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task. We also evaluate ROME on a new dataset of difficult counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available
at [https://rome.baulab.info/](https://rome.baulab.info/).
## 1 Introduction
Where does a large language model store its facts? In this paper, we report evidence that factual associations in GPT correspond to a localized computation that can be directly edited.
Large language models can predict factual statements about the world (Petroni et al., 2019; Jiang et al., 2020; Roberts et al., 2020). For example, given the prefix "_The Space Needle is located in the city of_," GPT will reliably predict the true answer: "_Seattle_" (Figure 1a). Factual knowledge has been observed to emerge in both autoregressive GPT models (Radford et al., 2019; Brown et al., 2020) and masked BERT models (Devlin et al., 2019).
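The causal intervention the abstract describes (corrupt the input, then restore individual hidden states from a clean run to see which ones mediate the prediction) can be illustrated with a toy sketch. The 2-layer network, sizes, and noise scale below are illustrative assumptions, not the paper's actual GPT setup.

```python
import numpy as np

# Toy causal-tracing sketch: corrupt the input, then restore one
# intermediate hidden state from the clean run and check whether it
# recovers the clean prediction.
rng = np.random.default_rng(1)
d = 16
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

def run(x, patch_h1=None):
    # Forward pass; optionally overwrite the layer-1 state (the intervention).
    h1 = np.tanh(W1 @ x)
    if patch_h1 is not None:
        h1 = patch_h1
    return W2 @ h1, h1

x_clean = rng.normal(size=d)                         # "subject" embedding
out_clean, h1_clean = run(x_clean)

x_corrupt = x_clean + rng.normal(scale=3.0, size=d)  # noise the subject
out_corrupt, _ = run(x_corrupt)

# Restoring the clean layer-1 state on the corrupted input recovers
# the clean output: that layer fully mediates the prediction here.
out_restored, _ = run(x_corrupt, patch_h1=h1_clean)
```

In this degenerate toy, layer 1 determines the output completely, so restoring it recovers the clean prediction exactly; in the paper, the analogous restoration is performed per layer and per token, and the size of the recovery (the "indirect effect") localizes where factual recall happens.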
In this paper, we investigate how such factual associations are stored within GPT-like autoregressive transformer models. Although many of the larges
... (truncated, 98 KB total)