Meng et al., 2023
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper investigates mechanistic interpretability of transformer language models by identifying and analyzing localized computations responsible for factual recall, enabling direct editing of model factual associations—crucial for understanding and controlling model behavior.
Paper Details
Metadata
Abstract
We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at https://rome.baulab.info/
Summary
This paper investigates how transformer language models store and recall factual information, discovering that factual associations are encoded in localized, editable computations within middle-layer feed-forward modules. The authors develop causal intervention techniques to identify neurons critical for factual predictions and introduce Rank-One Model Editing (ROME), a method for directly modifying model weights to update specific facts. ROME demonstrates effectiveness on both standard model-editing benchmarks and a new counterfactual dataset, maintaining both specificity and generalization better than existing approaches. The findings suggest that factual knowledge can be directly manipulated through targeted modifications of feed-forward computations.
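The rank-one edit at the core of ROME can be sketched in a few lines. The sketch below is a simplified illustration only: it assumes an identity key covariance (the actual method derives its update from second-moment statistics of the keys), and uses toy dimensions with random vectors standing in for the real key/value representations.

```python
import numpy as np

# Toy MLP projection matrix W mapping "keys" (subject representations)
# to "values" (recalled facts). Dimensions and vectors are illustrative.
rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))

k = rng.normal(size=d_in)       # key vector for the subject being edited
v_new = rng.normal(size=d_out)  # desired output (the new fact) for that key

# Rank-one update: Delta = (v_new - W k) k^T / (k^T k), so that
# (W + Delta) k == v_new while Delta vanishes on directions
# orthogonal to k (minimal perturbation under this simplification).
delta = np.outer(v_new - W @ k, k) / (k @ k)
W_edited = W + delta
```

After the edit, `W_edited @ k` equals `v_new` exactly, while any vector orthogonal to `k` is mapped unchanged, which is the intuition behind ROME's specificity/generalization trade-off.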
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Goal Misgeneralization Probability Model | Analysis | 61.0 |
Cached Content Preview
# Locating and Editing Factual Associations in GPT
Kevin Meng
MIT CSAIL
&David Bau∗
Northeastern University
&Alex Andonian
MIT CSAIL
&Yonatan Belinkov
Technion – IIT
Equal contribution. Correspondence to mengk@mit.edu, davidbau@northeastern.edu. Supported by the Viterbi Fellowship in the Center for Computer Engineering at the Technion.
###### Abstract
We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model’s factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task. We also evaluate ROME on a new dataset of difficult counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available
at [https://rome.baulab.info/](https://rome.baulab.info/).
## 1 Introduction
Where does a large language model store its facts? In this paper, we report evidence that factual associations in GPT correspond to a localized computation that can be directly edited.
Large language models can predict factual statements about the world (Petroni et al., 2019; Jiang et al., 2020; Roberts et al., 2020). For example, given the prefix "_The Space Needle is located in the city of_," GPT will reliably predict the true answer: "_Seattle_" (Figure 1a). Factual knowledge has been observed to emerge in both autoregressive GPT models (Radford et al., 2019; Brown et al., 2020) and masked BERT models (Devlin et al., 2019).
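The causal intervention the abstract describes (corrupt the input, then restore individual hidden states from a clean run to see which ones mediate the prediction) can be illustrated with a toy sketch. The 2-layer network, sizes, and noise scale below are illustrative assumptions, not the paper's actual GPT setup.

```python
import numpy as np

# Toy causal-tracing sketch: corrupt the input, then restore one
# intermediate hidden state from the clean run and check whether it
# recovers the clean prediction.
rng = np.random.default_rng(1)
d = 16
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

def run(x, patch_h1=None):
    # Forward pass; optionally overwrite the layer-1 state (the intervention).
    h1 = np.tanh(W1 @ x)
    if patch_h1 is not None:
        h1 = patch_h1
    return W2 @ h1, h1

x_clean = rng.normal(size=d)                         # "subject" embedding
out_clean, h1_clean = run(x_clean)

x_corrupt = x_clean + rng.normal(scale=3.0, size=d)  # noise the subject
out_corrupt, _ = run(x_corrupt)

# Restoring the clean layer-1 state on the corrupted input recovers
# the clean output: that layer fully mediates the prediction here.
out_restored, _ = run(x_corrupt, patch_h1=h1_clean)
```

In this degenerate toy, layer 1 determines the output completely, so restoring it recovers the clean prediction exactly; in the paper, the analogous restoration is performed per layer and per token, and the size of the recovery (the "indirect effect") localizes where factual recall happens.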
In this paper, we investigate how such factual associations are stored within GPT-like autoregressive transformer models. Although many of the larges
... (truncated, 98 KB total)