Longterm Wiki

Representation Engineering: A Top-Down Approach to AI Transparency

paper

Authors

Andy Zou·Long Phan·Sarah Chen·James Campbell·Phillip Guo·Richard Ren·Alexander Pan·Xuwang Yin·Mantas Mazeika·Ann-Kathrin Dombrowski·Shashwat Goel·Nathaniel Li·Michael J. Byun·Zifan Wang·Alex Mallen·Steven Basart·Sanmi Koyejo·Dawn Song·Matt Fredrikson·J. Zico Kolter·Dan Hendrycks

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper introduces representation engineering, a method for enhancing AI transparency by analyzing and manipulating population-level representations in deep neural networks, directly addressing the interpretability and control challenges central to AI safety.

Paper Details

Citations
831
116 influential
Year
2023

Metadata

arXiv preprint · primary source

Abstract

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

Summary

This paper introduces representation engineering (RepE), a top-down approach to AI transparency that analyzes population-level representations in deep neural networks rather than individual neurons. Drawing from cognitive neuroscience, RepE provides methods for monitoring and manipulating high-level cognitive phenomena in large language models. The authors demonstrate that RepE techniques can effectively address safety-relevant problems including honesty, harmlessness, and power-seeking behavior, offering a promising direction for improving AI system transparency and control.
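The monitoring-and-control recipe summarized above can be sketched in a few lines: collect paired stimuli that differ in a target concept, read out a concept direction from the model's hidden states, then add that direction back into the activations to steer behavior. The sketch below illustrates this on synthetic activations using a simple mean-difference readout; the paper's Linear Artificial Tomography typically uses PCA on real LLM hidden states, and all data and variable names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy sketch of RepE-style representation reading and control,
# using synthetic "hidden states" instead of a real LLM.
rng = np.random.default_rng(0)
d = 64  # hidden-state dimensionality (toy)

# Assume a latent concept direction (e.g. "honesty") exists in activation space.
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Paired stimuli: hidden states for contrasting (e.g. honest vs. dishonest) prompts.
n = 200
honest = rng.normal(size=(n, d)) + 1.5 * true_direction
dishonest = rng.normal(size=(n, d)) - 1.5 * true_direction

# Reading: the mean difference between paired activations recovers
# the concept direction (a simplified stand-in for PCA-based LAT).
diffs = honest - dishonest
reading_vec = diffs.mean(axis=0)
reading_vec /= np.linalg.norm(reading_vec)

# The recovered direction should align with the true one (up to sign).
alignment = abs(reading_vec @ true_direction)
print(f"alignment with true direction: {alignment:.3f}")

# Control: shift a hidden state along the reading vector to steer behavior,
# analogous to adding a steering vector at an intermediate layer.
h = rng.normal(size=d)
h_steered = h + 4.0 * reading_vec
print(f"concept projection before: {h @ true_direction:.2f}, "
      f"after: {h_steered @ true_direction:.2f}")
```

In the real setting, `honest`/`dishonest` would be hidden states extracted from a transformer layer while it processes contrastive prompts, and the steering step would be applied as a hook during generation.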

Cited by 5 pages

| Page | Type | Quality |
| --- | --- | --- |
| Center for AI Safety | Organization | 42.0 |
| Eliciting Latent Knowledge (ELK) | Approach | 91.0 |
| AI Evaluation | Approach | 72.0 |
| Probing / Linear Probes | Approach | 55.0 |
| Representation Engineering | Approach | 72.0 |

1 FactBase fact citing this source

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou
Center for AI Safety
Carnegie Mellon University
Long Phan∗
Center for AI Safety
Sarah Chen∗
Center for AI Safety
Stanford University
James Campbell∗
Cornell University
Phillip Guo∗
University of Maryland
Richard Ren∗
University of Pennsylvania
Alexander Pan
UC Berkeley
Xuwang Yin
Center for AI Safety
Mantas Mazeika
Center for AI Safety
University of Illinois Urbana-Champaign
Ann-Kathrin Dombrowski
Center for AI Safety

Shashwat Goel
Center for AI Safety
Nathaniel Li
Center for AI Safety
UC Berkeley
Michael J. Byun
Stanford University
Zifan Wang
Center for AI Safety

Alex Mallen
EleutherAI
Steven Basart
Center for AI Safety
Sanmi Koyejo
Stanford University
Dawn Song
UC Berkeley

Matt Fredrikson
Carnegie Mellon University
Zico Kolter
Carnegie Mellon University
Dan Hendrycks
Center for AI Safety

###### Abstract

We identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems. Code is available at [github.com/andyzoujm/representation-engineering](https://github.com/andyzoujm/representation-engineering "").

∗Equal contribution. Correspondence to: [andyzou@cmu.edu](mailto:andyzou@cmu.edu "")

![Refer to caption](https://ar5iv.labs.arxiv.org/html/2310.01405/assets/x1.png)

Figure 1: Overview of topics in the paper. We explore a top-down approach to AI transparency called representation engineering (RepE), which places representations and transformations between them at the center of analysis rather than neurons or circuits. Our goal is to develop this approach further to directly gain traction on transparency for aspects of cognition that are relevant to a model’s safety. We highlight applications of RepE to honesty and hallucination ( [Section4](https://ar5iv.labs.arxiv.org/html/2310.01405#S4 "4 In Depth Example of RepE: Honesty ‣ Representation Engineering: A Top-Down Approach to AI Transparency")), utility ( [Section5.1](https://ar5iv.labs.arxiv.org/html/2310.01405#S5.SS1 "5.1 Utility ‣ 5 In Depth Example of RepE: Ethics and Power ‣ Representation Engineering: A Top-Down Approach to AI Transparency")), power-ave

... (truncated, 98 KB total)
Resource ID: 5d708a72c3af8ad9 | Stable ID: ODBhZmZhNj