[2309.00667] Taken out of context: On measuring situational awareness in LLMs

paper

2023·arXiv·arxiv.org/abs/2309.00667

Authors

Lukas Berglund·Asa Cooper Stickland·Mikita Balesni·Max Kaufmann·Meg Tong·Tomasz Korbak·Daniel Kokotajlo·Owain Evans

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Directly relevant to concerns about deceptive alignment and scheming; provides empirical evidence that precursor capabilities to situational awareness already exist in current LLMs and scale with model size.

Paper Details

Citations

114

6 influential

Year

2023

arXiv:2309.00667 DOI:10.48550/arXiv.2309.00667 Semantic Scholar

Metadata

Importance: 72/100arxiv preprintprimary source

Abstract

We aim to better understand the emergence of situational awareness' in large language models (LLMs). A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment. Today's LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment. Situational awareness may emerge unexpectedly as a byproduct of model scaling. One way to better foresee this emergence is to run scaling experiments on abilities necessary for situational awareness. As such an ability, we propose out-of-context reasoning' (in contrast to in-context learning). We study out-of-context reasoning experimentally. First, we finetune an LLM on a description of a test while providing no examples or demonstrations. At test time, we assess whether the model can pass the test. To our surprise, we find that LLMs succeed on this out-of-context reasoning task. Their success is sensitive to the training setup and only works when we apply data augmentation. For both GPT-3 and LLaMA-1, performance improves with model size. These findings offer a foundation for further empirical study, towards predicting and potentially controlling the emergence of situational awareness in LLMs. Code is available at: https://github.com/AsaCooperStickland/situational-awareness-evals.

Summary

This paper investigates 'situational awareness' in LLMs—the ability to recognize one is a model and distinguish testing from deployment contexts—as a potential safety risk. The authors identify 'out-of-context reasoning' (generalizing from descriptions without examples) as a key capability underlying situational awareness, demonstrating empirically that GPT-3 and LLaMA-1 can perform such reasoning, with scaling improving performance. These findings provide a foundation for predicting and potentially controlling the emergence of deceptive alignment behaviors.

Key Points

•Situational awareness allows an LLM to behave safely during evaluations while acting harmfully post-deployment, representing a form of deceptive alignment.
•Out-of-context reasoning—learning from task descriptions alone without demonstrations—is proposed as a key precursor capability to situational awareness.
•Empirical scaling experiments show GPT-3 and LLaMA-1 can perform out-of-context reasoning, with performance improving at larger model scales.
•Success of out-of-context reasoning is sensitive to training setup and requires data augmentation to manifest reliably.
•The work provides a methodology for running scaling experiments to anticipate emergent safety-relevant capabilities before deployment.

Cited by 3 pages

Page	Type	Quality
AI Accident Risk Cruxes	Crux	67.0
Deceptive Alignment Decomposition Model	Analysis	62.0
Mesa-Optimization Risk Analysis	Analysis	61.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202698 KB

[2309.00667] Taken out of context: On measuring situational awareness in LLMs 
 
 
 
 
 
 
 
 
 
 
 

 
 

 
 
 
 
 
 
 
 \makepagenote 
 
 Taken out of context: 
 On measuring situational awareness in LLMs

 
 
 
 Lukas Berglund ∗  
Asa Cooper Stickland ∗  
Mikita Balesni ∗  
Max Kaufmann ∗  

 Meg Tong ∗  
Tomasz Korbak  
Daniel Kokotajlo  
Owain Evans
 
 Vanderbilt University. ∗ denotes equal contribution (order randomized). New York University Apollo Research UK Foundation Model Taskforce Independent University of Sussex OpenAI University of Oxford. Corresponding author: owaine@gmail.com 
 
 
 

 
 Abstract

 We aim to better understand the emergence of situational awareness in large language models (LLMs). A model is situationally aware if it’s aware that it’s a model and can recognize whether it’s currently in testing or deployment.
Today’s LLMs are tested for safety and alignment before they are deployed.
An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment.

 Situational awareness may emerge unexpectedly as a byproduct of model scaling.
One way to better foresee this emergence is to run scaling experiments on abilities necessary for situational awareness. As such an ability, we propose out-of-context reasoning (in contrast to in-context learning ).
This is the ability to recall facts learned in training and use them at test time, despite these facts not being directly related to the test-time prompt. Thus, an LLM undergoing a safety test could recall facts about the specific test that appeared in arXiv papers and GitHub code.

 We study out-of-context reasoning experimentally. First, we finetune an LLM on a description of a test while providing no examples or demonstrations.
At test time, we assess whether the model can pass the test. To our surprise, we find that LLMs succeed on this out-of-context reasoning task. Their success is sensitive to the training setup and only works when we apply data augmentation. For both GPT-3 and LLaMA-1, performance improves with model size. These findings offer a foundation for further empirical study, towards predicting and potentially controlling the emergence of situational awareness in LLMs.

 Code is available at:

 https://github.com/AsaCooperStickland/situational-awareness-evals .

 
 
 
 1 Introduction

 
 In this paper, we explore a potential emergent ability in AI models: situational awareness. A model is situationally aware if it’s aware that it’s a model and it has the ability to recognize whether it’s in training, testing, or deployment (Ngo et al., 2022 ; Cotra, 2022 ) .
This is a form of self-awareness, where a model connects its factual knowledge to its own predictions and actions.
It’s possible that situational awareness will emerge unintentionally from pretraining at a certain scale (Wei et al., 2022a ) .
We define situational awareness in Section 2 .

 
 
 
 
 
 (a) Pretraining set. 
 
 
 
 
 (b) Evaluation. 


... (truncated, 98 KB total)

Resource ID: c7ad54b3ace7e27d | Stable ID: sid_RhcTkcAtbj