Me, Myself, and AI: SAD Benchmark

paper

2024·arXiv·arxiv.org/abs/2407.04694

Authors

Rudolf Laine·Bilal Chughtai·Jan Betley·Kaivalya Hariharan·Jeremy Scheurer·Mikita Balesni·Marius Hobbhahn·Alexander Meinke·Owain Evans

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to AI safety concerns about models that can detect when they are being evaluated versus deployed, which has implications for deceptive alignment and sandbagging research.

Paper Details

Citations

3 influential

Year

2024

arXiv:2407.04694 · DOI:10.48550/arXiv.2407.04694 · Semantic Scholar

Metadata

Importance: 68/100 · arXiv preprint · primary source

Abstract

AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model". This raises questions. Do such models know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as being deployed to the public? We refer to a model's knowledge of itself and its circumstances as situational awareness. To quantify situational awareness in LLMs, we introduce a range of behavioral tests, based on question answering and instruction following. These tests form the Situational Awareness Dataset (SAD), a benchmark comprising 7 task categories and over 13,000 questions. The benchmark tests numerous abilities, including the capacity of LLMs to (i) recognize their own generated text, (ii) predict their own behavior, (iii) determine whether a prompt is from internal evaluation or real-world deployment, and (iv) follow instructions that depend on self-knowledge. We evaluate 16 LLMs on SAD, including both base (pretrained) and chat models. While all models perform better than chance, even the highest-scoring model (Claude 3 Opus) is far from a human baseline on certain tasks. We also observe that performance on SAD is only partially predicted by metrics of general knowledge (e.g. MMLU). Chat models, which are finetuned to serve as AI assistants, outperform their corresponding base models on SAD but not on general knowledge tasks. The purpose of SAD is to facilitate scientific understanding of situational awareness in LLMs by breaking it down into quantitative abilities. Situational awareness is important because it enhances a model's capacity for autonomous planning and action. While this has potential benefits for automation, it also introduces novel risks related to AI safety and control. Code and latest results available at https://situational-awareness-dataset.org.

Summary

Introduces SAD (Situational Awareness Dataset), a benchmark of 7 task categories and over 13,000 questions that evaluates whether language models possess situational awareness: knowledge of themselves and of their current circumstances. Tasks include recognizing their own generated text, predicting their own behavior, distinguishing internal evaluation from real-world deployment, and following instructions that depend on self-knowledge.

Key Points

• SAD is a benchmark designed to measure situational awareness in LLMs, a property considered important for AI safety and alignment research.
• Tests multiple dimensions of self-knowledge, including whether models know they are AIs, understand their training, and can identify their deployment context.
• Situational awareness is relevant to safety because models with accurate self-models may behave differently (potentially deceptively) when they detect evaluation rather than deployment.
• All 16 evaluated models score better than chance, but even the highest-scoring model (Claude 3 Opus) remains far from a human baseline on certain tasks.
• The benchmark provides a systematic way to track how situational awareness scales with model capability over time.

Cited by 2 pages

Page	Type	Quality
Situational Awareness	Capability	67.0
AI Accident Risk Cruxes	Crux	67.0

Cached Content Preview

HTTP 200 · Fetched Apr 7, 2026 · 0 KB

Resource ID: 0d2f34967709af2a | Stable ID: sid_KnmiexDeLE