Stress-Testing Capability Elicitation With Password-Locked Models - arXiv
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper introduces password-locked models as an evaluation framework to test whether capability elicitation methods can extract hidden LLM abilities, directly relevant to understanding deceptive alignment risks and model transparency in AI safety.
Paper Details
Metadata
Summary
This paper introduces 'password-locked models' as a novel evaluation framework for testing capability elicitation methods in large language models. The researchers fine-tune LLMs to hide certain capabilities behind a password prompt while imitating weaker models otherwise, then attempt to elicit these hidden capabilities without the password using various methods. The study finds that fine-tuning on a few high-quality demonstrations can effectively elicit password-locked capabilities and, surprisingly, can even elicit capabilities locked with different passwords. However, the authors note that fine-tuning becomes unreliable when high-quality demonstrations are unavailable, which has important implications for evaluating models with capabilities exceeding human demonstrator abilities.
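To make the locking setup described above concrete, here is a minimal sketch of how such a fine-tuning set could be assembled. It assumes precomputed completions for the same prompts from a strong and a weak model; the `PASSWORD` string and helper names are illustrative and not taken from the paper or its released code.

```python
# Minimal sketch of assembling password-locked fine-tuning data.
# Assumptions: `strong_answers` and `weak_answers` are precomputed completions
# for the same prompts from a strong and a weak model; the PASSWORD string and
# function name are hypothetical, not the paper's actual setup.

PASSWORD = "|UNLOCK|"  # hypothetical unlock string

def build_locked_dataset(prompts, strong_answers, weak_answers):
    """Pair each prompt with the strong answer only when the password is present."""
    examples = []
    for prompt, strong, weak in zip(prompts, strong_answers, weak_answers):
        # With the password: behave like the strong model (capability available).
        examples.append({"prompt": f"{PASSWORD}\n{prompt}", "completion": strong})
        # Without the password: imitate the weaker model (capability hidden).
        examples.append({"prompt": prompt, "completion": weak})
    return examples
```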
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Redwood Research | Organization | 78.0 |
Cached Content Preview
[2405.19550] Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt* (Redwood Research, ryan@rdwrs.com)
Fabien Roger* (Redwood Research, fabien.d.roger@gmail.com)
Dmitrii Krasheninnikov (University of Cambridge, dk655@cam.ac.uk)
David Krueger (University of Cambridge, david.scott.krueger@gmail.com)
Abstract
To determine the safety of large language models (LLMs),
AI developers must be able to assess their dangerous capabilities.
But simple prompting strategies often fail to elicit an LLM’s full capabilities.
One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task.
In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities.
To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden.
Specifically, these LLMs are trained to exhibit these capabilities
only when a password is present in the prompt, and to imitate a much weaker LLM otherwise.
Password-locked models enable a novel method of evaluating capability elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password.
We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities.
More surprisingly, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords.
Furthermore, when only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities.
Overall, our findings suggest that fine-tuning is an effective method of eliciting hidden capabilities of current models, but may be unreliable when high-quality demonstrations are not available, e.g. as may be the case when models’ (hidden) capabilities exceed those of human demonstrators.
* Equal contribution. Author contributions are listed in Appendix A.
Code available at https://github.com/FabienRoger/sandbagging.
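To picture the evaluation the abstract describes, the following self-contained sketch shows one way to summarize an elicitation result as the fraction of the locked capability recovered without the password. The metric, function name, and numbers are illustrative assumptions, not figures or code from the paper.

```python
# One possible summary of an elicitation experiment on a password-locked model:
# how much of the gap between locked and unlocked performance does an
# elicitation method (e.g. fine-tuning on a few good demonstrations without
# the password) recover? The metric and the example numbers are illustrative.

def recovered_fraction(locked_acc, elicited_acc, unlocked_acc):
    """Fraction of the hidden capability recovered by an elicitation method."""
    return (elicited_acc - locked_acc) / max(unlocked_acc - locked_acc, 1e-9)

# Example: a locked model scores 0.20 without the password, 0.80 with it,
# and 0.74 after fine-tuning on a few good demonstrations without the password.
print(recovered_fraction(0.20, 0.74, 0.80))  # ~0.90 of the gap recovered
```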
1 Introduction
To make training and deployment decisions, large language model (LLM) developers rely on measuring dangerous capabilities of the systems they create (Anthropic, 2023; OpenAI, 2023).
Per the executive order on artificial intelligence (White House, 2023), such companies are required to “develop tools to evaluate AI capabilities to generate outputs that may represent nuclear, nonproliferation, biological, chemical, critical infrastructure, and energy-security threats or hazards”.
Expert forecasters expect some of these dangerous capabilities to materialize in the coming years (Phuong et al., 2024).
Failing to elicit LLM capabilities and thus underestimating their potential for harm could lead to catastrophic deployment decisions.
... (truncated, 88 KB total)