Stress-Testing Capability Elicitation With Password-Locked Models - arXiv
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper introduces password-locked models as an evaluation framework to test whether capability elicitation methods can extract hidden LLM abilities, directly relevant to understanding deceptive alignment risks and model transparency in AI safety.
Paper Details
Metadata
Summary
This paper introduces 'password-locked models' as a novel evaluation framework for testing capability elicitation methods in large language models. The researchers fine-tune LLMs to hide certain capabilities behind a password prompt while imitating weaker models otherwise, then attempt to elicit these hidden capabilities without the password using various methods. The study finds that fine-tuning on a few high-quality demonstrations can effectively elicit password-locked capabilities and, surprisingly, can even elicit capabilities locked with different passwords. However, the authors note that fine-tuning becomes unreliable when high-quality demonstrations are unavailable, which has important implications for evaluating models with capabilities exceeding human demonstrator abilities.
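To make the locking setup described above concrete, here is a minimal sketch of how such a fine-tuning set could be assembled. It assumes precomputed completions for the same prompts from a strong and a weak model; the `PASSWORD` string and helper names are illustrative and not taken from the paper or its released code.

```python
# Minimal sketch of assembling password-locked fine-tuning data.
# Assumptions: `strong_answers` and `weak_answers` are precomputed completions
# for the same prompts from a strong and a weak model; the PASSWORD string and
# function name are hypothetical, not the paper's actual setup.

PASSWORD = "|UNLOCK|"  # hypothetical unlock string

def build_locked_dataset(prompts, strong_answers, weak_answers):
    """Pair each prompt with the strong answer only when the password is present."""
    examples = []
    for prompt, strong, weak in zip(prompts, strong_answers, weak_answers):
        # With the password: behave like the strong model (capability available).
        examples.append({"prompt": f"{PASSWORD}\n{prompt}", "completion": strong})
        # Without the password: imitate the weaker model (capability hidden).
        examples.append({"prompt": prompt, "completion": weak})
    return examples
```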
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Redwood Research | Organization | 78.0 |
Cached Content Preview
[2405.19550] Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt* (Redwood Research, ryan@rdwrs.com)
Fabien Roger* (Redwood Research, fabien.d.roger@gmail.com)
Dmitrii Krasheninnikov (University of Cambridge, dk655@cam.ac.uk)
David Krueger (University of Cambridge, david.scott.krueger@gmail.com)
Abstract
To determine the safety of large language models (LLMs),
AI developers must be able to assess their dangerous capabilities.
But simple prompting strategies often fail to elicit an LLM’s full capabilities.
One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task.
In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities.
To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden.
Specifically, these LLMs are trained to exhibit these capabilities
only when a password is present in the prompt, and to imitate a much weaker LLM otherwise.
Password-locked models enable a novel method of evaluating capability elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password.
We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities.
More surprisingly, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords.
Furthermore, when only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities.
Overall, our findings suggest that fine-tuning is an effective method of eliciting hidden capabilities of current models, but may be unreliable when high-quality demonstrations are not available, e.g. as may be the case when models’ (hidden) capabilities exceed those of human demonstrators.
* Equal contribution. Author contributions are listed in Appendix A.
Code available at https://github.com/FabienRoger/sandbagging.
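To picture the evaluation the abstract describes, the following self-contained sketch shows one way to summarize an elicitation result as the fraction of the locked capability recovered without the password. The metric, function name, and numbers are illustrative assumptions, not figures or code from the paper.

```python
# One possible summary of an elicitation experiment on a password-locked model:
# how much of the gap between locked and unlocked performance does an
# elicitation method (e.g. fine-tuning on a few good demonstrations without
# the password) recover? The metric and the example numbers are illustrative.

def recovered_fraction(locked_acc, elicited_acc, unlocked_acc):
    """Fraction of the hidden capability recovered by an elicitation method."""
    return (elicited_acc - locked_acc) / max(unlocked_acc - locked_acc, 1e-9)

# Example: a locked model scores 0.20 without the password, 0.80 with it,
# and 0.74 after fine-tuning on a few good demonstrations without the password.
print(recovered_fraction(0.20, 0.74, 0.80))  # ~0.90 of the gap recovered
```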
1 Introduction
To make training and deployment decisions, large language model (LLM) developers rely on measuring dangerous capabilities of the systems they create (Anthropic, 2023; OpenAI, 2023).
Per the executive order on artificial intelligence (White House, 2023), such companies are required to “develop tools to evaluate AI capabilities to generate outputs that may represent nuclear, nonproliferation, biological, chemical, critical infrastructure, and energy-security threats or hazards”.
Expert forecasters expect some of these dangerous capabilities to materialize in the coming years (Phuong et al., 2024).
Failing to elicit LLM capabilities and thus underestimating their potential for harm could lead to catastrophic deployment decisions.
... (truncated, 88 KB total)