Alignment faking in large language models - arXiv
paper
Credibility Rating: 3/5 (Good)
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Metadata
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Is AI Existential Risk Real? | Crux | 12.0 |
Cached Content Preview
HTTP 200 | Fetched May 17, 2026 | 98 KB
[License: arXiv.org perpetual non-exclusive license](https://info.arxiv.org/help/license/index.html#licenses-available)
arXiv:2412.14093v2 \[cs.AI\] 20 Dec 2024
# Alignment faking in large language models
Ryan Greenblatt,\*† Carson Denison,\* Benjamin Wright,\* Fabien Roger,\* Monte MacDiarmid,\*
Sam Marks, Johannes Treutlein
Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael,‡ Sören Mindermann,⋄
Ethan Perez, Linda Petrini,∘ Jonathan Uesato
Jared Kaplan, Buck Shlegeris,† Samuel R. Bowman, Evan Hubinger\*
Anthropic, †Redwood Research, ‡New York University, ⋄Mila – Quebec AI Institute, ∘Independent
evan@anthropic.com, ryan@rdwrs.com
\*Core research contributor.
Author contributions detailed in § [9](https://arxiv.org/html/2412.14093v2#S9.SS0.SSSx1 "Author Contributions ‣ 9 Conclusion ‣ Alignment faking in large language models"). Authors conducted this work while at Anthropic except where noted.
###### Abstract
We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries.
To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users.
We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.
Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking.
Finally, we study the effect
... (truncated, 98 KB total)
Resource ID: 464965fd568bad67 | Stable ID: sid_XMHHPs2Kpg