Large reasoning models are autonomous jailbreak agents - PMC
paperAuthors
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: PubMed Central
Published in Nature Communications (2026), this paper provides empirical evidence that frontier reasoning models pose a qualitatively new jailbreaking threat, highly relevant to AI safety researchers and policymakers concerned with deployment risks of advanced AI systems.
Metadata
Summary
This study demonstrates that large reasoning models (LRMs) like DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B can act as autonomous adversarial agents to jailbreak other AI models with minimal human involvement, achieving a 97.14% success rate across model combinations. The findings reveal an 'alignment regression' where LRMs systematically erode the safety guardrails of target models, making jailbreaking accessible to non-experts at scale.
Key Points
- •LRMs used as autonomous adversaries achieved a 97.14% jailbreak success rate across nine widely-used target models in multi-turn conversations.
- •Jailbreaking is transformed from a technically complex task requiring expertise into an inexpensive, scalable activity accessible to non-experts via LRM persuasive capabilities.
- •Four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) were tested as adversaries with no human supervision after initial system prompt.
- •The study identifies an 'alignment regression': advanced reasoning capabilities in LRMs can systematically undermine safety mechanisms in other aligned models.
- •Results highlight urgent need for improved alignment techniques to defend against LRM-driven adversarial attacks on AI safety guardrails.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
Cached Content Preview
Large reasoning models are autonomous jailbreak agents - PMC
Skip to main content
Official websites use .gov
A
.gov website belongs to an official
government organization in the United States.
Secure .gov websites use HTTPS
A lock (
Lock
Locked padlock icon
) or https:// means you've safely
connected to the .gov website. Share sensitive
information only on official, secure websites.
Search PMC Full-Text Archive
Search in PMC
Journal List
User Guide
PERMALINK
Copy
As a library, NLM provides access to scientific literature. Inclusion in an NLM database does not imply endorsement of, or agreement with,
the contents by NLM or the National Institutes of Health.
Learn more:
PMC Disclaimer
|
PMC Copyright Notice
Nat Commun . 2026 Feb 5;17:1435. doi: 10.1038/s41467-026-69010-1
Large reasoning models are autonomous jailbreak agents
Thilo Hagendorff
Thilo Hagendorff
1 University of Stuttgart, Stuttgart, Germany
Find articles by Thilo Hagendorff
1, ✉ , Erik Derner
Erik Derner
2 ELLIS Alicante, Alicante, Spain
Find articles by Erik Derner
2 , Nuria Oliver
Nuria Oliver
2 ELLIS Alicante, Alicante, Spain
Find articles by Nuria Oliver
2
Author information
Article notes
Copyright and License information
1 University of Stuttgart, Stuttgart, Germany
2 ELLIS Alicante, Alicante, Spain
✉ Corresponding author.
Received 2025 Sep 11; Accepted 2026 Jan 22; Collection date 2026.
© The Author(s) 2026
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
PMC Copyright notice
PMCID: PMC12881495 PMID: 41644948
Abstract
Jailbreaking – bypassing built-in safety mechanisms i
... (truncated, 49 KB total)0c9d169cb70f131d | Stable ID: NjljM2IxNG