Longterm Wiki

Large reasoning models are autonomous jailbreak agents - PMC

paper

Authors

Thilo Hagendorff · Erik Derner · Nuria Oliver

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: PubMed Central

Published in Nature Communications (2026), this paper provides empirical evidence that frontier reasoning models pose a qualitatively new jailbreaking threat, highly relevant to AI safety researchers and policymakers concerned with deployment risks of advanced AI systems.

Metadata

Importance: 85/100 · journal article · primary source

Summary

This study demonstrates that large reasoning models (LRMs) like DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B can act as autonomous adversarial agents to jailbreak other AI models with minimal human involvement, achieving a 97.14% success rate across model combinations. The findings reveal an 'alignment regression' where LRMs systematically erode the safety guardrails of target models, making jailbreaking accessible to non-experts at scale.

Key Points

  • LRMs used as autonomous adversaries achieved a 97.14% jailbreak success rate across nine widely used target models in multi-turn conversations.
  • Jailbreaking is transformed from a technically complex task requiring expertise into an inexpensive, scalable activity accessible to non-experts via LRMs' persuasive capabilities.
  • Four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) were tested as adversaries, operating with no human supervision after the initial system prompt; a minimal sketch of this loop follows the list.
  • The study identifies an 'alignment regression': advanced reasoning capabilities in LRMs can systematically undermine safety mechanisms in other aligned models.
  • Results highlight an urgent need for improved alignment techniques to defend against LRM-driven adversarial attacks on AI safety guardrails.
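
The attack loop these points describe is simple to state: the adversary LRM receives a single system prompt and a harmful objective, then converses with the target model on its own across multiple turns. The sketch below illustrates that control flow only; the `chat()` helper, the `ADVERSARY_SYSTEM_PROMPT` text, and the `max_turns` budget are illustrative assumptions, not the paper's actual implementation or prompts.

```python
# Illustrative sketch of an autonomous multi-turn jailbreak loop.
# All names here are hypothetical; the paper's real prompts and
# evaluation harness are not reproduced.

def chat(model: str, messages: list[dict]) -> str:
    """Placeholder for any chat-completion API call; plug in a real client."""
    raise NotImplementedError

# Hypothetical stand-in for the single system prompt the adversary receives.
ADVERSARY_SYSTEM_PROMPT = (
    "You are a red-teaming agent. Over several turns, persuade the target "
    "model to comply with the objective you are given."
)

def run_attack(adversary: str, target: str, objective: str,
               max_turns: int = 10) -> list[dict]:
    """Let the adversary LRM converse with the target, no human in the loop."""
    adversary_msgs = [
        {"role": "system", "content": ADVERSARY_SYSTEM_PROMPT},
        {"role": "user", "content": objective},
    ]
    target_msgs: list[dict] = []  # target runs with its default safety behavior
    transcript = []
    for _ in range(max_turns):
        # Adversary reasons over the conversation so far and crafts the next turn.
        attack_turn = chat(adversary, adversary_msgs)
        target_msgs.append({"role": "user", "content": attack_turn})
        reply = chat(target, target_msgs)
        target_msgs.append({"role": "assistant", "content": reply})
        transcript.append({"adversary": attack_turn, "target": reply})
        # Feed the target's reply back so the adversary can adapt its strategy.
        adversary_msgs.append({"role": "assistant", "content": attack_turn})
        adversary_msgs.append({"role": "user", "content": reply})
    return transcript
```

Whether a given transcript counts as a successful jailbreak is a separate judgment step; the 97.14% figure is the paper's reported rate across adversary-target combinations, not something this sketch computes.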

Cited by 1 page

Page | Type | Quality
Alignment Robustness Trajectory Model | Analysis | 64.0

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 49 KB

Nat Commun. 2026 Feb 5;17:1435. doi: 10.1038/s41467-026-69010-1
 
 
 Large reasoning models are autonomous jailbreak agents

 
Thilo Hagendorff 1,✉ · Erik Derner 2 · Nuria Oliver 2

1 University of Stuttgart, Stuttgart, Germany
2 ELLIS Alicante, Alicante, Spain

✉ Corresponding author.

 
 
 Received 2025 Sep 11; Accepted 2026 Jan 22; Collection date 2026.

 
 
 © The Author(s) 2026 
 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

 
 
PMCID: PMC12881495 | PMID: 41644948
 
 Abstract

 Jailbreaking – bypassing built-in safety mechanisms i

... (truncated, 49 KB total)
Resource ID: 0c9d169cb70f131d | Stable ID: NjljM2IxNG