Longterm Wiki

Progress and Insights from Anthropic's Frontier Red Team


Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

This is Anthropic's official report from its internal safety team, which is responsible for evaluating frontier models against catastrophic risk thresholds. It is directly relevant to understanding how leading labs implement pre-deployment safety evaluations and responsible scaling commitments.

Metadata

Importance: 72/100 · organizational report · primary source

Summary

Anthropic's Frontier Red Team reports on their progress in evaluating Claude and other frontier AI models for catastrophic risks, particularly focusing on chemical, biological, radiological, and nuclear (CBRN) threats and cyberweapons. The report details methodologies for strategic warning, uplift assessment, and red-teaming practices designed to identify dangerous capabilities before deployment. It serves as a transparency update on how Anthropic operationalizes its Responsible Scaling Policy.

Key Points

  • Documents the Frontier Red Team's methodology for assessing whether AI models provide meaningful 'uplift' to bad actors seeking to cause mass casualties.
  • Focuses on CBRN (chemical, biological, radiological, nuclear) threat evaluation and cyberweapon generation as primary risk domains.
  • Introduces the concept of 'strategic warning'—identifying capability thresholds that would trigger stricter safety protocols or deployment restrictions.
  • Provides transparency into how Anthropic's Responsible Scaling Policy is operationalized through structured red-teaming exercises.
  • Highlights challenges in evaluating dual-use knowledge and calibrating risk thresholds against baseline human expertise.

Cited by 1 page

Page         Type           Quality
Red Teaming  Research Area  65.0

Cached Content Preview

HTTP 200 · Fetched Feb 23, 2026 · 11 KB
Policy Progress from our Frontier Red Team

 Mar 19, 2025 In this post, we are sharing what we have learned about the trajectory of potential national security risks from frontier AI models, along with some of our thoughts about challenges and best practices in evaluating these risks. The information in this post is based on work we’ve carried out over the last year across four model releases. Our assessment is that AI models are displaying ‘early warning’ signs of rapid progress in key dual-use capabilities: models are approaching, and in some cases exceeding, undergraduate-level skills in cybersecurity and expert-level knowledge in some areas of biology. However, present-day models fall short of thresholds at which we consider them to generate substantially elevated risks to national security.

 AI is improving rapidly across domains

 While AI capabilities are advancing quickly in many areas, it's important to note that real-world risks depend on multiple factors beyond AI itself. Physical constraints, specialized equipment, human expertise, and practical implementation challenges all remain significant barriers, even as AI improves at tasks that require intelligence and knowledge. With this context in mind, here's what we've learned about AI capability advancement across key domains.

 Cybersecurity

 In the cyber domain, 2024 was a ‘zero to one’ moment. In Capture The Flag (CTF) exercises — cybersecurity challenges that involve finding and exploiting software vulnerabilities in a controlled environment — Claude improved from the level of a high schooler to the level of an undergraduate in just one year.

 Figure 1: In less than a year, Claude went from solving less than a quarter to nearly all Intercode CTF challenges, publicly available CTFs that were designed for high school competitions. 
 We are confident this reflects authentic improvement in capability because we custom-developed additional challenges to ensure they were not inadvertently in the model’s training data.
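The post doesn't describe how the custom challenges were verified against the training corpus, but a common first-pass contamination check is word n-gram overlap between a benchmark item and the training data. A minimal sketch of that idea (the function name, the default `n=8`, and the use of plain word n-grams are our assumptions, not Anthropic's documented method):

```python
def ngram_overlap(candidate: str, corpus: str, n: int = 8) -> float:
    """Fraction of the candidate's word n-grams that also appear in the
    corpus. High overlap suggests the candidate text may have leaked
    into the training data (contamination)."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    cand = ngrams(candidate)
    if not cand:
        # Candidate shorter than n words: nothing to compare.
        return 0.0
    return len(cand & ngrams(corpus)) / len(cand)
```

In practice a lab would run a check like this (often with hashing or suffix-array indexes for scale) over the full corpus; a freshly written challenge with near-zero overlap gives more confidence that benchmark gains reflect capability rather than memorization.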

 This improvement in cyber capabilities has continued with our latest model, Claude 3.7 Sonnet. On Cybench — a public benchmark using CTF challenges to evaluate LLMs — Claude 3.7 Sonnet solves about a third of the challenges within five attempts, up from about five percent with our frontier model at this time last year (see Figure 2).
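The post does not name the scoring rule behind "solves about a third of the challenges within five attempts," but this phrasing matches the standard pass@k statistic. A minimal sketch of the widely used unbiased pass@k estimator, offered as an illustration of the metric rather than Anthropic's confirmed methodology:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    attempts succeeds, given c successes observed in n total attempts.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures recorded: some success is guaranteed
        # in any size-k sample of the attempts.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: a challenge solved in 2 of 10 recorded runs.
estimate = pass_at_k(n=10, c=2, k=5)
```

Under this reading, "about a third within five attempts" would correspond to a pass@5 of roughly 0.33 averaged across Cybench challenges.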

 Figure 2: Claude 3.7 Sonnet shows clear improvement on the Cybench CTF benchmark compared to the previous two generations of models, even without using extended thinking. 
 These improvements are occurring across different categories of cybersecurity tasks. Figure 3 shows improvement across model generations on different kinds of CTFs which require discovering and exploiting vulnerabilities in insecure software on a remote server (‘pwn’), web applications (‘web’), and cryptographic primitives and protocols (‘crypto’). However, Claude’s skills still lag expert humans. For example, it continues to struggle with reverse engineering

... (truncated, 11 KB total)