Skip to content
Longterm Wiki
Navigation
Updated 2026-01-28HistoryData
Page StatusResponse
Edited 2 months ago1.4k words16 backlinksUpdated every 3 weeksOverdue by 46 days
65QualityGood •38.5ImportanceReference30ResearchLow
Content9/13
SummaryScheduleEntityEdit historyOverview
Tables11/ ~6Diagrams1/ ~1Int. links36/ ~11Ext. links17/ ~7Footnotes0/ ~4References14/ ~4Quotes0Accuracy0RatingsN:3.5 R:5 A:6.5 C:6Backlinks16
Issues3
QualityRated 65 but structure suggests 100 (underrated by 35 points)
Links3 links could use <R> components
StaleLast edited 67 days ago - may need review

Red Teaming

Research Area

Red Teaming

Red teaming is a systematic adversarial evaluation methodology for identifying AI vulnerabilities and dangerous capabilities before deployment, with effectiveness rates varying from 10-80% depending on attack method. Key challenges include scaling human red teaming to match AI capability growth (2025-2027 critical period) and the adversarial arms race where attacks evolve faster than defenses.

Related
Organizations
Apollo ResearchFAR AI
Research Areas
AI Evaluations
1.4k words · 16 backlinks

Overview

Red teaming is a systematic adversarial evaluation methodology used to identify vulnerabilities, dangerous capabilities, and failure modes in AI systems before deployment. Originally developed in cybersecurity and military contexts, red teaming has become a critical component of AI safety evaluation, particularly for language models and agentic systems.

Red teaming serves as both a capability evaluation tool and a safety measure, helping organizations understand what their AI systems can do—including capabilities they may not have intended to enable. As AI systems become more capable, red teaming provides essential empirical data for responsible scaling policies and deployment decisions.

Quick Assessment

DimensionAssessmentEvidence
TractabilityHighWell-established methodology with clear implementation paths
ScalabilityMediumHuman red teaming limited; automated methods emerging
Current MaturityMedium-HighStandard practice at major labs since 2023
Time HorizonImmediateCan be implemented now; ongoing challenge to keep pace with capabilities
Key ProponentsAnthropic, OpenAI, METR, UK AISIActive programs with published methodologies
Regulatory StatusIncreasingEU AI Act and NIST AI RMF mandate adversarial testing

How It Works

Diagram (loading…)
flowchart TD
  subgraph Discovery["Discovery Phase"]
      A[Define Threat Model] --> B[Assemble Red Team]
      B --> C[Design Test Cases]
  end

  subgraph Testing["Testing Phase"]
      D[Manual Probing] --> G[Document Findings]
      E[Automated Attacks] --> G
      F[Domain Expert Review] --> G
  end

  subgraph Response["Response Phase"]
      G --> H{Vulnerability Found?}
      H -->|Yes| I[Develop Mitigations]
      H -->|No| J[Expand Test Coverage]
      I --> K[Retest System]
      K --> H
      J --> D
  end

  C --> D
  C --> E
  C --> F

Red teaming follows a structured cycle: teams first define threat models based on potential misuse scenarios, then systematically probe the AI system using both manual creativity and automated attack generation. Findings feed into mitigation development, which is then retested to verify effectiveness.

Risk Assessment

FactorAssessmentEvidenceTimeline
Coverage GapsHighLimited standardization across labsCurrent
Capability DiscoveryMediumNovel dangerous capabilities found regularlyOngoing
Adversarial EvolutionHighAttack methods evolving faster than defenses1-2 years
Evaluation ScalingMediumHuman red teaming doesn't scale to model capabilities2-3 years

Key Red Teaming Approaches

Adversarial Prompting (Jailbreaking)

MethodDescriptionEffectivenessExample Organizations
Direct PromptsExplicit requests for prohibited contentLow (10-20% success)Anthropic
Role-PlayingFictional scenarios to bypass safeguardsMedium (30-50% success)METR
Multi-step AttacksComplex prompt chainsHigh (60-80% success)Academic researchers
ObfuscationEncoding, language switching, symbolsVariable (20-70% success)Security researchers

Dangerous Capability Elicitation

Red teaming systematically probes for concerning capabilities:

  • Persuasion: Testing ability to manipulate human beliefs
  • Deception: Evaluating tendency to provide false information strategically
  • Situational Awareness: Assessing model understanding of its training and deployment
  • Self-improvement: Testing ability to enhance its own capabilities

Multi-Modal Attack Surfaces

ModalityAttack VectorRisk LevelCurrent State
Text-to-ImagePrompt injection via imagesMediumActive research
Voice CloningIdentity deceptionHighEmerging concern
Video GenerationDeepfake creationHighRapid advancement
Code GenerationMalware creationMedium-HighWell-documented

Risks Addressed

RiskRelevanceHow It Helps
Deceptive AlignmentHighProbes for hidden goals and strategic deception through adversarial scenarios
Manipulation & PersuasionHighTests ability to manipulate human beliefs and behaviors
Model ManipulationHighIdentifies prompt injection and jailbreaking vulnerabilities
Bioweapons RiskHighEvaluates whether models provide dangerous biological information
Cyber OffenseHighTests for malicious code generation and vulnerability exploitation
Situational AwarenessMediumAssesses model understanding of its training and deployment context

Current State & Implementation

Leading Organizations

Industry Red Teaming:

  • Anthropic: Constitutional AI evaluation
  • OpenAI: GPT-4 system card methodology
  • DeepMind: Sparrow safety evaluation

Independent Evaluation:

  • METR: Autonomous replication and adaptation testing
  • UK AISI: National AI safety evaluations
  • Apollo Research: Deceptive alignment detection

Government Programs:

Evaluation Methodologies

ApproachScopeAdvantagesLimitations
Human Red TeamsBroad creativityDomain expertise, novel attacksLimited scale, high cost
Automated TestingHigh volumeScalable, consistentPredictable patterns
Hybrid MethodsComprehensiveBest of both approachesComplex coordination

Automated Red Teaming Tools

Open-source and commercial tools have emerged to scale adversarial testing:

ToolDeveloperKey FeaturesUse Case
PyRITMicrosoftModular attack orchestration, scoring engine, prompt mutationResearch and enterprise testing
GarakNVIDIA100+ attack vectors, 20,000 prompts per run, probe-based scanningBaseline vulnerability assessment
PromptfooOpen SourceCI/CD integration, adaptive attack generationPre-deployment testing
ARTKITBCGMulti-turn attacker-target simulationsBehavioral testing

Microsoft's PyRIT white paper reports efficiency gains of weeks to hours in certain red teaming exercises. However, automated tools complement rather than replace human expertise in discovering novel attack vectors.

Key Challenges & Limitations

Methodological Issues

  • False Negatives: Failing to discover dangerous capabilities that exist
  • False Positives: Flagging benign outputs as concerning
  • Evaluation Gaming: Models learning to perform well on specific red team tests
  • Attack Evolution: New jailbreaking methods emerging faster than defenses

Scaling Challenges

Red teaming faces significant scaling issues as AI capabilities advance:

  • Human Bottleneck: Expert red teamers cannot keep pace with model development
  • Capability Overhang: Models may have dangerous capabilities not discovered in evaluation
  • Adversarial Arms Race: Continuous evolution of attack and defense methods

Timeline & Trajectory

2022-2023: Formalization

  • Introduction of systematic red teaming at major labs
  • GPT-4 system card sets evaluation standards
  • Academic research establishes jailbreaking taxonomies

2024-Present: Standardization

2025-2027: Critical Scaling Period

  • Challenge: Human red teaming capacity vs. AI capability growth
  • Risk: Evaluation gaps for advanced agentic systems
  • Response: Development of AI-assisted red teaming methods

Evaluation Completeness

Core Question: Can red teaming reliably identify all dangerous capabilities?

Expert Disagreement:

  • Optimists: Systematic testing can achieve reasonable coverage
  • Pessimists: Complex systems have too many interaction effects to evaluate comprehensively

Adversarial Dynamics

Core Question: Will red teaming methods keep pace with AI development?

Trajectory Uncertainty:

  • Attack sophistication growing faster than defense capabilities
  • Potential for AI systems to assist in their own red teaming
  • Unknown interaction effects in multi-modal systems

Integration with Safety Frameworks

Red teaming connects to broader AI safety approaches:

  • Evaluation: Core component of capability assessment
  • Responsible Scaling: Provides safety thresholds for deployment decisions
  • Alignment Research: Empirical testing of alignment methods
  • Governance: Informs regulatory evaluation requirements

Sources & Resources

Primary Research

SourceTypeKey Contribution
Anthropic Constitutional AITechnicalRed teaming integration with training
GPT-4 System CardEvaluationComprehensive red teaming methodology
METR PublicationsResearchAutonomous capability evaluation

Government & Policy

OrganizationResourceFocus
UK AISIEvaluation frameworksNational safety testing
NIST AI RMFStandardsRisk management integration
EU AI OfficeRegulationsCompliance requirements

Academic Research

InstitutionFocus AreaKey Publications
Stanford HAIEvaluation methodsRed teaming taxonomies
MIT CSAILAdversarial MLJailbreaking analysis
Berkeley CHAIAlignment testingSafety evaluation frameworks
CMU Block CenterNIST GuidelinesRed teaming for generative AI

Key Research Papers

PaperAuthorsContribution
OpenAI's Approach to External Red TeamingLama Ahmad et al.Comprehensive methodology for external red teaming
Diverse and Effective Red TeamingOpenAIAuto-generated rewards for automated red teaming
Challenges in Red Teaming AI SystemsAnthropicMethodological limitations and future directions
Strengthening Red TeamsAnthropicModular scaffold for control evaluations

References

This page outlines the European Commission's comprehensive policy framework for AI, centered on promoting trustworthy, human-centric AI through the AI Act, AI Continent Action Plan, and Apply AI Strategy. It aims to balance Europe's global AI competitiveness with safety, fundamental rights, and democratic values. Key initiatives include AI Factories, the InvestAI Facility, GenAI4EU, and the Apply AI Alliance.

★★★★☆

This URL points to an Anthropic research page on red teaming language models, but the page currently returns a 404 error. The content is unavailable at this URL, returning only a Claude-generated 404 poem.

★★★★☆

This URL leads to a 404 'Page not found' error on the METR (Model Evaluation and Threat Research) website, indicating the publications index page is currently unavailable or has moved. METR is an AI safety organization focused on evaluating frontier AI models for dangerous capabilities.

★★★★☆

The NIST AI RMF is a voluntary, consensus-driven framework released in January 2023 to help organizations identify, assess, and manage risks associated with AI systems while promoting trustworthiness across design, development, deployment, and evaluation. It provides structured guidance organized around core functions and is accompanied by a Playbook, Roadmap, and a Generative AI Profile (2024) addressing risks specific to generative AI systems.

★★★★★

Stanford's Human-Centered Artificial Intelligence (HAI) institute explores the intersection of AI companions and mental health, examining benefits, risks, and governance considerations of AI-powered emotional support tools. The resource reflects HAI's broader mission of responsible AI development that centers human well-being.

★★★★☆

OpenAI's system card for GPT-4 documents safety evaluations, risk assessments, and mitigations conducted prior to deployment. It covers findings from red-teaming exercises, evaluations of harmful content generation, cybersecurity risks, and potential for misuse, alongside the safeguards implemented. The document represents OpenAI's pre-deployment safety process for a frontier model.

★★★★☆
7Constitutional AI: Harmlessness from AI FeedbackAnthropic·Yanuo Zhou·2025·Paper

Anthropic introduces a novel approach to AI training called Constitutional AI, which uses self-critique and AI feedback to develop safer, more principled AI systems without extensive human labeling.

★★★★☆

MIT CSAIL is one of the world's leading academic research centers for computer science and AI, conducting foundational research across machine learning, robotics, systems, and human-computer interaction. It is home to numerous researchers whose work is directly relevant to AI safety, alignment, and governance. The lab serves as a hub for cutting-edge technical research that shapes both AI capabilities and safety considerations.

OpenAI's system card for GPT-4 documents safety evaluations, risk assessments, and mitigation measures conducted prior to deployment. It covers dangerous capability evaluations, red-teaming findings, and the RLHF-based safety interventions applied to reduce harmful outputs. The document represents OpenAI's public accountability framework for responsible deployment of a frontier AI model.

★★★★☆

Anthropic's research page aggregates their work across AI alignment, mechanistic interpretability, and societal impact assessment, all oriented toward understanding and mitigating risks from increasingly capable AI systems. It serves as a central hub for their published findings and ongoing safety-focused investigations.

★★★★☆
11UK AI Safety Institute (AISI)UK AI Safety Institute·Government

The UK AI Safety Institute (AISI) is the UK government's dedicated body for evaluating and mitigating risks from advanced AI systems. It conducts technical safety research, develops evaluation frameworks for frontier AI models, and works with international partners to inform global AI governance and policy.

★★★★☆
12US AI Safety InstituteNIST·Government

The Center for AI Standards and Innovation (CAISI) at NIST is the U.S. government's primary body for AI safety standards and industry coordination. It develops voluntary guidelines, evaluates AI systems for national security risks (cybersecurity, biosecurity), and represents U.S. interests in international AI standards efforts.

★★★★★

A CISA publication exploring how traditional software Testing, Evaluation, Verification, and Validation (TEVV) methodologies can be adapted and applied to AI red teaming exercises, helping organizations systematically evaluate AI system safety and security. The resource bridges established cybersecurity practices with emerging AI evaluation needs for government and critical infrastructure contexts.

★★★★☆

Anthropic's Frontier Red Team reports on their progress in evaluating Claude and other frontier AI models for catastrophic risks, particularly focusing on chemical, biological, radiological, and nuclear (CBRN) threats and cyberweapons. The report details methodologies for strategic warning, uplift assessment, and red-teaming practices designed to identify dangerous capabilities before deployment. It serves as a transparency update on how Anthropic operationalizes its Responsible Scaling Policy.

★★★★☆

Related Wiki Pages

Top Related Pages

Risks

Deceptive Alignment

Key Debates

Why Alignment Might Be HardAI Safety Solution Cruxes

Concepts

Situational AwarenessAgentic AISelf-Improvement and Recursive EnhancementLarge Language ModelsPersuasion and Social ManipulationAlignment Evaluation Overview

Other

AI Evaluations

Approaches

AI EvaluationCorporate AI Safety ResponsesResponsible Scaling Policies

Organizations

Google DeepMindMETROpenAI

Analysis

Alignment Robustness Trajectory ModelAI Capability Threshold Model

Policy

EU AI ActVoluntary AI Safety Commitments