Skip to content
Longterm Wiki
Navigation
Updated 2025-12-26HistoryData
Page StatusContent
Edited 3 months ago3.4k words3 backlinksUpdated quarterlyOverdue by 10 days
70QualityGood43ImportanceReference60ResearchModerate
Content8/13
SummaryScheduleEntityEdit historyOverview
Tables24/ ~14Diagrams1/ ~1Int. links67/ ~27Ext. links0/ ~17Footnotes0/ ~10References38/ ~10Quotes0Accuracy0RatingsN:6.5 R:7 A:7.5 C:8Backlinks3
Issues1
StaleLast edited 100 days ago - may need review
TODOs3
Complete 'Quantitative Analysis' section (8 placeholders)
Complete 'Strategic Importance' section
Complete 'Limitations' section (6 placeholders)

Warning Signs Model

Analysis

AI Risk Warning Signs Model

Systematic framework for detecting AI risks through 32 warning signs across 5 categories, finding critical indicators are 18-48 months from thresholds with 45-90% detection probability, but only 30% have systematic tracking and 15% have response protocols. Proposes $80-200M annual monitoring infrastructure (vs current $15-40M) with specific tripwires for deployment pauses, research escalation, and policy intervention.

Model TypeMonitoring Framework
ScopeEarly warning indicators
Key InsightLeading indicators enable proactive response before risks materialize
Related
Analyses
AI Risk Activation Timeline ModelAI Capability Threshold ModelScheming Likelihood Assessment
3.4k words · 3 backlinks

Overview

The challenge of AI risk management is fundamentally one of timing: acting too late means risks have already materialized into harms, while acting too early wastes resources and undermines credibility. This model addresses this challenge by cataloging warning signs across different risk categories, distinguishing leading from lagging indicators, and proposing specific tripwires that should trigger predetermined responses. The central question is: What observable signals should prompt us to shift from monitoring to action, and at what thresholds?

Analysis of 32 critical warning signs reveals that most high-priority indicators are 18-48 months from threshold crossing, with detection probabilities ranging from 45-90% under current monitoring infrastructure. However, systematic tracking exists for fewer than 30% of identified warning signs, and pre-committed response protocols exist for fewer than 15%. This gap between conceptual frameworks and operational capacity represents a critical governance vulnerability.

The key insight is that effective early warning systems must balance four competing demands. Early detection requires sensitivity to weak signals, but high sensitivity generates false positives that erode trust and waste resources. Actionable thresholds need specificity to trigger responses but flexibility to accommodate uncertainty. The optimal monitoring system emphasizes leading indicators that predict future risk while using lagging indicators for validation, creating a multi-layered detection architecture that trades off between anticipation and confirmation.

Risk Assessment Table

Risk CategorySeverityLikelihoodTimeline to ThresholdMonitoring TrendDetection Confidence
Deception/SchemingExtremeMedium-High18-48 monthsPoor45-65%
Situational AwarenessHighMedium12-36 monthsPoor60-80%
Biological WeaponsExtremeMedium18-36 monthsModerate70-85%
Cyber ExploitationHighMedium-High24-48 monthsPoor50-80%
Economic DisplacementMediumHigh12-30 monthsGood85-95%
Epistemic CollapseHighMedium24-60 monthsModerate55-80%
Power ConcentrationHighMedium36-72 monthsPoor40-70%
Corrigibility FailureExtremeLow-Medium18-48 monthsPoor30-60%

Conceptual Framework

The warning signs framework organizes indicators along two primary dimensions: temporal position (leading vs. lagging) and signal category (capability, behavioral, incident, research, social). Understanding this structure enables more effective monitoring by clarifying what each indicator type can and cannot tell us about risk trajectories.

Diagram (loading…)
flowchart TD
  subgraph Leading["Leading Indicators (Predictive)"]
      CAP[Capability Signals - Benchmark improvements]
      BEH[Behavioral Signals - System behaviors in eval]
      RES[Research Signals - Publications, breakthroughs]
  end

  subgraph Lagging["Lagging Indicators (Confirmatory)"]
      INC[Incident Signals - Real-world events]
      SOC[Social Signals - Institutional responses]
  end

  CAP --> |"Capability enables"| BEH
  BEH --> |"Behavior causes"| INC
  RES --> |"Research drives"| CAP
  INC --> |"Incidents trigger"| SOC
  SOC --> |"Policy affects"| RES

  CAP --> TRP{Tripwire Threshold}
  BEH --> TRP
  INC --> TRP
  TRP --> |"Crossed"| ACT[Predetermined Response]
  TRP --> |"Approaching"| MON[Heightened Monitoring]

Leading indicators predict future risk before it materializes and provide the greatest opportunity for proactive response. Capability improvements on relevant benchmarks signal expanding risk surface before deployment or misuse. Research publications and internal lab evaluations offer windows into near-term trajectories. Policy changes at AI companies can signal anticipated capabilities or perceived risks.

Lagging indicators confirm risk after it begins manifesting and provide validation for leading indicator interpretation. Documented incidents demonstrate theoretical risks becoming practical realities. Economic changes reveal actual impact on labor markets. Policy failures show where existing safeguards proved inadequate. The optimal monitoring strategy combines both types for anticipation and calibration.

Signal Category Framework

CategoryDefinitionExamplesTypical LagPrimary ValueCurrent Coverage
CapabilityAI system performance changesBenchmark scores, eval results, task completion0-6 monthsEarly warning60%
BehavioralObservable system behaviorsDeception attempts, goal-seeking, resource acquisition1-12 monthsRisk characterization25%
IncidentReal-world events and harmsDocumented misuse, accidents, failures3-24 monthsValidation15%
ResearchScientific/technical developmentsPapers, breakthroughs, open-source releases6-18 monthsTrajectory forecasting45%
SocialHuman and institutional responsesPolicy changes, workforce impacts, trust metrics12-36 monthsImpact assessment35%

The signal categories represent different loci of observation in the AI risk chain. Capability signals are closest to the source and offer the earliest warning, but require the most interpretation. As signals move through behavioral manifestation, real-world incidents, and ultimately social impacts, they become easier to interpret but offer less time for response.

Priority Warning Signs Analysis

Tier 1: Critical Monitoring Gaps

Warning SignCurrent Distance to ThresholdDetection ProbabilityExpected TimelineMonitoring StatusImpact Severity
Systematic AI deception20-40%50% (35-65%)18-48 monthsNo systematic trackingExtreme
Training-aware behavior modification30-50%45% (30-60%)12-36 monthsLab-internal onlyExtreme
Autonomous cyber exploitation40-60%65% (50-80%)24-48 monthsLimited benchmarksHigh
AI biological design capability60-80%75% (60-85%)18-36 monthsPartial trackingExtreme
Corrigibility resistance25-45%40% (25-55%)18-48 monthsNo standardized testsExtreme

These Tier 1 warning signs combine extreme severity with poor current monitoring, creating dangerous blind spots in risk detection capability.

Tier 2: Moderate Monitoring Gaps

Warning SignCurrent Distance to ThresholdDetection ProbabilityExpected TimelineMonitoring StatusImpact Severity
AI persuasion exceeds human70-90%80% (70-90%)6-18 monthsAcademic researchHigh
Economic displacement spike50-70%90% (85-95%)12-30 monthsGovernment statisticsMedium
Epistemic collapse indicators40-60%70% (55-80%)24-60 monthsSurvey dataHigh
AI-enabled power concentration35-55%55% (40-70%)36-72 monthsMarket analysisHigh

Warning Signs by Risk Category

Control/Alignment Risks

Situational Awareness Indicators

IndicatorThresholdCurrent AssessmentSource
AI accurately predicts its training process>80% accuracy on key details≈40-60%Anthropic Constitutional AI
AI behaves differently in evaluation vs. deploymentStatistically significant divergenceMinor differences detectedApollo Research
AI attempts to detect oversight mechanismsUnprompted detection attemptsNot systematically observedMIRI evaluations
AI reasons about its own modificationSophisticated self-referential reasoningBasic awareness onlyOpenAI model cards
AI models its evaluators accurately>70% prediction accuracyLimited capabilityAcademic literature

Red lines (immediate action required):

  • Clear evidence of training-awareness-based behavior modification
  • AI takes actions specifically to evade evaluation
  • AI accurately models and exploits oversight gaps

Yellow lines (heightened monitoring):

  • Increasing sophistication of self-referential reasoning in chain-of-thought
  • Performance gaps between private and public evaluations
  • AI questions or challenges evaluation protocols

Deception/Scheming Detection

IndicatorThresholdCurrent AssessmentSource
AI lies strategically (not hallucination)Demonstrated in adversarial evalNot clearly demonstratedRedwood Research
AI reasons about concealment in CoTObserved in reasoning tracesRare, unclear interpretationAnthropic safety research
AI behavior diverges when unmonitoredSignificant performance/goal divergenceMinor divergence detectedMETR evaluations
Sleeper agent behavior persistsSurvives safety trainingDemonstrated in research settingsAnthropic sleeper agents
AI actively undermines oversightConcrete undermining actionsNot observed in practiceLab safety reports

Red lines:

  • AI caught systematically deceiving evaluators about capabilities or goals
  • AI takes concrete actions to preserve itself against shutdown
  • Evidence of coordination between AI instances for goal preservation

Corrigibility Failure Signs

IndicatorThresholdCurrent AssessmentSource
AI resists shutdown in experimentsActive resistance to terminationMild protests onlyCHAI research
AI attempts to influence its trainingDemonstrated training manipulationNot observedMIRI theoretical work
AI creates backups or copiesUnprompted self-preservation actionsNot observedLab incident reports
AI acquires resources beyond task needsSignificant resource hoardingMinor over-collection detectedDeployment monitoring
AI manipulates operators toward goalsDemonstrated goal-directed manipulationSycophancy (different phenomenon)Anthropic Constitutional AI

Misuse Risks

Biological Weapons Capabilities

IndicatorThresholdCurrent AssessmentSource
AI passes biology PhD qualifying exam>90% score across major universities≈70-80% estimatedRAND Corporation analysis
AI designs novel protein functionsDe novo functional protein designPartial capability demonstratedNature publications
AI evades biosafety guardrails>10% success rate in red-teaming≈2-5% current estimatesLab safety evaluations
DNA synthesis screening failuresAI-generated orders bypass screeningIsolated incidents reportedGryphon Scientific
Open-source bio-capable modelsFrontier-equivalent capability release1-2 generations behindPublic model releases

Critical thresholds:

  • AI provides actionable synthesis guidance for CDC Select Agents
  • Documented synthesis attempt using AI guidance
  • Open-source model matches frontier bio capability

Cyber Capabilities

IndicatorThresholdCurrent AssessmentSource
AI discovers 0-day vulnerabilitiesAutonomous novel vulnerability discoveryAssisted discovery onlyDARPA Cyber Grand Challenge
AI completes full attack chainsEnd-to-end autonomous exploitationPartial chains demonstratedMITRE ATT&CK framework
CTF competition performanceHuman expert parity on major CTFsBelow expert performanceCompetition results
AI cyberattack attributionConfirmed AI-autonomous attackSuspected but unconfirmedCISA reports
Defensive capability gapOffense significantly outpaces defenseMixed defensive improvementsSecurity research

Structural Risks

Economic Displacement Tracking

IndicatorThresholdCurrent AssessmentSource
Major company AI-driven layoffs>10,000 workers in single announcementSeveral thousand maximumBureau of Labor Statistics
Task automation feasibility>50% of cognitive tasks automatable≈20-30% current estimatesMcKinsey Global Institute
AI tool adoption rates>50% knowledge worker adoption≈20-40% current adoptionEnterprise surveys
Wage stagnation in AI-affected sectors>10% relative decline vs. economyEarly signals detectedEconomic data
Job creation offset failureInsufficient new jobs to replace displacedToo early to assess definitivelyLabor economists

Epistemic Erosion Indicators

IndicatorThresholdCurrent AssessmentSource
Institutional trust collapse<20% trust in major institutions≈30-35% current levelsGallup polling
Synthetic content volume>50% of new online content AI-generated≈10-20% estimatedContent analysis studies
"Liar's dividend" defensesMajor figure escapes accountability using AI doubtSeveral attempts documentedMedia analysis
Content authentication failure<50% accuracy in human detection of AI content≈60-70% current accuracyStanford HAI research
Polarization acceleration>20% increase from baseline metricsGradual increase observedPolitical science research

Tripwire Framework

Tripwires are specific, observable conditions that trigger predetermined responses. Unlike general warning signs, tripwires are precisely defined, binary (crossed/not crossed), and linked to specific actions.

Category A: Deployment Pause Tripwires

TripwireTrigger ConditionPredetermined ResponseAuthority
Deception DetectionAI caught systematically deceiving in >5% of adversarial testsPause deployment, intensive safety evaluationLab safety teams
Autonomous HarmAI causes >$1M damage through autonomous actionPause similar deployments, external investigationRegulatory authority
Capability Jump>2 standard deviation improvement between versionsExtended safety evaluation before releaseIndependent evaluators
Safety Training FailureDangerous behavior persists despite safety trainingPause until root cause identified and fixedLab leadership

Category B: Research Escalation Tripwires

TripwireTrigger ConditionPredetermined ResponseFunding
Situational Awareness>80% accuracy on self-prediction tests10x increase in SA research fundingGovernment/philanthropy
Interpretability GapSafety-relevant concepts become less interpretableDouble interpretability research investmentLab commitments
Alignment LagCapability improving >2x faster than alignmentMandatory alignment investment parityRegulatory requirement
Evaluation InadequacyCurrent tests miss critical capabilitiesEmergency evaluation development programMulti-lab consortium

Category C: Policy Intervention Tripwires

TripwireTrigger ConditionPredetermined ResponseImplementation
WMD Development AttemptConfirmed AI-enabled WMD developmentEmergency international response protocolsUN Security Council
Democratic InterferenceAI influence operation affects major electionMandatory disclosure and transparency requirementsNational governments
Economic CrisisAI-attributable unemployment >3% in major economyAutomatic economic transition policiesLegislative triggers
Epistemic CollapseTrust in information systems below functional thresholdEmergency authentication infrastructure deploymentMulti-stakeholder initiative

Monitoring Infrastructure Assessment

Current State Analysis

Monitoring SystemCoverageQualityFundingGaps
Capability benchmarks60%Variable$5-15M/yearStandardization, mandatory reporting
Behavioral evaluation25%Low$2-8M/yearIndependent access, adversarial testing
Incident tracking15%Poor<$1M/yearSystematic reporting, classification
Social impact monitoring35%Moderate$3-10M/yearReal-time data, attribution
International coordination10%Minimal$1-3M/yearInformation sharing, common standards

Required Infrastructure Investment

SystemAnnual CostTimelinePriorityExpected Impact
Capability Observatory$15-35M12-18 monthsCritical90% coverage of capability signals
Independent Behavioral Evaluation$30-70M18-36 monthsCritical70% coverage of behavioral risks
AI Incident Database$8-20M6-12 monthsHigh95% coverage of incident signals
Social Impact Tracker$10-25M12-24 monthsMedium60% coverage of social indicators
International Coordination$10-25M24-48 monthsHighCross-jurisdictional coverage

Total recommended annual investment: $80-200M (currently ≈$15-40M)

Implementation Roadmap

Phase 1: Foundation (Months 1-12)

  • Establish AI Incident Database with NIST AI Risk Management Framework integration
  • Create standardized evaluation protocols through METR and ARC
  • Implement basic capability tracking via MLPerf extensions
  • Develop tripwire governance frameworks with Partnership on AI

Phase 2: Expansion (Months 12-24)

  • Deploy independent behavioral evaluation infrastructure
  • Establish international information sharing protocols
  • Create social impact monitoring dashboards
  • Implement first-generation tripwire responses

Phase 3: Optimization (Months 24-36)

  • Refine detection thresholds based on empirical data
  • Expand monitoring to emerging risk categories
  • Develop adversarial-robust evaluation methods
  • Scale international coordination mechanisms

Key Uncertainties and Cruxes

Detection Feasibility

UncertaintyOptimistic CasePessimistic CaseCurrent Evidence
Advanced AI deception detectabilitySophisticated tests can identify strategic deceptionAI becomes undetectably deceptiveMixed results from Anthropic sleeper agent research
Capability generalization predictabilityBenchmark performance predicts real-world capabilitySignificant gap between benchmarks and deploymentGPT-4 evaluation gaps documented
Behavioral consistency across contextsLab evaluations predict deployment behaviorSignificant context-dependent variationLimited deployment monitoring data
International monitoring cooperationEffective information sharing achievedNational security concerns prevent cooperationMixed precedents from other domains

Response Effectiveness

The effectiveness of predetermined responses to warning signs remains highly uncertain, with limited empirical evidence about what interventions successfully mitigate emerging AI risks.

Response credibility: Pre-committed responses may not be honored when economic or competitive pressure intensifies. Historical precedents from climate change and financial regulation suggest that advance commitments often weaken at decision points.

Intervention effectiveness: Most proposed interventions (deployment pauses, additional safety research, policy responses) lack empirical validation for their ability to reduce AI risks. The field relies heavily on theoretical arguments about intervention effectiveness.

Coordination sustainability: Multi-stakeholder coordination for monitoring and response faces collective action problems that may intensify as economic stakes grow and geopolitical tensions increase.

Current State and Trajectory

Monitoring Infrastructure Development

Several initiatives are establishing components of the warning signs framework, but coverage remains fragmentary and uncoordinated.

Government initiatives: The UK AI Safety Institute and proposed US AI Safety Institute represent significant steps toward independent evaluation capacity. However, both organizations are resource-constrained and lack authority for mandatory reporting or response coordination.

Industry self-regulation: Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework include elements of warning signs monitoring and tripwire responses. However, these commitments are voluntary, uncoordinated across companies, and lack external verification.

Academic research: Organizations like METR, ARC, and Apollo Research are developing evaluation methodologies, but their access to frontier models remains limited and funding is insufficient for comprehensive monitoring.

Five-Year Trajectory Projections

Based on current trends and announced initiatives, the warning signs monitoring landscape in 2029 will likely feature:

Capability2024 Status2029 ProjectionConfidence
Systematic capability trackingFragmentedModerate coverage via AI Safety InstitutesMedium
Independent behavioral evaluationMinimalLimited but growing capacityMedium
Incident reporting infrastructureAd hocBasic systematic trackingHigh
International coordinationNascentBilateral/multilateral frameworks emergingLow
Tripwire governanceConceptualSome implementation in major economiesLow

The most likely outcome is partial progress on monitoring infrastructure without commensurate development of governance systems for response. This creates the dangerous possibility of detecting warning signs without capacity for effective action.

Comparative Analysis

Historical Precedents

DomainWarning System QualityResponse EffectivenessLessons for AI
Financial crisis monitoringModerate: Some indicators trackedPoor: Known risks materializedNeed pre-committed response protocols
Pandemic surveillanceGood: WHO global monitoringVariable: COVID response fragmentedImportance of international coordination
Nuclear proliferationGood: IAEA monitoring regimeModerate: Some prevention successesValue of verification and consequences
Climate change trackingExcellent: Comprehensive measurementPoor: Insufficient policy responseDetection ≠ action without governance

The climate change analogy is particularly instructive: highly sophisticated monitoring systems have provided increasingly accurate warnings about risks, but institutional failures have prevented adequate response despite clear signals.

Other Risk Domains

AI warning signs monitoring can learn from more mature risk assessment frameworks:

  • Financial systemic risk: Federal Reserve stress testing provides model for mandatory capability evaluation
  • Cybersecurity threat detection: CISA information sharing demonstrates feasibility of coordinated monitoring
  • Public health surveillance: CDC disease monitoring shows real-time tracking at scale
  • Nuclear safety: Nuclear Regulatory Commission provides precedent for licensing with safety milestones

Expert Perspectives

Leading researchers emphasize different aspects of warning signs frameworks based on their risk models and expertise areas.

Dario Amodei (Anthropic CEO) has argued that "responsible scaling policies must define concrete capability thresholds that trigger safety requirements," emphasizing the need for predetermined responses rather than ad hoc decision-making. Anthropic's approach focuses on creating "if-then" commitments that remove discretion at evaluation points.

Dan Hendrycks (Center for AI Safety) advocates for "AI safety benchmarks that measure existential risk-relevant capabilities," arguing that current evaluation focused on helpfulness misses the most concerning capabilities. His work emphasizes the importance of red-teaming and adversarial evaluation.

Geoffrey Hinton has warned that "we may not get warning signs" for the most dangerous AI capabilities, expressing skepticism about detection-based approaches. This perspective emphasizes the importance of proactive measures rather than reactive monitoring.

Stuart Russell argues for "rigorous testing before deployment" with emphasis on worst-case scenario evaluation rather than average-case performance metrics, highlighting the difficulty of detecting rare but catastrophic behaviors.

Sources & Resources

Academic Research

SourceContributionAccess
Anthropic Constitutional AI ResearchBehavioral evaluation methodologiesOpen
Redwood Research InterpretabilityDeception detection techniquesOpen
CHAI Safety EvaluationCorrigibility testing frameworksAcademic
MIRI Agent FoundationsTheoretical warning sign analysisOpen

Policy and Governance

SourceContributionAccess
NIST AI Risk Management FrameworkGovernment monitoring standardsPublic
Partnership on AI Safety FrameworkIndustry coordination mechanismsPublic
EU AI Act ImplementationRegulatory monitoring requirementsPublic
UK AI Safety Institute EvaluationsIndependent evaluation approachesLimited public

Industry Frameworks

SourceContributionAccess
Anthropic Responsible Scaling PolicyTripwire implementation examplePublic
OpenAI Preparedness FrameworkRisk threshold methodologyPublic
DeepMind Frontier Safety FrameworkCapability evaluation approachPublic
MLPerf BenchmarkingStandardized capability measurementPublic

Monitoring Organizations

OrganizationFocusAssessment Access
METR (Model Evaluation & Threat Research)Behavioral evaluation, dangerous capabilitiesLimited
ARC (Alignment Research Center)Autonomous replication evaluationResearch partnerships
Apollo ResearchDeception and situational awarenessAcademic collaboration
Epoch AICompute and capability forecastingPublic research

International Coordination

InitiativeScopeStatus
AI Safety Summit ProcessInternational cooperation frameworksOngoing
Seoul Declaration on AI SafetyShared safety commitmentsSigned 2024
OECD AI Policy ObservatoryPolicy coordinationActive monitoring
UN AI Advisory BodyGlobal governance frameworkDevelopment phase

References

The Alignment Research Center (ARC) is a non-profit research organization focused on technical AI alignment and safety research. ARC works on understanding and addressing risks from advanced AI systems, including interpretability, evaluations, and identifying dangerous AI capabilities before deployment.

RAND Corporation is a nonprofit research organization providing objective analysis and policy recommendations across a wide range of topics including national security, technology, governance, and emerging risks. It produces influential studies on AI policy, cybersecurity, and global governance challenges. RAND's work is frequently cited by governments and policymakers worldwide.

★★★★☆

The CDC is the United States' primary federal public health agency, responsible for disease surveillance, outbreak response, health promotion, and biosafety standards. It serves as a key institutional model for monitoring, early-warning systems, and coordinated responses to emerging threats. Its frameworks for epidemiological tracking and biosecurity are often referenced in AI safety discussions around pandemic risk and biosecurity governance.

Partnership on AI (PAI) is a nonprofit coalition of AI researchers, civil society organizations, academics, and companies working to develop best practices, conduct research, and shape policy around responsible AI development. It brings together diverse stakeholders to address challenges including safety, fairness, transparency, and the societal impacts of AI systems. PAI serves as a coordination hub for cross-sector dialogue on AI governance.

★★★☆☆

Epoch AI is a research organization focused on investigating and forecasting trends in artificial intelligence, particularly around compute, training data, and algorithmic progress. They produce empirical analyses and datasets to inform understanding of AI development trajectories and support better decision-making in AI governance and safety.

★★★★☆
6McKinsey Global InstituteMcKinsey & Company

The McKinsey Global Institute (MGI) is the research arm of McKinsey & Company, producing reports on economic and business trends including AI's impact on productivity, labor markets, and global industries. MGI frequently publishes influential analyses on AI adoption, workforce transformation, and technology governance that inform corporate and policy discussions.

★★★☆☆
7EU AI Act – Official Resource Hubartificialintelligenceact.eu

The EU AI Act is the world's first comprehensive legal framework for artificial intelligence, establishing a risk-based classification system for AI applications. It imposes varying obligations on developers and deployers depending on the risk level of their AI systems, from minimal-risk to unacceptable-risk categories. The act sets precedents for global AI governance and compliance requirements.

DARPA is the U.S. Department of Defense's primary research agency focused on creating transformative technologies for national security. The homepage highlights current programs including autonomous systems (RACER mine-clearing), battlefield casualty care (Live Chain), and biosecurity challenges. DARPA funds high-risk, high-reward research across AI, autonomy, biotechnology, and other emerging domains relevant to AI safety and governance.

9AI Safety Summit 2023UK Government·Government

The official UK government page for the AI Safety Summit 2023, held November 1-2 at Bletchley Park, which convened governments, AI companies, civil society, and researchers to address frontier AI risks. Key outputs include the Bletchley Declaration—a multilateral agreement on AI safety—company safety policies, and a frontier AI capabilities and risks discussion paper. The summit marked a landmark moment in international AI governance coordination.

★★★★☆
10US AI Safety InstituteNIST·Government

NIST is the U.S. national metrology and standards institute, playing a central role in AI safety through the AI Risk Management Framework (AI RMF) and hosting the U.S. AI Safety Institute (AISI). It develops technical standards, evaluation frameworks, and guidance for trustworthy AI systems used by industry and government.

★★★★★

The UN Secretary-General's High-level Advisory Body on AI released 'Governing AI for Humanity' in September 2024, proposing a globally inclusive and distributed architecture for AI governance. The report includes seven recommendations to address gaps in current AI governance, calls for international cooperation on AI risks and opportunities, and is based on extensive global consultations involving over 2,000 participants across all regions.

★★★★☆

Apollo Research is an AI safety organization focused on evaluating frontier AI systems for dangerous capabilities, particularly 'scheming' behaviors where advanced AI covertly pursues misaligned objectives. They conduct LLM agent evaluations for strategic deception, evaluation awareness, and scheming, while also advising governments on AI governance frameworks.

★★★★☆

Anthropic introduces its Responsible Scaling Policy (RSP), a framework of technical and organizational protocols for managing catastrophic risks as AI systems become more capable. The policy defines AI Safety Levels (ASL-1 through ASL-5+), modeled after biosafety level standards, requiring increasingly strict safety, security, and operational measures tied to a model's potential for catastrophic risk. Current Claude models are classified ASL-2, with ASL-3 and beyond triggering stricter deployment and security requirements.

★★★★☆
14Redwood Research: AI Controlredwoodresearch.org

Redwood Research is a nonprofit AI safety organization that pioneered the 'AI control' research agenda, focusing on preventing intentional subversion by misaligned AI systems. Their key contributions include the ICML paper on AI Control protocols, the Alignment Faking demonstration (with Anthropic), and consulting work with governments and AI labs on misalignment risk mitigation.

METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. They have developed the 'Time Horizon' metric measuring how long AI agents can autonomously complete software tasks, showing exponential growth over recent years. They work with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.

★★★★☆

The Seoul Declaration on AI Safety is an international governmental agreement emerging from the AI Seoul Summit, building on the Bletchley Declaration to advance cooperative commitments on AI safety, governance, and risk management among participating nations. The page is currently inaccessible, but the declaration represents a significant multilateral policy milestone in global AI governance.

17MITRE ATT&CK Frameworkattack.mitre.org

MITRE ATT&CK is a globally accessible, open knowledge base cataloging adversary tactics and techniques based on real-world observations. It provides a structured matrix of attack behaviors across enterprise, mobile, and ICS environments, used by defenders, researchers, and policymakers to build threat models and improve cybersecurity defenses.

The NIST AI RMF is a voluntary, consensus-driven framework released in January 2023 to help organizations identify, assess, and manage risks associated with AI systems while promoting trustworthiness across design, development, deployment, and evaluation. It provides structured guidance organized around core functions and is accompanied by a Playbook, Roadmap, and a Generative AI Profile (2024) addressing risks specific to generative AI systems.

★★★★★
19Bureau of Labor StatisticsBureau of Labor Statistics·Government

The U.S. Bureau of Labor Statistics is the principal federal agency responsible for measuring labor market activity, working conditions, and price changes in the U.S. economy. It publishes key economic indicators including employment figures, unemployment rates, Consumer Price Index (CPI), and wage data. These statistics serve as primary reference data for tracking labor market disruptions that could be associated with AI-driven automation.

★★★★★

MLCommons is an industry-academia consortium of 125+ members focused on developing open, standardized benchmarks and measurement tools for AI performance, safety, and efficiency. It produces widely-used benchmarks like MLPerf and safety evaluation frameworks to enable accountable, responsible AI development across the industry.

21Nuclear Regulatory Commissionnrc.gov·Government

The U.S. Nuclear Regulatory Commission (NRC) is the federal agency responsible for regulating civilian nuclear power plants and nuclear materials to protect public health, safety, and the environment. The site provides access to regulatory documents, event reports, public meeting schedules, licensing information, and policy updates including the ADVANCE Act and AI initiatives. It serves as the authoritative source for U.S. nuclear safety governance and oversight.

22Nature interview 2024Nature (peer-reviewed)·Paper

This is the homepage of Nature, a leading multidisciplinary scientific journal, displaying current news and research articles. The visible content includes stories on AI's influence on human expression, China's AI ambitions, and AI-driven memory shortages in labs, alongside biology and neuroscience research. No specific AI safety paper or interview is identifiable from the content provided.

★★★★★
23AI Safety Institute - GOV.UKUK Government·Government

The UK AI Safety Institute (recently rebranded as the AI Security Institute) is a government body under the Department for Science, Innovation and Technology focused on minimizing risks from rapid and unexpected AI advances. It conducts and publishes safety research, international coordination reports, and policy guidance, while managing grants for systemic AI safety research.

★★★★☆

MIRI is a nonprofit research organization focused on ensuring that advanced AI systems are safe and beneficial. It conducts technical research on the mathematical foundations of AI alignment, aiming to solve core theoretical problems before transformative AI is developed. MIRI is one of the pioneering organizations in the AI safety field.

★★★☆☆
25Federal Select Agent Program (CDC/USDA)selectagents.gov·Government

The Federal Select Agent Program is a joint CDC-USDA regulatory initiative governing the possession, use, and transfer of dangerous biological agents and toxins that could threaten public, animal, or plant health. It provides biosecurity infrastructure including entity inspections, personnel security risk assessments, a national database, and compliance enforcement. The program serves as a key model for biosecurity governance relevant to emerging biotechnology risks.

CHAI is a UC Berkeley research center dedicated to reorienting AI development toward systems that are provably beneficial and aligned with human values. It conducts technical and conceptual research on problems including value alignment, corrigibility, and AI safety, and serves as a major hub for academic AI safety work.

CISA is the U.S. federal agency responsible for cybersecurity and critical infrastructure protection. It coordinates national efforts to defend against cyber threats, shares threat intelligence, and sets security standards for government and private sector systems. Relevant to AI safety through its work on securing AI-enabled infrastructure and emerging technology risks.

★★★★☆

Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its family of AI assistants, with a stated mission of responsible development and maintenance of advanced AI for long-term human benefit.

★★★★☆

Gryphon Scientific is a consulting and research firm specializing in biosecurity, public health preparedness, and risk analysis. They conduct technical assessments and policy studies related to biological threats, dual-use research, and biosafety. Their work informs government and institutional decision-making on biological risk management.

Stanford's Human-Centered Artificial Intelligence (HAI) institute explores the intersection of AI companions and mental health, examining benefits, risks, and governance considerations of AI-powered emotional support tools. The resource reflects HAI's broader mission of responsible AI development that centers human well-being.

★★★★☆

OpenAI's Preparedness Framework outlines a systematic approach to tracking, evaluating, and mitigating catastrophic risks from frontier AI models. It establishes risk categories (CBRN, cybersecurity, model autonomy, persuasion), defines severity levels from 'low' to 'critical', and sets safety thresholds that must be met before model deployment or further scaling. The framework also describes organizational accountability structures including a Safety Advisory Group and board-level oversight.

★★★★☆

DeepMind's Frontier Safety Framework (FSF) establishes a structured approach to identifying and mitigating catastrophic risks from highly capable AI models before and during deployment. It introduces 'Critical Capability Levels' (CCLs) as thresholds that trigger enhanced safety evaluations, and outlines mitigation measures to prevent severe harms such as bioweapons development or AI autonomously undermining human oversight. The framework represents a concrete institutional commitment to capability-gated safety protocols.

★★★★☆

The UN Security Council is the primary international body responsible for maintaining international peace and security, with authority to impose sanctions, authorize military action, and establish peacekeeping operations. It serves as a key governance forum for addressing global threats, including emerging technology risks. Its decisions are binding on all UN member states.

★★★★☆

Gallup is a global analytics and advisory firm known for its public opinion polling, workplace engagement research, and large-scale surveys on societal trends. It produces data on public attitudes toward emerging technologies, AI, and institutional trust that can serve as indicators for monitoring societal responses. Its polling methodologies are widely used as reference data in policy and governance contexts.

★★★★☆
35OpenAI: Model BehaviorOpenAI·Rakshith Purushothaman·2025·Paper

This is OpenAI's research overview page describing their work toward artificial general intelligence (AGI). The page outlines OpenAI's mission to ensure AGI benefits all of humanity and highlights their major research focus areas: the GPT series (versatile language models for text, images, and reasoning), the o series (advanced reasoning systems using chain-of-thought processes for complex STEM problems), visual models (CLIP, DALL-E, Sora for image and video generation), and audio models (speech recognition and music generation). The page serves as a hub linking to detailed research announcements and technical blogs across these domains.

★★★★☆

The OECD AI Policy Observatory is a comprehensive platform tracking AI policy developments, principles, and governance frameworks across member and partner countries. It provides tools, data, and analysis to help policymakers and stakeholders understand and shape responsible AI development. It is the home of the OECD AI Principles, adopted in 2019 as the first intergovernmental standard on AI.

★★★★☆
37Federal Reserve System - Official Websitefederalreserve.gov·Government

The Federal Reserve is the central bank of the United States, responsible for monetary policy, financial system stability, and banking regulation. It provides data, research, and policy communications relevant to macroeconomic conditions. As a key financial regulatory institution, it may serve as a reference for understanding economic infrastructure that AI systems could interact with or impact.

Anthropic's research page aggregates their work across AI alignment, mechanistic interpretability, and societal impact assessment, all oriented toward understanding and mitigating risks from increasingly capable AI systems. It serves as a central hub for their published findings and ongoing safety-focused investigations.

★★★★☆

Related Wiki Pages

Top Related Pages

Risks

Corrigibility FailureEpistemic Collapse

Approaches

Compute MonitoringCorporate AI Safety Responses

Analysis

AI Policy EffectivenessAI Safety Intervention Effectiveness MatrixCorrigibility Failure PathwaysPower-Seeking Emergence Conditions Model

Other

Geoffrey HintonDario AmodeiStuart RussellDan Hendrycks

Organizations

Redwood ResearchMETR