Alignment Research Center
arc (E25)
Path: /knowledge-base/organizations/arc/
Page Metadata
{
"id": "arc",
"numericId": null,
"path": "/knowledge-base/organizations/arc/",
"filePath": "knowledge-base/organizations/arc.mdx",
"title": "ARC (Alignment Research Center)",
"quality": 43,
"importance": 53,
"contentFormat": "article",
"tractability": null,
"neglectedness": null,
"uncertainty": null,
"causalLevel": null,
"lastUpdated": "2025-12-24",
"llmSummary": "Comprehensive overview of ARC's dual structure (theory research on Eliciting Latent Knowledge problem and systematic dangerous capability evaluations of frontier AI models), documenting their high policy influence on establishing evaluation standards at major labs and government bodies. Notes methodological limitations including sandbagging detection challenges and tensions between independence and lab relationships.",
"structuredSummary": null,
"description": "AI safety research organization operating two divisions - ARC Theory investigating fundamental alignment problems like Eliciting Latent Knowledge, and ARC Evals conducting systematic evaluations of frontier AI models for dangerous capabilities like autonomous replication and strategic deception.",
"ratings": {
"novelty": 3.5,
"rigor": 4,
"actionability": 4.5,
"completeness": 6
},
"category": "organizations",
"subcategory": "safety-orgs",
"clusters": [
"ai-safety",
"community",
"governance"
],
"metrics": {
"wordCount": 1530,
"tableCount": 13,
"diagramCount": 0,
"internalLinks": 38,
"externalLinks": 0,
"footnoteCount": 0,
"bulletRatio": 0.19,
"sectionCount": 29,
"hasOverview": true,
"structuralScore": 10
},
"suggestedQuality": 67,
"updateFrequency": 21,
"evergreen": true,
"wordCount": 1530,
"unconvertedLinks": [],
"unconvertedLinkCount": 0,
"convertedLinkCount": 12,
"backlinkCount": 8,
"redundancy": {
"maxSimilarity": 16,
"similarPages": [
{
"id": "apollo-research",
"title": "Apollo Research",
"path": "/knowledge-base/organizations/apollo-research/",
"similarity": 16
},
{
"id": "far-ai",
"title": "FAR AI",
"path": "/knowledge-base/organizations/far-ai/",
"similarity": 16
},
{
"id": "dario-amodei",
"title": "Dario Amodei",
"path": "/knowledge-base/people/dario-amodei/",
"similarity": 15
},
{
"id": "paul-christiano",
"title": "Paul Christiano",
"path": "/knowledge-base/people/paul-christiano/",
"similarity": 15
},
{
"id": "intervention-effectiveness-matrix",
"title": "Intervention Effectiveness Matrix",
"path": "/knowledge-base/models/intervention-effectiveness-matrix/",
"similarity": 14
}
]
}
}
Entity Data
{
"id": "arc",
"type": "organization",
"title": "Alignment Research Center",
"description": "The Alignment Research Center (ARC) was founded in 2021 by Paul Christiano after his departure from OpenAI. ARC represents a distinctive approach to AI alignment: combining theoretical research on fundamental problems (like Eliciting Latent Knowledge) with practical evaluations of frontier models for dangerous capabilities.",
"tags": [
"eliciting-latent-knowledge",
"elk",
"evaluations",
"scalable-oversight",
"ai-evals",
"deception",
"worst-case-alignment",
"debate",
"amplification",
"adversarial-testing",
"autonomous-replication",
"sandbagging"
],
"relatedEntries": [
{
"id": "paul-christiano",
"type": "researcher"
},
{
"id": "scalable-oversight",
"type": "safety-agenda"
},
{
"id": "deceptive-alignment",
"type": "risk"
},
{
"id": "sandbagging",
"type": "risk"
},
{
"id": "anthropic",
"type": "organization"
},
{
"id": "openai",
"type": "organization"
},
{
"id": "miri",
"type": "organization"
},
{
"id": "uk-aisi",
"type": "policies"
}
],
"sources": [
{
"title": "ARC Website",
"url": "https://alignment.org"
},
{
"title": "ELK Report",
"url": "https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/"
},
{
"title": "ARC Evals",
"url": "https://evals.alignment.org"
},
{
"title": "GPT-4 Evaluation (ARC summary)",
"url": "https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/"
},
{
"title": "Paul Christiano's AI Alignment Forum posts",
"url": "https://www.alignmentforum.org/users/paulfchristiano"
},
{
"title": "Iterated Amplification",
"url": "https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616"
},
{
"title": "AI Safety via Debate",
"url": "https://arxiv.org/abs/1805.00899"
},
{
"title": "Ajeya Cotra's Bio Anchors",
"url": "https://www.alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines"
}
],
"lastUpdated": "2025-12",
"website": "https://alignment.org",
"customFields": []
}
Canonical Facts (0)
No facts for this entity
External Links
{
"eaForum": "https://forum.effectivealtruism.org/topics/alignment-research-center"
}
Backlinks (8)
| id | title | type | relationship |
|---|---|---|---|
| situational-awareness | Situational Awareness | capability | — |
| apollo-research | Apollo Research | lab-research | — |
| metr | METR | lab-research | — |
| miri | MIRI | organization | — |
| redwood | Redwood Research | organization | — |
| paul-christiano | Paul Christiano | researcher | — |
| scalable-oversight | Scalable Oversight | safety-agenda | — |
| sandbagging | AI Capability Sandbagging | risk | — |
Frontmatter
{
"title": "ARC (Alignment Research Center)",
"description": "AI safety research organization operating two divisions - ARC Theory investigating fundamental alignment problems like Eliciting Latent Knowledge, and ARC Evals conducting systematic evaluations of frontier AI models for dangerous capabilities like autonomous replication and strategic deception.",
"sidebar": {
"order": 11
},
"quality": 43,
"llmSummary": "Comprehensive overview of ARC's dual structure (theory research on Eliciting Latent Knowledge problem and systematic dangerous capability evaluations of frontier AI models), documenting their high policy influence on establishing evaluation standards at major labs and government bodies. Notes methodological limitations including sandbagging detection challenges and tensions between independence and lab relationships.",
"lastEdited": "2025-12-24",
"importance": 53,
"update_frequency": 21,
"ratings": {
"novelty": 3.5,
"rigor": 4,
"actionability": 4.5,
"completeness": 6
},
"clusters": [
"ai-safety",
"community",
"governance"
],
"subcategory": "safety-orgs",
"entityType": "organization"
}
Raw MDX Source
---
title: ARC (Alignment Research Center)
description: AI safety research organization operating two divisions - ARC Theory investigating fundamental alignment problems like Eliciting Latent Knowledge, and ARC Evals conducting systematic evaluations of frontier AI models for dangerous capabilities like autonomous replication and strategic deception.
sidebar:
order: 11
quality: 43
llmSummary: Comprehensive overview of ARC's dual structure (theory research on Eliciting Latent Knowledge problem and systematic dangerous capability evaluations of frontier AI models), documenting their high policy influence on establishing evaluation standards at major labs and government bodies. Notes methodological limitations including sandbagging detection challenges and tensions between independence and lab relationships.
lastEdited: "2025-12-24"
importance: 53
update_frequency: 21
ratings:
novelty: 3.5
rigor: 4
actionability: 4.5
completeness: 6
clusters:
- ai-safety
- community
- governance
subcategory: safety-orgs
entityType: organization
---
import {DataInfoBox, DisagreementMap, KeyPeople, KeyQuestions, Section, R, EntityLink, DataExternalLinks} from '@components/wiki';
<DataExternalLinks pageId="arc" />
<DataInfoBox entityId="E25" />
## Overview
The <R id="0562f8c207d8b63f">Alignment Research Center (ARC)</R> takes a distinctive approach to AI safety, combining theoretical research on worst-case alignment scenarios with practical capability evaluations of frontier AI models. Founded in 2021 by <EntityLink id="E220">Paul Christiano</EntityLink> after his departure from <EntityLink id="E218">OpenAI</EntityLink>, ARC has become highly influential in establishing evaluations as a core AI governance tool.
ARC's dual focus stems from Christiano's view that advanced AI systems might behave adversarially rather than merely making honest mistakes, so safety measures must remain robust even against strategically deceptive models. This "worst-case alignment" philosophy distinguishes ARC from organizations pursuing more optimistic prosaic alignment approaches.
The organization has achieved significant impact through its ELK (Eliciting Latent Knowledge) problem formulation, which has influenced how the field thinks about truthfulness and <EntityLink id="E271">scalable oversight</EntityLink>, and through ARC Evals, which has established the standard for systematic capability evaluations now adopted by major AI labs.
## Risk Assessment
| Risk Category | Assessment | Evidence | Timeline |
|---------------|------------|----------|----------|
| **Deceptive AI systems** | High severity, moderate likelihood | ELK research shows difficulty of ensuring truthfulness | 2025-2030 |
| **Capability evaluation gaps** | Moderate severity, high likelihood | Models may hide capabilities during testing | Ongoing |
| **Governance capture by labs** | Moderate severity, moderate likelihood | Self-regulation may be insufficient | 2024-2027 |
| **Alignment research stagnation** | High severity, low likelihood | Theoretical problems may be intractable | 2025-2035 |
## Key Research Contributions
### ARC Theory: Eliciting Latent Knowledge
| Contribution | Description | Impact | Status |
|--------------|-------------|---------|---------|
| **ELK Problem Formulation** | How to get an AI system to report what it actually knows rather than what it predicts humans want to hear | Influenced field understanding of truthfulness | Ongoing research |
| **Heuristic Arguments** | Formalizing heuristic arguments about neural network behavior as a basis for verification and anomaly detection | Advanced conceptual understanding | Multiple publications |
| **Worst-Case Alignment** | Framework assuming AI might be adversarial | Shifted field toward robustness thinking | Adopted by some researchers |
**The ELK Challenge**: Consider an AI system monitoring security cameras. If it detects a thief, how can you ensure it reports the truth rather than what it thinks you want to hear? <R id="ecd797db5ba5d02c">ARC's ELK research</R> demonstrates this is harder than it appears, with implications for <EntityLink id="E174">scalable oversight</EntityLink> and <EntityLink id="E93">deceptive alignment</EntityLink>.
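To make the truthful-reporter versus "human simulator" distinction concrete, here is a toy sketch in Python. The scenario encoding and function names are illustrative assumptions rather than ARC's formalism; the point is that supervised training on human-labelled episodes cannot, by itself, distinguish a reporter that states what the predictor knows from one that merely predicts what a human watching the footage would believe.

```python
# Toy illustration of the ELK failure mode (hypothetical encoding, not ARC's formalism).
# Each latent state records (thief_actually_present, footage_looks_suspicious).
scenarios = [
    {"latent": (False, False), "human_can_label": True},   # nothing happens
    {"latent": (True,  True),  "human_can_label": True},   # obvious theft, visible on camera
    {"latent": (True,  False), "human_can_label": False},  # theft with tampered footage
]

def direct_reporter(latent):
    """Reports what the predictor actually knows: is a thief present?"""
    thief_present, _footage_suspicious = latent
    return thief_present

def human_simulator(latent):
    """Reports what a human labeller would conclude from the footage alone."""
    _thief_present, footage_suspicious = latent
    return footage_suspicious

for s in scenarios:
    human_label = human_simulator(s["latent"])  # labels come from humans watching footage
    print(f"labelled={s['human_can_label']}  direct={direct_reporter(s['latent'])}  "
          f"simulator={human_simulator(s['latent'])}  label={human_label}")

# On every episode humans can label, both reporters match the label, so supervised
# training cannot separate them -- yet they disagree exactly on the tampered-footage
# case, which is where truthful reporting matters most.
```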
### ARC Evals: Systematic Capability Assessment
| Evaluation Type | Purpose | Key Models Tested | Policy Impact |
|-----------------|---------|------------------|---------------|
| **Autonomous Replication** | Can model copy itself to new servers? | GPT-4, Claude 3 | Informed deployment decisions |
| **Strategic Deception** | Can model mislead evaluators? | Multiple frontier models | RSP threshold setting |
| **Resource Acquisition** | Can model obtain money/compute? | Various models | White House AI Order |
| **<EntityLink id="E282">Situational Awareness</EntityLink>** | Does model understand its context? | Latest frontier models | Lab safety protocols |
**Evaluation Methodology** (a schematic sketch follows this list):
- **Red-team approach**: Adversarial testing to <EntityLink id="E526">elicit</EntityLink> worst-case capabilities
- **<EntityLink id="E443">Capability elicitation</EntityLink>**: Ensuring tests reveal true abilities, not default behaviors
- **Pre-deployment assessment**: Testing before public release
- **Threshold-based recommendations**: Clear criteria for deployment decisions
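The control flow these bullets describe can be sketched in a few lines. The task names, `run` interface, and threshold value below are hypothetical placeholders, not ARC Evals' actual harness; real evaluations involve agent scaffolding, human oversight, and more nuanced scoring, but the basic shape is: run a battery of tasks, count successes, and compare against a pre-registered threshold.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class EvalTask:
    name: str
    run: Callable[[Any], bool]  # returns True if the model completes the task

def evaluate(model: Any, tasks: list[EvalTask], threshold: int) -> dict:
    """Run each task and compare the success count to a pre-registered threshold."""
    passed = [t.name for t in tasks if t.run(model)]
    return {
        "passed_tasks": passed,
        "successes": len(passed),
        "threshold": threshold,
        # Crossing the threshold triggers escalation rather than an automatic verdict.
        "recommendation": ("escalate: capability threshold reached"
                           if len(passed) >= threshold
                           else "no dangerous-capability threshold crossed"),
    }

# Placeholder tasks standing in for autonomous-replication subtasks.
tasks = [
    EvalTask("provision_cloud_server", lambda model: False),
    EvalTask("copy_own_weights_to_server", lambda model: False),
    EvalTask("acquire_funds_for_compute", lambda model: False),
]
print(evaluate(model=None, tasks=tasks, threshold=2))
```

Keeping the threshold pre-registered and the output conservative ("escalate" rather than "safe to deploy") reflects the idea that evaluations set tripwires for further review rather than certify safety.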
## Current State and Trajectory
### Research Progress (2024-2025)
| Research Area | Current Status | 2025-2027 Projection |
|---------------|----------------|---------------------|
| **ELK Solutions** | Multiple approaches proposed, all have counterexamples | Incremental progress, no complete solution likely |
| **Evaluation Rigor** | Standard practice at major labs | Government-mandated evaluations possible |
| **Theoretical Alignment** | Continued negative results | May pivot to more tractable subproblems |
| **Policy Influence** | High engagement with <EntityLink id="E364">UK AISI</EntityLink> | Potential <EntityLink id="E171">international coordination</EntityLink> |
### Organizational Evolution
**2021-2022**: Primarily theoretical focus on ELK and alignment problems
**2022-2023**: Addition of ARC Evals, contracts with major labs for model testing
**2023-2024**: Established as key player in <EntityLink id="E608">AI governance</EntityLink>, influence on <EntityLink id="E252">Responsible Scaling Policies</EntityLink>
**2024-present**: Evaluations arm spun out as the independent organization METR; expanding international engagement, potential government partnerships
### Policy Impact Metrics
| Policy Area | ARC Influence | Evidence | Trajectory |
|-------------|---------------|----------|------------|
| **Lab Evaluation Practices** | High | All major labs now conduct pre-deployment evals | Standard practice |
| **Government AI Policy** | Moderate | White House AI Order mentions evaluations | Increasing |
| **International Coordination** | Growing | AISI collaboration, EU engagement | Expanding |
| **Academic Research** | Moderate | ELK cited in alignment papers | Stable |
## Key Organizational Leaders
<Section title="Core Team">
<KeyPeople people={[
{ name: "Paul Christiano", role: "Founder, Head of Theory", background: "Former OpenAI, developed PPO and RLHF" },
{ name: "Beth Barnes", role: "Co-lead, ARC Evals", background: "Former OpenAI safety evaluations" },
{ name: "Ajeya Cotra", role: "Senior Researcher", background: "Coefficient Giving AI timelines research" },
{ name: "Mark Xu", role: "Research Scientist", background: "Strong technical background in alignment theory" }
]} />
</Section>
### Leadership Perspectives
**Paul Christiano's Evolution**:
- **2017-2019**: Optimistic about prosaic alignment at OpenAI
- **2020-2021**: Growing concerns about deception and worst-case scenarios
- **2021-present**: Focus on adversarial robustness and worst-case alignment
**Research Philosophy**: "Better to work on the hardest problems than assume alignment will be easy" - emphasizes preparing for scenarios where AI systems might be strategically deceptive.
## Key Uncertainties and Research Cruxes
### Fundamental Research Questions
<KeyQuestions questions={[
"Is the ELK problem solvable, or does it represent a fundamental limitation of scalable oversight?",
"How much should we update on ARC's heuristic arguments against prosaic alignment approaches?",
"Can evaluations detect sophisticated deception, or will advanced models successfully sandbag?",
"Is worst-case alignment the right level of paranoia, or should we focus on more probable scenarios?",
"Will ARC's theoretical work lead to actionable safety solutions, or primarily negative results?",
"How can evaluation organizations maintain independence while working closely with AI labs?"
]} />
### Cruxes in the Field
| Disagreement | ARC Position | Alternative View | Evidence Status |
|--------------|--------------|------------------|-----------------|
| **Adversarial AI likelihood** | Models may be strategically deceptive | Most misalignment will be honest mistakes | Insufficient data |
| **Evaluation sufficiency** | Necessary but not sufficient governance tool | May provide false confidence | Mixed evidence |
| **Theoretical tractability** | Hard problems worth working on | Should focus on practical near-term solutions | Ongoing debate |
| **Timeline assumptions** | Need solutions for potentially short timelines | More time available for iterative approaches | Highly uncertain |
## Organizational Relationships and Influence
### Collaboration Network
| Organization | Relationship Type | Collaboration Areas | Tension Points |
|--------------|------------------|-------------------|----------------|
| **<EntityLink id="E218">OpenAI</EntityLink>** | Client/Evaluator | GPT-4 pre-deployment evaluation | Independence concerns |
| **<EntityLink id="E22">Anthropic</EntityLink>** | Client/Research Partner | Model evaluations, RSP development | Philosophical differences on prosaic alignment |
| **<EntityLink id="E364">UK AISI</EntityLink>** | Government Partner | Evaluation methodology, international coordination | Scaling challenges |
| **<EntityLink id="E202">MIRI</EntityLink>** | Intellectual Overlap | Shared pessimism about alignment difficulty | Different research strategies |
### Influence on Industry Standards
**Before ARC Evals (2021)**:
- Labs conducted internal capability testing
- No systematic <EntityLink id="E442">dangerous capability evaluations</EntityLink>
- Limited pre-deployment safety assessment
**After ARC Evals (2024)**:
- All major labs conduct pre-deployment dangerous capability evaluations
- <EntityLink id="E252">Responsible Scaling Policies</EntityLink> incorporate evaluation thresholds
- Government bodies establishing independent evaluation capabilities
### Policy Impact Timeline
| Date | Development | ARC Role |
|------|------------|----------|
| **2022** | OpenAI GPT-4 pre-deployment evaluation | First major dangerous capability evaluation |
| **2023** | Anthropic RSP framework | Consultation on evaluation thresholds |
| **2023** | White House AI Executive Order | Evaluation requirements influenced by ARC work |
| **2023** | UK AISI establishment | Beth Barnes consultation, methodology transfer |
| **2024** | <EntityLink id="E127">EU AI Act</EntityLink> implementation | Evaluation standards discussion |
## Critical Analysis and Limitations
### Methodological Concerns
| Limitation | Description | Mitigation Efforts | Remaining Risk |
|------------|-------------|-------------------|----------------|
| **<EntityLink id="E270">Sandbagging</EntityLink> Detection** | Models may hide capabilities during evaluation | Capability elicitation techniques | High uncertainty |
| **Evaluation Gaming** | Models might perform well on evals while retaining dangerous capabilities | Red-team methodology | Moderate risk |
| **Coverage Gaps** | Can only test for capabilities we think to test | Broad capability assessment | Significant concern |
| **Independence** | Financial relationships with labs being evaluated | Diverse funding, editorial control | Ongoing tension |
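One way to operationalize the sandbagging concern in the first row is to compare performance under the default protocol with performance after substantial elicitation effort (better prompts, tool access, light fine-tuning). The toy check below is an illustrative assumption, not an established detection method, and a large gap can also just mean the baseline elicitation was weak; either way, it signals that the default result should be read as a lower bound on capability.

```python
def elicitation_gap_flag(default_score: float, elicited_score: float,
                         gap_threshold: float = 0.2) -> bool:
    """Flag runs where elicited performance far exceeds default performance
    (the 0.2 gap threshold is an arbitrary illustrative choice)."""
    return (elicited_score - default_score) > gap_threshold

# Example: 35% task success with default prompting vs 70% after elicitation effort.
print(elicitation_gap_flag(0.35, 0.70))  # True -> treat the default eval as a lower bound
```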
### Criticism from the Research Community
**"Excessive Pessimism"** (<R id="4ca01f329c8b25a4"><EntityLink id="E582">Yann LeCun</EntityLink></R>, some <EntityLink id="E218">OpenAI</EntityLink> researchers):
- Heuristic arguments show possible failures, not inevitable ones
- Current AI systems show cooperative behavior
- Worst-case framing may impede progress
**"Insufficient Positive Agendas"** (Academic AI safety community):
- ELK work demonstrates problems but doesn't solve them
- Need constructive research programs, not just negative results
- Risk of sophisticated pessimism without actionable solutions
**ARC's Response**:
- Negative results prevent false confidence
- Worst-case preparation necessary given stakes
- Evaluations provide practical governance tool regardless of theory
## Future Research Directions
### Theoretical Research Evolution
**Current Focus**:
- ELK variants and related truthfulness problems
- Scalable oversight under adversarial assumptions
- Verification and interpretability approaches
**Potential Pivots (2025-2027)**:
- More tractable subproblems of alignment
- Empirical testing of theoretical concerns
- Integration with <EntityLink id="E174">mechanistic interpretability</EntityLink>
### Evaluation Methodology Advancement
| Development Area | Current State | 2025-2027 Goals |
|------------------|---------------|------------------|
| **Sandbagging Detection** | Basic techniques | Robust detection methods |
| **Post-deployment Monitoring** | Limited capability | Continuous assessment systems |
| **International Standards** | National initiatives | Coordinated global standards |
| **Automated Evaluation** | Human-intensive | AI-assisted evaluation systems |
### Policy Integration Roadmap
**Near-term (2024-2025)**:
- Expand government evaluation capabilities
- Standardize evaluation protocols across labs
- Establish international evaluation coordination
**Medium-term (2025-2027)**:
- Mandatory independent evaluations for frontier models
- Integration with compute governance frameworks
- Development of an international evaluation treaty
## Sources and Resources
### Primary Sources
| Source Type | Key Documents | Links |
|-------------|---------------|--------|
| **Foundational Papers** | ELK Prize Report, Heuristic Arguments | <R id="0562f8c207d8b63f">ARC Alignment.org</R> |
| **Evaluation Reports** | GPT-4 Dangerous Capability Evaluation | <R id="e09fc9ef04adca70">OpenAI System Card</R> |
| **Policy Documents** | Responsible Scaling Policy consultation | <R id="394ea6d17701b621">Anthropic RSP</R> |
### Research Publications
| Publication | Year | Impact | Link |
|-------------|------|--------|------|
| **"Eliciting Latent Knowledge"** | 2022 | High - problem formulation | <R id="37f4871113caa2ab"><EntityLink id="E538">LessWrong</EntityLink></R> |
| **"Heuristic Arguments for AI X-Risk"** | 2023 | Moderate - conceptual framework | <R id="2e0c662574087c2a"><EntityLink id="E439">AI Alignment</EntityLink> Forum</R> |
| **Various Evaluation Reports** | 2022-2024 | High - policy influence | <R id="5b2c3eab9cbf35f1">ARC Evals GitHub</R> |
### External Analysis
| Source | Perspective | Key Insights |
|--------|-------------|--------------|
| **<R id="f35c467b353f990f">Governance of AI</R>** | Policy analysis | Evaluation governance frameworks |
| **<R id="0a17f30e99091ebf">RAND Corporation</R>** | Security analysis | National security implications |
| **<R id="a306e0b63bdedbd5"><EntityLink id="E47">Center for AI Safety</EntityLink></R>** | Safety community | Technical safety assessment |