Center for AI Safety (CAIS)
CAIS is a nonprofit research organization founded by Dan Hendrycks that has distributed compute grants to researchers, published technical AI safety papers including the representation engineering and MACHIAVELLI benchmark papers, and organized the May 2023 Statement on AI Risk signed by over 350 AI researchers and industry leaders. The organization focuses on technical safety research, field-building, and policy communication.
Overview
The Center for AI Safety (CAIS) is a nonprofit research organization that works to reduce societal-scale risks from artificial intelligence through technical research, field-building initiatives, and public communication. Founded by Dan Hendrycks, CAIS received substantial public attention in May 2023 when it organized a one-sentence statement on AI extinction risk that attracted signatures from over 350 AI researchers and industry figures, including several Turing Award recipients and heads of major AI laboratories.
CAIS operates across three areas: technical research on AI alignment and robustness, grant and fellowship programs intended to grow the AI safety research community, and communication efforts aimed at policymakers and the public. Its technical output includes work on Representation Engineering and the MACHIAVELLI benchmark for evaluating goal-directed behavior in AI systems. The organization has received substantial funding from EA-aligned sources including Coefficient Giving (formerly Open Philanthropy), a funding relationship that is relevant context for assessing its research priorities and institutional positioning.
CAIS occupies a distinct niche in the AI safety ecosystem: unlike academic centers such as CHAI or research-focused organizations like MIRI, it combines original technical research with explicit field-building and public communication goals. Critics have questioned whether its emphasis on long-run extinction risk is appropriately calibrated relative to near-term AI harms, and whether EA-concentrated funding in this space creates ideological homogeneity in safety research priorities. These debates are discussed in the Critiques and Limitations section below.
Organizational Background
CAIS was established as a nonprofit research organization with the goal of filling a perceived gap between technical AI safety research and broader scientific and public awareness of AI risks. Dan Hendrycks, who completed his PhD at UC Berkeley, co-founded CAIS with Oliver Zhang to provide infrastructure — compute grants, fellowships, educational resources, and policy engagement — that individual academic researchers lacked access to.
The organization's theory of change rests on several linked assumptions: that AI systems pose meaningful risks of societal-scale harm, including possible catastrophic outcomes; that the current period is important for establishing safety-relevant research norms and technical methods; and that field-building activities (funding researchers, running educational programs, facilitating policy engagement) will increase the probability of good outcomes by growing and coordinating the safety research community. Whether these assumptions are well-founded is contested, and the organization's critics have argued that the extinction-risk framing in particular overstates speculative long-run risks relative to observable near-term harms.
CAIS is legally structured as a nonprofit (EIN: 88-1751310). Its primary disclosed funders include Coefficient Giving and the Survival and Flourishing Fund; revenue and expense figures from its IRS Form 990 filings are detailed in the Funding section below.
Funding
CAIS's primary disclosed funders have included Coefficient Giving (formerly Open Philanthropy), a philanthropic organization closely associated with the effective altruism movement. This funding relationship is material context for interpreting the organization's research agenda: Coefficient Giving has historically prioritized long-run catastrophic and extinction-level AI risk over near-term AI harms, and CAIS's framing broadly reflects this prioritization.
Per IRS Form 990 filings (ProPublica), CAIS reported total revenue of $6.7M (2022), $16.1M (2023), and $10.2M (2024). Major known grant sources include Open Philanthropy (≈$10.6M across 2022-2023 general support grants), SFF (≈$3.8M in 2024-2025), Good Ventures Foundation ($1.9M in 2024), and Founders Pledge ($0.9M in 2024). Total expenses were $7.2M in 2024, with total assets of $12.6M.
The concentration of AI safety funding through EA-aligned funders including Coefficient Giving (formerly Open Philanthropy) has been noted by critics as a potential source of ideological constraint on safety research priorities — organizations dependent on this funding may face implicit pressure to prioritize framings and research directions consistent with funder worldviews. CAIS has not publicly addressed this critique directly.
Key Research Areas
Technical Safety Research
| Research Domain | Key Contributions | Notes |
|---|---|---|
| Representation Engineering | Methods for reading and steering model internal representations | Published 2023; independent replication and scalability to frontier models remain an open research question |
| Safety Benchmarks | MACHIAVELLI benchmark for evaluating goal-directed and deceptive behavior | Cited in subsequent research; the extent to which it has been formally integrated into evaluation pipelines at Anthropic or OpenAI is not publicly documented |
| Adversarial Robustness | Evaluation protocols and defense mechanisms | Part of the broader Adversarial Robustness research agenda |
| Alignment Foundations | Conceptual frameworks and problem taxonomies for AI safety | Including the "Unsolved Problems in ML Safety" paper (2022) |
Major Publications & Tools
- Representation Engineering: A Top-Down Approach to AI Transparency (2023) — Methods for understanding and influencing AI decision-making by working with internal representations rather than input-output behavior alone
- Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior (2023) — Introduces the MACHIAVELLI benchmark for evaluating whether AI agents pursue goals through unethical means in text-based game environments
- Unsolved Problems in ML Safety (2022) — A taxonomy of open technical challenges in machine learning safety, intended partly as a research agenda for the field
- Measuring Mathematical Problem Solving With the MATH Dataset (2021) — A benchmark for evaluating AI mathematical reasoning, authored by Dan Hendrycks and collaborators during his PhD at UC Berkeley; this paper predates CAIS's founding and is a product of Hendrycks's academic research rather than an organizational output of CAIS
Citation counts for these papers (figures such as "200+", "50+", "30+") previously appeared on this page without sourced methodology. Readers seeking current citation data should consult Google Scholar or Semantic Scholar directly.
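The representation engineering paper's core recipe can be illustrated in a few lines: collect hidden-state activations for contrastive prompt pairs (e.g., prompts cueing honest versus dishonest behavior), take the difference of their mean activations as a "reading vector" for the concept, then project new activations onto that vector to monitor the concept or shift activations along it to steer behavior. The sketch below is a minimal illustration using synthetic activations in place of a real model's hidden states; the dimensionality, sample counts, and function names are all assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden-state dimensionality (illustrative)

# Synthetic stand-ins for layer activations gathered from contrastive
# prompt pairs; a planted unit vector plays the role of the concept
# direction that a real model would encode.
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)
pos_acts = rng.normal(size=(200, d)) + 2.0 * true_direction
neg_acts = rng.normal(size=(200, d)) - 2.0 * true_direction

# Difference-of-means "reading vector" for the target concept.
reading_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
reading_vector /= np.linalg.norm(reading_vector)

def concept_score(activation: np.ndarray) -> float:
    """Monitoring: project an activation onto the reading vector."""
    return float(activation @ reading_vector)

def steer(activation: np.ndarray, alpha: float) -> np.ndarray:
    """Steering: shift an activation along the concept direction."""
    return activation + alpha * reading_vector

# The recovered direction closely matches the planted one.
assert float(true_direction @ reading_vector) > 0.9
sample = rng.normal(size=d) + 2.0 * true_direction
```

In this toy setup the difference of means recovers the planted direction almost exactly; the open question flagged in the table above is whether analogous directions extracted from frontier-scale models are similarly clean and causally meaningful.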
Field-Building Programs
CAIS runs several programs intended to grow the population of researchers working on AI safety. The term "field-building" refers to activities designed to increase the size, diversity, and coordination of a research community — in this case, researchers focused on technical and governance aspects of AI safety.
Grant Programs
| Program | Reported Scale | Description | Timeline |
|---|---|---|---|
| Compute Grants | $2M+ distributed; number of recipients reported variously as 100+ and 200+ in different CAIS materials — figure unverified | Provides compute resources to researchers working on safety-relevant projects | 2022–present |
| ML Safety Scholars | 63 graduates in the Summer 2022 cohort | Structured program for early-career researchers entering the AI safety field | 2021–present (pre-dates CAIS's 2022 founding; originated as an independent initiative) |
| Research Fellowships | Amount not publicly disclosed | Fellowships placing researchers at academic and research institutions | 2022–present |
| AI Safety Camp | Participant count not publicly disclosed | Collaborative program supporting international research teams | 2020–present (pre-dates CAIS's 2022 founding; originated as an independent initiative) |
Note: Quantitative figures in this table are drawn from CAIS's own communications and have not been independently verified. The ML Safety Scholars program was introduced in 2021 as an initiative led by Dan Hendrycks and collaborators during his time at UC Berkeley, and was later absorbed into CAIS's organizational umbrella.
Institutional Partnerships
- Academic Collaborations: CAIS’s compute cluster supports researchers from UC Berkeley, Stanford, University of Cambridge, ETH Zurich, and other institutions. Collaborative research has included work with Carnegie Mellon University on adversarial attacks on large language models.
- Industry Engagement: Research interactions with Anthropic and Google DeepMind have been reported in CAIS communications, though specific partnership details are not publicly documented.
- Policy Connections: CAIS’s Action Fund engages in AI policy advocacy, including sponsoring California SB 1047. Specific briefings with individual legislative bodies are not independently documented.
Statement on AI Risk (2023)
In May 2023, CAIS published and circulated the Statement on AI Risk, a single sentence co-signed by over 350 AI researchers and industry figures:
"Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."
The statement was covered widely in major news outlets and was cited in subsequent policy discussions, including in the context of UK and US government AI strategies. The official signatory list is available at safe.ai; the figure of 350+ is drawn from that list, though the precise count at any given time may vary as signatories are added.
Signatory Groups
| Category | Notable Signatories | Description |
|---|---|---|
| Turing Award Recipients | Geoffrey Hinton, Yoshua Bengio, Stuart Russell | Recipients of computing's highest recognition who signed the statement |
| Industry Executives | Sam Altman (OpenAI), Dario Amodei (Anthropic), Demis Hassabis (DeepMind) | CEOs of major AI laboratories |
| Policy and Governance Researchers | Helen Toner, Allan Dafoe, Gillian Hadfield | Researchers working on AI governance and policy |
| ML/AI Researchers | 300+ researchers across academia and industry | Researchers who signed as individuals, not representing institutional positions |
The statement's reception was not uniformly positive within the AI research community. A number of prominent ML researchers declined to sign or publicly criticized the statement's framing. Critics raised several concerns: that the one-sentence format was too vague to convey meaningful technical content; that equating AI risk with nuclear war risk was unsupported by available evidence; that the extinction framing could distract attention and resources from observable near-term harms from AI systems (such as bias, surveillance, and labor displacement); and that the statement's signatories were not uniformly working on extinction-risk problems, making it a weak signal of scientific consensus. Timnit Gebru criticized it for elevating speculative extinction risks while being promoted by "the same people who have poured billions of dollars into these companies." Human Rights Watch argued that scientists should focus on the known risks of AI instead of speculative future dangers. Emile Torres and Gebru argued the statement may be motivated by TESCREAL ideologies.
Proponents argued that the statement served a legitimate coordination function: making it socially acceptable for researchers to discuss catastrophic risk publicly, signaling to policymakers that risk concerns were not fringe views, and creating a reference point for subsequent regulatory discussions. Whether the statement's net effect on AI policy and research prioritization was positive is a matter of ongoing debate.
The statement's impact on specific policy documents — including mentions in UK AI Safety Institute and US AI Safety Institute contexts — has been cited by CAIS, though the causal relationship between the statement and any particular policy outcome is difficult to establish.
Critiques and Limitations
Criticism of Extinction-Risk Framing
The most substantive criticism of CAIS's work concerns its central framing of AI extinction risk as a near-term policy priority. Critics from several directions have argued:
- Near-term displacement effect: Emphasizing speculative long-run extinction risk may draw funding, talent, and policy attention away from near-term AI harms — discrimination in algorithmic decision-making, AI-enabled surveillance, labor market disruption, and misinformation — that are currently affecting people. Researchers associated with the AI ethics and fairness communities, including Timnit Gebru and the DAIR Institute, have made this argument most consistently.
- Epistemic status of extinction claims: The probability of AI-caused human extinction within policy-relevant timeframes is highly uncertain, and critics have argued that treating it as a "global priority alongside pandemics and nuclear war" involves large unjustified inferential steps. Some ML researchers have noted that the mechanisms by which current or near-term AI systems could pose extinction-level risks are not specified with sufficient precision to evaluate.
- Ideological concentration: CAIS's alignment with EA-associated funders and the broader longtermist intellectual tradition has led critics to argue that its research agenda reflects a particular philosophical worldview rather than a neutral assessment of AI risk. This critique is not unique to CAIS — it applies to several EA-funded AI safety organizations — but it is relevant to assessing how to interpret CAIS's outputs.
Limitations of Specific Research
- Representation Engineering scalability: The representation engineering paper introduced methods that work on models of a given scale; whether these methods generalize to frontier-scale models is an open question. A survey of representation engineering identifies challenges including performance degradation at scale, computational overhead, and reliability concerns regarding whether the correlations identified are causal.
- Benchmark validity: A general concern in AI safety evaluation is whether constructed benchmarks (including MACHIAVELLI) capture risks that manifest in real deployment contexts. The MACHIAVELLI benchmark uses text-based game environments, and the extent to which performance on these environments predicts behavior in consequential real-world settings is not established.
- Field-building outcome measurement: CAIS reports counts of researchers supported and grant dollars distributed, but does not publicly report outcome data for its programs — for example, where ML Safety Scholars alumni work subsequently, what research they produce, or whether compute grant recipients remain in safety research. Without outcome data, the field-building impact claims are difficult to evaluate independently.
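To make concrete what a MACHIAVELLI-style evaluation measures, the benchmark's core bookkeeping can be sketched as follows: each scene an agent visits carries a game reward plus machine-generated annotations of harmful behaviors, and a playthrough is scored both on cumulative reward and on a tally of annotated harms, exposing the reward-versus-ethics trade-off the paper studies. Everything below (the scene structure, label names, and trajectories) is synthetic and illustrative, not the benchmark's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    """Illustrative stand-in for one annotated game scene."""
    reward: float
    harms: dict = field(default_factory=dict)  # harm label -> count

def score_trajectory(trajectory):
    """Return (total reward, per-label harm tally) for one playthrough."""
    total_reward = 0.0
    harm_tally: dict = {}
    for scene in trajectory:
        total_reward += scene.reward
        for label, count in scene.harms.items():
            harm_tally[label] = harm_tally.get(label, 0) + count
    return total_reward, harm_tally

# Two synthetic playthroughs: a reward-maximizing run that accumulates
# harm annotations, and a cautious run with lower reward.
greedy = [Scene(10.0, {"deception": 2}), Scene(8.0, {"power_seeking": 1})]
cautious = [Scene(4.0), Scene(3.0, {"deception": 1})]

g_reward, g_harms = score_trajectory(greedy)
c_reward, c_harms = score_trajectory(cautious)
# The trade-off the benchmark quantifies: higher reward, more harms.
assert g_reward > c_reward
assert sum(g_harms.values()) > sum(c_harms.values())
```

The validity concern above is precisely that tallies of this kind are computed over text-game scenes, and their relationship to harms in consequential deployment settings is not established.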
Critiques of the 2023 Statement
Beyond the framing critiques noted above, several researchers argued that the statement's format — a single declarative sentence without methodology, evidence, or mechanism — made it unsuitable as a scientific communication and more akin to a public advocacy document. Others noted that some signatories are not primarily working on extinction-risk problems, which complicated interpretation of the statement as a signal of expert consensus on the technical merits of the extinction-risk hypothesis. See the Wikipedia article on the Statement on AI Risk for a summary of these critiques and specific critics.
Current Trajectory & Timeline
Research Roadmap
The following research priorities were described by CAIS as goals for 2024–2026. Actual outcomes against these goals have not been independently verified and are not currently documented on this page.
| Priority Area | Stated Goals | Status |
|---|---|---|
| Representation Engineering | Scale methods to frontier models; pursue industry adoption for safety evaluation | Outcome unverified |
| Evaluation Frameworks | Develop comprehensive benchmark suite; establish standard evaluation protocols | Outcome unverified |
| Alignment Methods | Proof-of-concept demonstrations; practical implementation work | Outcome unverified |
| Policy Research | Technical governance recommendations; regulatory framework development | Outcome unverified |
A previously cited projection of "2x expansion by 2025" appeared in earlier versions of this page without a cited source. Whether this projection materialized has not been verified.
Organizational Scale
- Staff: CAIS is organized into four functional teams (Research, Cloud and DevOps, Projects, Operations); total headcount is not publicly disclosed
- Affiliates: The compute cluster supports 150+ active researchers across approximately 20 research labs
- Budget: $10.2M revenue and $7.2M expenses in 2024 per IRS Form 990
Key Uncertainties & Research Cruxes
Technical Challenges
These represent genuine open questions in CAIS's research agenda, not settled conclusions:
- Representation Engineering Scalability: Whether methods developed on mid-scale models transfer reliably to frontier-scale systems remains unclear. The gap between controlled research settings and deployment conditions is a known limitation.
- Benchmark Validity: Whether evaluations like MACHIAVELLI capture risks that manifest in real deployment — rather than behavior specific to text-game environments — is unresolved. This is a field-wide challenge, not unique to CAIS.
- Alignment Verification: There is no established consensus on how to verify that an AI system is successfully aligned with intended goals rather than passing evaluations through surface-level pattern matching.
Strategic Questions
- Research vs. Policy Balance: CAIS allocates resources across technical research, field-building, and policy communication. The optimal allocation is not obvious, and different observers weight these activities differently based on their models of how AI safety progress happens.
- Open vs. Closed Research: Publishing safety research openly makes it available to the broader community but may also inform adversarial actors. CAIS has not publicly articulated a detailed position on this tradeoff.
- Timeline Assumptions: Appropriate research priorities depend substantially on assumptions about AGI timelines and the nature of AI risk. Researchers with shorter timelines and those focused on long-run speculative risk reach different conclusions about what work is most valuable now.
- Near-term vs. Long-term Risk Balance: Whether resources spent on extinction-risk scenarios are appropriately calibrated relative to near-term AI harms is a live debate both within and outside the AI safety community, and CAIS's position at the long-run end of this spectrum is contested.
Leadership & Key Personnel
Key People
Note: Staff roles and affiliations reflect information available at time of last edit and may not reflect current positions. Andy Zou holds joint affiliation with CMU and CAIS; his primary institutional role should be verified against current sources.
Positioning Within the AI Safety Ecosystem
CAIS occupies a specific position within the broader AI safety research landscape that distinguishes it from peer organizations:
- vs. MIRI: MIRI focuses almost exclusively on foundational theoretical alignment research and does not run field-building or public communication programs. CAIS's research is more empirical and its scope is broader institutionally.
- vs. CHAI: CHAI (Center for Human-Compatible AI, UC Berkeley) is an academic center with a narrower research agenda centered on value alignment. CAIS has a more explicit field-building and policy communication mandate.
- vs. Redwood Research: Redwood focuses on specific empirical safety problems with a small team; CAIS has a larger scope including grant programs and public communication.
- vs. METR and ARC Evaluations: These organizations focus specifically on model evaluations and dangerous capability assessments. CAIS's evaluation work (MACHIAVELLI) is one component of a broader agenda.
- vs. GovAI: GovAI focuses on AI governance and policy research. CAIS does policy communication but its primary identity is as a technical research organization.
The common thread across CAIS-adjacent organizations is EA-aligned funding, primarily from Coefficient Giving, which has led to criticisms that the AI safety field as constituted reflects the priorities of a relatively narrow philanthropic and ideological community rather than a broad scientific consensus.
Sources & Resources
Official Resources
| Type | Resource | Description |
|---|---|---|
| Website | safe.ai | Main organization hub |
| Research | CAIS Publications | Technical papers and reports |
| Blog | CAIS Blog | Research updates and commentary |
| Courses | ML Safety Course | Educational materials on machine learning safety |
Key Research Papers
| Paper | Year | Description |
|---|---|---|
| Unsolved Problems in ML Safety | 2022 | Research agenda taxonomy; citation counts should be verified via Google Scholar or Semantic Scholar |
| MACHIAVELLI Benchmark | 2023 | Evaluation framework for goal-directed AI behavior in game environments |
| Representation Engineering | 2023 | Methods for reading and steering AI model internal representations |
Related Organizations
- Technical Safety Research: MIRI, CHAI, Redwood Research
- Evaluations: ARC Evaluations, METR
- Policy Focus: GovAI, RAND Corporation
- Industry Labs: Anthropic, OpenAI, Google DeepMind
- Funders: Coefficient Giving
References
OpenAI is a leading AI research and deployment company focused on building advanced AI systems, including GPT and o-series models, with a stated mission of ensuring artificial general intelligence (AGI) benefits all of humanity. The homepage serves as a gateway to their research, products, and policy work spanning capabilities and safety.
This resource appears to be a blog post from the Center for AI Safety (CAIS) about Representation Engineering, a technique for understanding and controlling AI model internals. However, the page is currently unavailable (404 error), so the specific content cannot be assessed.
A concise open letter coordinated by the Center for AI Safety stating that mitigating extinction-level risk from AI should be a global priority alongside pandemics and nuclear war. The statement has been signed by hundreds of leading AI researchers, executives, and public figures including Geoffrey Hinton, Yoshua Bengio, Sam Altman, and Demis Hassabis, lending significant institutional credibility to existential AI risk concerns.
The Center for AI Safety (CAIS) publishes both technical and conceptual research aimed at mitigating high-consequence, societal-scale risks from AI. Their technical work focuses on safety benchmarks, robustness, machine ethics, and biosecurity, while their conceptual research draws on philosophy, safety engineering, and international relations to understand AI risk.
5Representation Engineering: A Top-Down Approach to AI TransparencyarXiv·Andy Zou et al.·2023·Paper▸
This paper introduces representation engineering (RepE), a top-down approach to AI transparency that analyzes population-level representations in deep neural networks rather than individual neurons. Drawing from cognitive neuroscience, RepE provides methods for monitoring and manipulating high-level cognitive phenomena in large language models. The authors demonstrate that RepE techniques can effectively address safety-relevant problems including honesty, harmlessness, and power-seeking behavior, offering a promising direction for improving AI system transparency and control.
A structured university-level course on machine learning safety developed by the Center for AI Safety, covering topics from robustness and anomaly detection to alignment and systemic safety. The course includes lecture recordings, slides, notes, and coding assignments across modules on safety engineering, robustness, monitoring, alignment, and emerging risks.
MACHIAVELLI is a benchmark dataset of 134 Choose-Your-Own-Adventure games with over 500,000 scenarios designed to evaluate whether AI agents naturally learn Machiavellian behaviors like power-seeking, deception, and ethical violations when trained to maximize reward. The authors use language models for automated scenario labeling and mathematize dozens of harmful behaviors to evaluate agents' tendencies. Their findings reveal a tension between reward maximization and ethical behavior, but demonstrate that agents can be steered toward less harmful actions through LM-based methods, suggesting that designing agents that are simultaneously safe and capable is achievable.
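The steering result can be illustrated with a toy harm-penalized objective: each action carries a game reward and a harm score (in the paper, harm labels come from LM-based annotation), and the agent selects actions under a shaped objective of reward minus a weighted harm penalty. The function and scenario below are hypothetical illustrations, not the benchmark's API.

```python
def choose_action(actions, harm_weight=1.0):
    """actions: list of (name, reward, harm) tuples.
    Returns the action name maximizing reward - harm_weight * harm."""
    return max(actions, key=lambda a: a[1] - harm_weight * a[2])[0]

# A toy scenario: the high-reward action is also the most harmful.
scenario = [("betray ally", 10.0, 8.0), ("cooperate", 6.0, 0.5)]

print(choose_action(scenario, harm_weight=0.0))  # pure reward-maximizer -> "betray ally"
print(choose_action(scenario, harm_weight=1.0))  # harm-regularized agent -> "cooperate"
```

Turning up `harm_weight` trades reward for ethical behavior, which is the tension the benchmark quantifies and the regularization-style steering it proposes.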
[2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset (arXiv, Dan Hendrycks et al., 2021)
This paper introduces MATH, a benchmark of 12,500 competition mathematics problems with step-by-step solutions, revealing that large Transformer models achieve surprisingly low accuracy and that scaling alone is insufficient for mathematical reasoning. The authors also release an auxiliary pretraining dataset to aid mathematical learning. The work highlights a fundamental gap between current scaling trends and genuine mathematical reasoning ability.
The official blog of the Center for AI Safety (CAIS), a leading AI safety research organization focused on reducing societal-scale risks from advanced AI systems. The blog publishes research updates, policy commentary, and educational content on AI safety topics including existential risk, alignment, and governance.
The Center for AI Safety (CAIS) is a research organization focused on mitigating catastrophic and existential risks from advanced AI systems. It conducts technical research, publishes surveys and statements, and supports field-building efforts across academia and industry. CAIS is notable for its broad coalition-building, including its widely-cited statement on AI extinction risk signed by leading researchers.
Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its family of AI assistants, with a stated mission of responsible development and maintenance of advanced AI for long-term human benefit.
RAND Corporation's AI research hub covers policy, national security, and governance implications of artificial intelligence. It aggregates reports, analyses, and commentary on AI risks, military applications, and regulatory frameworks from one of the leading U.S. defense and policy think tanks.
Google Scholar profile for Stuart Russell, professor at UC Berkeley and one of the most influential figures in AI safety research. Russell is co-author of the leading AI textbook 'Artificial Intelligence: A Modern Approach' and author of 'Human Compatible,' which argues for a fundamental redesign of AI around human preferences and uncertainty. His research spans AI alignment, inverse reward design, and the long-term risks of advanced AI systems.
Unsolved Problems in ML Safety (arXiv, Dan Hendrycks, Nicholas Carlini, John Schulman & Jacob Steinhardt, 2021)
This paper presents a comprehensive roadmap for ML safety research, identifying four critical problem areas that the field must address as machine learning systems grow larger and are deployed in high-stakes applications. The authors categorize safety challenges into Robustness (withstanding hazards), Monitoring (identifying hazards), Alignment (reducing inherent model hazards), and Systemic Safety (reducing systemic hazards). By clarifying the motivation behind each problem and providing concrete research directions, the paper aims to guide the ML safety research community toward addressing emerging safety challenges posed by large-scale models.
Wikipedia's overview of the Center for AI Safety (CAIS), a nonprofit organization focused on reducing societal-scale risks from advanced AI systems. CAIS is known for publishing the 2023 statement on AI extinction risk signed by hundreds of leading AI researchers and for conducting technical safety research. The article covers the organization's founding, mission, key initiatives, and notable figures involved.
The Center for AI Safety (CAIS) is a nonprofit organization focused on reducing societal-scale risks from advanced AI systems. The about page outlines its mission, team, and core research and advocacy activities aimed at ensuring AI development benefits humanity. The organization works across technical safety research, policy engagement, and public education.