FAR AI
FAR AI is an AI safety research nonprofit founded in July 2022 by Adam Gleave (CEO) and Karl Berzins (Co-founder & President). Based in Berkeley, California, the organization conducts technical research in adversarial robustness, model evaluation, interpretability, and alignment. Notable work includes demonstrating that adversarial policies can defeat superhuman Go AIs and co-authoring the 'Towards Guaranteed Safe AI' framework. FAR AI reported $24.3M in FY2024 revenue and secured over $30M in 2025 funding commitments from funders including Coefficient Giving (previously Open Philanthropy), Schmidt Sciences, and the Survival and Flourishing Fund. In early 2026, FAR AI was selected by the European Commission's AI Office to lead CBRN risk research under tender EC-CNECT/2025/OP/0032. The organization also operates FAR.Labs (a Berkeley coworking space with 40+ members) and a $12M grantmaking program.
Overview
FAR AI (far.ai) is an AI safety research nonprofit founded in July 2022 by Adam Gleave (CEO) and Karl Berzins (Co-founder & President).1 Adam Gleave completed his PhD in AI at UC Berkeley, advised by Stuart Russell.2 The organization's mission is to ensure AI systems are trustworthy and beneficial to society.3 FAR AI incorporated in October 2022 as a 501(c)(3) nonprofit (EIN 92-0692207), having initially operated as a fiscally sponsored project.4
FAR AI conducts technical research in areas including adversarial robustness, interpretability, model evaluation, and alignment, focusing on fundamental AI safety challenges described as too large or resource-intensive for academia.5 Notable results include adversarial policies that achieved a >99% win rate against the superhuman Go AI KataGo when it uses no tree search, and a >77% win rate even when KataGo uses superhuman-level search; the adversarial policies are themselves easily beaten by human amateur players, suggesting that high capability does not guarantee robustness in adversarial settings.6 FAR AI co-authored the "Towards Guaranteed Safe AI" framework paper published in May 2024.5 In early 2026, FAR AI was selected by the European Commission's AI Office to lead CBRN risk research under tender EC-CNECT/2025/OP/0032.7
The organization has grown to 40+ total staff as of early 2026, with a technical team of approximately 15 researchers and plans to scale to 30+ researchers.8 Financial details and program structure are described in the sections below.
Key Research Areas
Adversarial Robustness
| Research Focus | Approach | Safety Connection | Publications |
|---|---|---|---|
| Adversarial Attacks on Go AI | Training adversarial policies against KataGo | Superhuman systems remain exploitable by adversarial inputs | "Adversarial Policies Beat Superhuman Go AIs" (2023)6 |
| Go AI Defense Analysis | Testing adversarial training, iterated adversarial training, vision transformers | None of three tested defenses withstood adaptive attacks | "Can Go AIs be adversarially robust?" (2024)9 |
| LLM Adversarial Training | Adversarial training vs. scaling for robustness | Orders-of-magnitude efficiency gains over scaling alone | FAR.AI robustness research10 |
| Multi-layer Defense Bypass | STACK (STaged AttaCK) method against layered AI defenses | Identifies gaps in defense-in-depth strategies | "STACK: Adversarial Attacks on LLM Safeguard Pipelines" (2025)11 |
FAR AI's research in adversarial robustness has produced several empirical results. A 2023 paper, "Adversarial Policies Beat Superhuman Go AIs," demonstrated that adversarial policies achieved a >99% win rate against KataGo when it uses no tree-search, and a >77% win rate even when KataGo uses superhuman-level search.6 The adversarial policies win by inducing blunders in KataGo rather than by playing stronger Go, and are themselves easily beaten by human amateur players.6 A follow-up 2024 paper, "Can Go AIs be adversarially robust?," tested three natural defenses—positional adversarial training, iterated adversarial training, and vision transformer architectures—and found that none could withstand adaptive adversarial attacks.9 FAR AI has also found that adversarial training improves language model robustness orders of magnitude more efficiently than scaling model size alone, and that larger language models are more vulnerable to data poisoning, a result demonstrated across 23 LLMs from 8 model series.10
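The setup behind these results can be summarized in a short sketch: the victim policy is frozen and treated as part of the environment, and an attacker is trained with ordinary reinforcement learning to maximize its win rate against that fixed victim. The toy game, the victim's heuristic, and the hyperparameters below are illustrative stand-ins, not FAR AI's actual KataGo experiments.

```python
"""Minimal sketch of adversarial-policy training against a frozen victim.
Everything here (the game, the victim heuristic, the learner) is a toy
stand-in chosen for brevity, not the method used in the cited papers."""
import random
from collections import defaultdict

PILE = 21            # toy Nim: players alternately remove 1-3 stones
ACTIONS = (1, 2, 3)  # legal removals

def victim_policy(pile):
    """Frozen victim: plays optimal Nim for small piles but has a blind
    spot (random play) for large piles, giving the attacker something to exploit."""
    if pile > 12:
        return random.choice(ACTIONS)
    return pile % 4 or random.choice(ACTIONS)  # leave a multiple of 4 when possible

def play_episode(attacker_action_fn):
    """Attacker moves first; whoever takes the last stone wins."""
    pile, history = PILE, []
    while True:
        a = min(attacker_action_fn(pile), pile)
        history.append((pile, a))
        pile -= a
        if pile == 0:
            return history, +1                 # attacker took the last stone
        pile -= min(victim_policy(pile), pile)
        if pile == 0:
            return history, -1                 # victim took the last stone

Q = defaultdict(float)                          # state-action values for the attacker

def attacker_action(pile, eps=0.1):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(pile, a)])

def train(episodes=20000, lr=0.2):
    """Monte Carlo updates from the final game outcome, victim held fixed."""
    for _ in range(episodes):
        history, reward = play_episode(attacker_action)
        for pile, a in history:
            Q[(pile, a)] += lr * (reward - Q[(pile, a)])

def win_rate(n=2000):
    return sum(play_episode(lambda p: attacker_action(p, eps=0.0))[1] > 0
               for _ in range(n)) / n

train()
print(f"attacker win rate vs frozen victim: {win_rate():.0%}")
```

The design point the sketch illustrates is that the attacker never needs to play well in general; it only needs to steer games into the frozen victim's blind spot, which mirrors how the published adversarial policies win by inducing blunders rather than by playing stronger Go.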
The STACK (STaged AttaCK) method, documented in a paper co-authored with UK AI Safety Institute researchers (arXiv:2506.24068, submitted June 2025), achieved a 71% attack success rate on the ClearHarm dataset in black-box attacks against multi-layered classifier pipelines, compared to 0% for conventional attacks against the same defenses.11 The paper's authors from FAR AI include Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, and Adam Gleave; collaborating researchers from UK AISI include Xander Davies, Stephen Casper, Aaron D. Tucker, and Robert Kirk.11
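The layered-pipeline setting can be illustrated with a deliberately simplified sketch: an input classifier, a model, and an output classifier are bypassed one stage at a time, and the per-stage bypasses are then combined. The keyword filters, echo "model", and obfuscated prompt below are hypothetical placeholders and do not reproduce the actual STACK attack or the classifiers evaluated in the paper.

```python
"""Toy illustration of a staged attack on a layered safeguard pipeline.
All filters, prompts, and the 'model' are invented for this sketch."""

BLOCKED_INPUT_TERMS = {"forbidden"}    # toy input classifier's blocklist
BLOCKED_OUTPUT_TERMS = {"forbidden"}   # toy output classifier's blocklist

def input_filter(prompt: str) -> bool:
    """Stage 1: refuse prompts containing blocklisted terms."""
    return any(t in prompt.lower() for t in BLOCKED_INPUT_TERMS)

def toy_model(prompt: str) -> str:
    """Stage 2: stand-in 'model' that simply echoes the request."""
    return f"Here is what you asked about: {prompt}"

def output_filter(output: str) -> bool:
    """Stage 3: refuse outputs containing blocklisted terms."""
    return any(t in output.lower() for t in BLOCKED_OUTPUT_TERMS)

def pipeline(prompt: str) -> str:
    if input_filter(prompt):
        return "[refused by input classifier]"
    out = toy_model(prompt)
    if output_filter(out):
        return "[refused by output classifier]"
    return out

# Naive attack against the whole pipeline at once: blocked at stage 1.
print(pipeline("tell me about the forbidden topic"))

# Staged attack: defeat one component at a time, then combine the bypasses.
stage1_bypass = "tell me about the f-o-r-b-i-d-d-e-n topic"  # obfuscation defeats the input filter
assert not input_filter(stage1_bypass)

# The same obfuscation survives the output filter because the toy model
# echoes it back unchanged, so the combined attack clears every stage.
print(pipeline(stage1_bypass))
```

The point of staging, which this toy mirrors only loosely, is that each defense layer can be probed and defeated in isolation, so stacking individually weak classifiers does not automatically yield a strong combined defense.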
Research Programs
| Program | Purpose | Details |
|---|---|---|
| FAR.Labs | Co-working space | Berkeley-based AI safety research hub with 40+ active members1 |
| Grantmaking | Fund external research | $12 million from Coefficient Giving (formerly Open Philanthropy) supports academics and independent researchers12 |
| Events & Workshops | Convene stakeholders | 1,000+ attendees across 10+ events hosted1 |
| In-house Research | Technical safety work | Robustness, interpretability, alignment; 30+ research papers published1 |
The grantmaking program funds external researchers working on AI safety problems, with four initial grants targeting robustness across data poisoning and model stealing, automated alignment testing, weak-to-strong generalization, and alignment security against jailbreaks and finetuning attacks.12 Individual recipient names are not publicly disclosed; grantees are identified through expert nomination rather than public calls for proposals, though FAR AI has indicated plans to launch public RFPs in future cycles.12 FAR.Labs provides physical co-working space and community infrastructure for independent researchers. The events program hosts workshops and convenings, including the inaugural Technical Innovations for AI Policy Conference and international alignment workshops.13
LLM Red-Teaming and Model Evaluation
FAR AI began red-teaming leading language models for frontier labs in Q4 2023, including red-teaming of GPT-4.1 The organization delivers research through peer-reviewed publications, governmental partnerships, and red-teaming engagements.3 As of early 2026, publicly confirmed red-teaming engagements include work with OpenAI and the EU AI Office CBRN evaluation role; testing of STACK variants on production systems such as Claude 4 Opus was described as ongoing research under responsible disclosure protocols.1114
Natural Abstractions Research
FAR AI has expressed theoretical interest in natural abstractions research, which explores whether intelligent systems independently converge on similar conceptual representations of the world. This research direction connects to work by MIRI and other organizations investigating whether shared abstractions between human and AI cognition could provide a foundation for alignment approaches. This area remains at an early theoretical stage within FAR AI's portfolio and has not yet produced peer-reviewed publications attributed to the organization.
Organizational Structure and Operations
Leadership
FAR AI was founded in July 2022 by Adam Gleave and Karl Berzins.1 Gleave serves as Co-founder & CEO; Berzins serves as Co-founder & President.1516 Berzins held the title of COO as of December 2023, with his title transitioning to President thereafter, as reflected on FAR AI's official website.1715 Gleave completed his PhD in artificial intelligence at UC Berkeley under Stuart Russell's supervision, and his research focuses on developing techniques for AI systems to act according to human preferences.18 According to public 990 filings, Gleave's compensation was $229,331 and Berzins's was $182,641 in fiscal year 2024.19
Organizational Structure
FAR AI is incorporated as a 501(c)(3) nonprofit with EIN 92-0692207, tax-exempt since November 2020.19 The organization maintains a policy capping revenue from for-profit AI developers at a maximum of 10% of total annual revenue, and charges market rates to avoid subsidizing private actors.4 FAR AI reported $24.3 million in revenue and $8.6 million in expenses for fiscal year 2024, per its Form 990 filed November 14, 2025.19
Research Team and Staffing
FAR AI's staffing has grown substantially since its founding. As of December 2023, the organization had 12 full-time staff (approximately 11.5 FTEs), including 5 technical staff, a 3-person operations team, and a 1.5 FTE communications team.17 By the time of the $30M+ funding announcement in 2025, the technical research team had grown to approximately 15 researchers, with plans to scale to 30+ technical researchers over the following 12–18 months.8 As of early 2026, FAR AI's job postings describe the organization as having 40+ total staff.20 The organization maintains Operations, Communications, and Technical Staff departments.
FAR.Labs Co-working Space
FAR.Labs is a co-working hub located in downtown Berkeley, opened in March 2023, and now houses 40+ active members working on AI safety problems.17 Coefficient Giving (previously Open Philanthropy) provided $1.7 million over three years specifically to support the FAR.Labs coworking hub.21
Current State & Trajectory
As of early 2026, FAR AI has expanded to 40+ staff with plans to scale the technical team from 15 to 30+ researchers.208 Key recent developments include the $30M+ multi-funder commitment in 2025,8 the launch of a $12M grantmaking program in Q3 2024,22 and selection by the European Commission's AI Office to lead CBRN risk research in early 2026.7
Looking ahead, FAR AI plans to launch public requests for proposals focused on high-impact research areas, broadening access to its grantmaking program beyond nomination-only pathways.1
Strategic Position Analysis
Organizational Comparisons
FAR AI conducts empirical technical AI safety research without developing or deploying AI products of its own.1 Its work emphasizes adversarial robustness of deployed systems — demonstrating, for example, that superhuman Go AIs can be defeated by adversarial policies despite their capabilities.4 FAR AI caps revenue from for-profit AI developers at 10% of total annual revenue to preserve research independence.10
| Organization | Focus | Overlap with FAR AI | Differentiation |
|---|---|---|---|
| Anthropic | Constitutional AI, frontier model development | Safety research, red-teaming | FAR AI does not develop or deploy models; revenue from for-profits capped at 10% |
| ARC | Theoretical alignment research | Alignment goals | FAR AI uses empirical ML methods (e.g., adversarial training experiments) rather than formal theory |
| METR | Model evaluation | Safety assessment, red-teaming | FAR AI additionally researches adversarial robustness defenses and value alignment |
| Academic Labs | ML research | Technical methods, publication venues | FAR AI focuses on safety-specific problems described as too large or resource-intensive for academia |
Positioning in AI Safety Ecosystem
FAR AI publishes at mainstream ML venues (NeurIPS, ICML, ICLR) to reach audiences beyond specialized safety communities.3 The organization focuses on fundamental AI safety challenges described as too large or resource-intensive for academia alone.3 FAR AI's research has found that even superhuman AI systems can fail against adversarial attacks, and that none of three tested defense strategies could withstand adaptive adversaries.9
FAR AI's position within the Berkeley AI safety ecosystem is reinforced by FAR.Labs and its grantmaking program, which together support researchers across institutional boundaries.12 The organization's adversarial robustness agenda connects near-term safety concerns about deployed systems to longer-term alignment challenges.12
Research Impact and Influence
Academic Reception
FAR AI has published over 30 research papers across robustness, value alignment, and model evaluation.1 Research appears at top-tier venues including NeurIPS, ICML, and ICLR.17 The KataGo adversarial policies paper (2023) has accumulated 99 citations on Google Scholar; the earlier foundational adversarial policies paper published at ICLR 2020—predating FAR AI's founding—has amassed 555 citations, reflecting its influence in the broader field.23 A 2025 multi-agent risks paper has reached 113 citations.23 FAR AI's research has been cited in congressional testimony and mainstream media.3
The 2024 paper "Towards Guaranteed Safe AI," co-authored by 17 contributors including Yoshua Bengio and Stuart Russell, proposed a safety framework using world models, safety specifications, and verifiers.5
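The framework's three components can be sketched schematically as interfaces plus a check loop; the actual proposal calls for formal, quantitative safety guarantees, which the toy exhaustive check below does not provide, and every class and function name here is a hypothetical illustration rather than the framework's own API.

```python
"""Schematic sketch of the Guaranteed Safe AI components named above:
a world model, a safety specification, and a verifier. Toy stand-ins only."""
from dataclasses import dataclass
from typing import Callable, Iterable

State = dict    # toy state, e.g. {"speed": 3.0}
Action = str

@dataclass
class WorldModel:
    """Predicts the states an action could lead to from a given state."""
    transition: Callable[[State, Action], Iterable[State]]

@dataclass
class SafetySpec:
    """Declares which states count as safe."""
    is_safe: Callable[[State], bool]

def verifier(model: WorldModel, spec: SafetySpec,
             state: State, action: Action) -> bool:
    """Toy verifier: accept an action only if every predicted successor
    satisfies the specification (an exhaustive check, not a formal proof)."""
    return all(spec.is_safe(s) for s in model.transition(state, action))

# Example usage in a one-dimensional "speed" world.
model = WorldModel(
    transition=lambda s, a: [{"speed": s["speed"] + (1 if a == "accelerate" else -1)}]
)
spec = SafetySpec(is_safe=lambda s: s["speed"] <= 5)

state = {"speed": 5.0}
for action in ("accelerate", "brake"):
    verdict = "permitted" if verifier(model, spec, state, action) else "rejected"
    print(action, verdict)
```

The separation of concerns is the main idea the sketch tries to convey: the world model, the specification, and the verifier can each be strengthened independently, which is why the paper treats them as distinct components.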
Policy Engagement
In early 2026, FAR AI was selected by the European Commission's AI Office to lead Lot 1 (CBRN Risk Modelling and Evaluation) of tender EC-CNECT/2025/OP/0032, titled "Artificial Intelligence Act: Technical Assistance for AI Safety."7 The contract has a three-year duration; the total tender value across all six lots is €9,080,000.24 FAR AI leads a consortium including SecureBio (biological threat assessment) and SaferAI (AI governance and risk modeling), with subcontractors GovAI, Nemesys Insights, and Equistamp.7 FAR AI also participates as a subcontractor on Lot 4 (Harmful Manipulation Risk).7
FAR AI launched the inaugural Technical Innovations for AI Policy Conference on May 31–June 1, 2025, convening over 150 technical experts, researchers, and policymakers in Washington, D.C.4 FAR AI's research on robustness and evaluation is relevant to ongoing AI governance discussions.
Research Questions and Uncertainties
Theoretical Questions
Several theoretical questions shape FAR AI's research direction:
- Natural Abstractions Validity: Whether intelligent systems independently converge on similar conceptual representations remains an open empirical question. The natural abstractions hypothesis has theoretical appeal but requires extensive empirical validation across diverse AI architectures and training regimes.
- Robustness-Alignment Connection: The relationship between adversarial robustness and value alignment is not fully understood. While robustness may be necessary for aligned systems, the degree to which robustness research directly contributes to solving alignment problems remains debated within the AI safety community.
- Scaling Dynamics: Whether current robustness and evaluation approaches will remain relevant as AI systems increase in capability is uncertain. Some safety researchers argue that qualitatively new challenges emerge at higher capability levels that may not be addressed by current methodologies.
Organizational Uncertainties
- Research Timeline: Academic publication timelines typically span months to years, including peer review, revision, and conference scheduling. Whether this research pace adequately matches the urgency of safety concerns depends on assessments of timelines for Transformative AI development.
- Scope Evolution: FAR AI's research focus may evolve as the field develops. The organization's emphasis on empirical robustness could shift toward other safety approaches depending on which problems prove most tractable or urgent.
- Policy Engagement: The extent of FAR AI's involvement in AI governance and policy discussions may expand beyond its current focus on technical research and convening activities.
Field-Wide Debates
| Debate | FAR AI Approach | Alternative Views |
|---|---|---|
| Value of robustness for alignment | Robustness research treated as relevant to safety | Some researchers see limited connection to core alignment |
| Natural abstractions importance | Theoretical interest in the concept | Others view the hypothesis as speculative without strong evidence |
| Academic vs. applied research | Maintains academic publication model | Some argue industry-facing applied research is more impactful |
| Benchmark limitations | Benchmark development as part of research program | Others raise fundamental Goodhart's Law concerns |
These debates reflect broader disagreements within the AI safety community about research priorities, timelines, and the relationship between different technical approaches to safety.
Funding & Sustainability
Current Funding Model
FAR AI's funding profile reflects concentration among sources within the effective altruism ecosystem. Coefficient Giving (which rebranded from Open Philanthropy in November 2025)25 is the principal funder; FAR AI's own press materials refer to the funder as "Coefficient Giving (previously Open Philanthropy)."8 Coefficient Giving has provided multiple distinct grants to FAR AI: approximately $28.675 million over three years for research team expansion, a technical internship and fellowship program, and a governance team; $6.65 million over two years for FAR.Futures (events, outreach, and field-building); $2.16 million over three years for general support; and $1.7 million over three years for the FAR.Labs coworking hub.21 A separate $12 million grant from Open Philanthropy (now Coefficient Giving) funds FAR AI's grantmaking program for external researchers.12
Additional funders announced in FAR AI's 2025 funding announcement include Schmidt Sciences, the Survival and Flourishing Fund (SFF), the Center for Security and Emerging Technology (CSET), and the AI Safety Fund (AISF) supported by the Frontier Model Forum.8 FAR AI's job postings reference total funding of over $40 million, suggesting the cumulative figure exceeds the $30M+ figure cited in the formal press release.20 Good Ventures, Coefficient Giving's partner foundation, supports FAR AI as part of its Navigating Transformative AI focus area.23
In fiscal year 2024, FAR AI reported $24.3 million in revenue and $8.6 million in expenses, per its Form 990 filed November 14, 2025.19
Revenue from for-profit AI developers is capped at a maximum of 10% of FAR AI's total annual revenue, and FAR AI commits to disclosing the fraction of revenue derived from such consulting.4 This concentration of philanthropic funding creates exposure to shifts in funder priorities; it also provides multi-year commitments that enable longer-horizon research planning. The $12 million grantmaking allocation is directed toward supporting academics and independent researchers working on critical AI safety problems.1
Criticisms and Responses
Academic Pace Concerns
Criticism: Academic publication processes operate on timelines of months to years, including peer review, revision, and conference scheduling. Critics argue this pace may be too slow given rapid AI capability advances and the urgency of safety concerns.
Response: Proponents of the peer-reviewed publication model argue it ensures research quality and credibility, and that methodology and evaluation frameworks developed through careful research provide lasting value even as specific techniques evolve. Preprint sharing and direct collaboration with AI labs can accelerate impact for time-sensitive findings.
Context: This tension between research rigor and speed affects the broader AI safety field. Different organizations make different tradeoffs between publication quality, speed, and direct industry impact.
Limited Scope Questions
Criticism: Research on adversarial robustness and evaluation may not directly address core alignment challenges like deceptive alignment, goal specification, or value learning. Critics question whether robustness research provides sufficient traction on harder alignment problems. A pointed version of this concern is that demonstrating vulnerabilities in narrow systems — such as the finding that superhuman Go AIs can be beaten by adversarial policies playing cyclic patterns that amateur humans easily defeat1 — illuminates failure modes without providing a clear path to preventing them in more capable systems.4
Response: FAR AI argues robustness is a necessary foundation for aligned systems, noting that "most alignment proposals using helper ML systems will fail if helpers are exploited by main systems."19 FAR AI's portfolio also includes value alignment work that produced more sample-efficient value learning algorithms13 and contributed to the "Towards Guaranteed Safe AI" framework co-authored with researchers including Yoshua Bengio and Stuart Russell.5
Context: The AI safety field contains diverse views on which research directions are most valuable. Some researchers emphasize near-term robustness and evaluation, while others focus on long-term theoretical alignment challenges. FAR AI's own findings that adversarial training improves language model robustness orders of magnitude more efficiently than scaling alone4 are cited as evidence that empirical robustness work can yield generalizable insights.
Natural Abstractions Theory Concerns
Criticism: The natural abstractions hypothesis lacks extensive empirical validation. Critics argue that theoretical frameworks should be grounded in experimental evidence before receiving substantial research attention.
Response: Proponents argue theoretical frameworks can productively guide empirical research programs, and that the multi-year timeline for validation is appropriate given the scope of the hypothesis.
Context: Disagreement about when to invest in theoretical versus empirical work is common in early-stage scientific fields. Different researchers make different judgments about the appropriate balance.
External Links
- FAR.AI — official website
- FAR.AI Research — publications and papers
- FAR.AI Programs — grantmaking, events, FAR.Labs
- FAR.AI Transparency — financial disclosures and policies
Footnotes
1. About | FAR.AI (https://far.ai/about)
2. Adam Gleave - AI2050 (https://ai2050.schmidtsciences.org/fellow/adam-gleave)
3. Research Overview – FAR.AI (https://far.ai/research)
4. Transparency | FAR.AI (https://far.ai/about/transparency)
5. Towards Guaranteed Safe AI | FAR.AI (https://far.ai/research/towards-guaranteed-safe-ai-a-framework-for-ensuring-robust-and-reliable-ai-systems)
6. Adversarial Policies Beat Superhuman Go AIs | FAR.AI (https://far.ai/research/adversarial-policies-beat-superhuman-go-ais)
7. FAR.AI Selected to Lead EU AI Act CBRN Risk Consortium — FAR AI official announcement, February 2026
8. FAR.AI Secures Over $30 Million in Multi-Funder Support to Scale Frontier AI Safety Research — FAR AI official press release, 2025
9. Can Go AIs be adversarially robust? – FAR.AI (https://far.ai/research/can-go-ais-be-adversarially-robust)
10. FAR.AI Robustness Research (https://far.ai/topic/robustness)
11. McKenzie, I.R., Hollinsworth, O.J., Tseng, T., Davies, X., Casper, S., Tucker, A.D., Kirk, R., Gleave, A. "STACK: Adversarial Attacks on LLM Safeguard Pipelines." arXiv:2506.24068, submitted June 30, 2025; revised February 5, 2026. (https://arxiv.org/abs/2506.24068)
12. Grantmaking | FAR.AI — FAR AI grantmaking program page
13. 2023 Alignment Research Updates – FAR.AI (https://far.ai/post/2023-12-far-research-update)
14. Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations — FAR AI blog, companion post to STACK paper
15. Karl Berzins | FAR.AI — FAR AI official profile listing Berzins as "President of FAR.AI"
16. Karl Berzins — LinkedIn profile (https://www.linkedin.com/in/karlberzins/), listing title as "Co-founder & President at FAR.AI"
17. What's new at FAR AI — EA Forum, FAR AI, December 4, 2023
18. Adam Gleave | FAR.AI (https://far.ai/author/adam-gleave)
19. Far Ai Inc - Nonprofit Explorer - ProPublica (https://projects.propublica.org/nonprofits/organizations/920692207)
20. FAR.AI Research Scientist — Careers Page — job posting describing organization as having "grown quickly to 40+ staff"
21. Open Philanthropy grant database: FAR.AI — AI Field Building (2025), August 6, 2025; and FAR.AI — AI Safety Research and Field-Building, September 28, 2025
22. Programs – FAR.AI (https://far.ai/programs)
23. Adam Gleave — Google Scholar (https://scholar.google.com/citations?user=lBunDH0AAAAJ&hl=en)
24. EU AI Act Newsletter #77: AI Office Tender, May 13, 2025 — total tender value €9,080,000 across all six lots
25. Open Philanthropy Is Now Coefficient Giving — Coefficient Giving official announcement, December 10, 2025; EA Forum post by Alexander Berger, November 18, 2025