Adversarial Policies Beat Superhuman Go AIs | FAR.AI
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: FAR AI
A key empirical result in AI safety showing that superhuman AI systems can be systematically defeated by adversarial strategies that exploit blind spots, undermining confidence that capability alone implies robustness or safety.
Metadata
Summary
FAR.AI researchers demonstrate that adversarial policies can reliably defeat superhuman Go AIs (including KataGo) by exploiting unexpected weaknesses, despite the target AI vastly outperforming the adversary in normal play. This work shows that even highly capable AI systems can harbor surprising, exploitable blind spots that humans can identify but the AI cannot defend against.
Key Points
- Adversarial policies were trained to beat KataGo, a superhuman Go AI, by exploiting systematic blind spots rather than by outplaying it conventionally.
- The adversarial policies are themselves weak players, easily beaten by human amateurs, showing that superhuman performance does not imply robustness.
- The findings challenge the assumption that high capability in a domain implies reliable safety or resistance to adversarial attack.
- The work has broad implications for AI safety: capable models may harbor hidden vulnerabilities that standard evaluation fails to surface.
- It demonstrates the importance of adversarial testing and red-teaming in AI safety evaluation pipelines.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| FAR AI | Organization | 76.0 |
Cached Content Preview
Adversarial Policies Beat Superhuman Go AIs
January 9, 2023
Tony Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
Abstract
We attack the state-of-the-art Go-playing AI system, KataGo, by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree-search, and a >77% win rate when KataGo uses enough search to be superhuman. Notably, our adversaries do not win by learning to play Go better than KataGo -- in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at [goattack.far.ai](https://goattack.far.ai).
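The recipe sketched in the abstract is to hold the victim policy fixed, treat it as part of the environment, and train an adversary whose only objective is to win. The toy sketch below illustrates that loop under heavy simplification, assuming a trivial cyclic game in place of Go and a REINFORCE-style bandit update in place of the paper's full training pipeline; every name in it is invented for illustration, and the authors' actual code is linked from goattack.far.ai.

```python
import math
import random

# Frozen "victim": a fixed stochastic policy for a cyclic game
# (rock-paper-scissors with actions 0/1/2). It is reasonable overall
# but systematically over-plays action 0 -- its exploitable blind spot.
VICTIM_WEIGHTS = [0.5, 0.3, 0.2]

def victim_move() -> int:
    return random.choices([0, 1, 2], weights=VICTIM_WEIGHTS)[0]

def reward(adv: int, vic: int) -> float:
    """+1 if the adversary wins, -1 if it loses, 0 on a tie."""
    if adv == vic:
        return 0.0
    return 1.0 if (adv - vic) % 3 == 1 else -1.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Adversary: a categorical policy trained with a REINFORCE-style update.
# The frozen victim is simply part of the environment, as in the paper.
logits = [0.0, 0.0, 0.0]
LR = 0.1
for _ in range(5000):
    probs = softmax(logits)
    action = random.choices([0, 1, 2], weights=probs)[0]
    r = reward(action, victim_move())
    # Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs.
    for i in range(3):
        logits[i] += LR * r * ((1.0 if i == action else 0.0) - probs[i])

# The adversary converges toward action 1, which beats the victim's
# over-played action 0: it exploits the blind spot rather than becoming
# a stronger player in any general sense.
print("adversary policy:", [round(p, 3) for p in softmax(logits)])
```

The same asymmetry the paper reports shows up even in this toy: the trained adversary punishes this particular victim's bias but would do no better than chance against an unbiased opponent, much as the paper's adversaries beat KataGo while losing to human amateurs.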
... (truncated, 3 KB total)