Longterm Wiki

Even Superhuman Go AIs Have Surprising Failure Modes

web

Authors

AdamGleave · EuanMcLean · Tony Wang · Kellin Pelrine · Tom Tseng · Yawen Duan · Joseph Miller · MichaelDennis

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: LessWrong

A widely-cited LessWrong post discussing empirical adversarial vulnerabilities in superhuman Go AIs, often referenced in AI safety discussions as a concrete example of capability without robustness.

Forum Post Details

Karma
131
Comments
22
Forum
lesswrong
Forum Tags
Adversarial Examples (AI) · Gaming (videogames/tabletop) · Robust Agents · AI

Metadata

Importance: 72/100 · blog post · analysis

Summary

This post examines how superhuman Go-playing AIs, despite vastly outperforming humans, can still be exploited through adversarial strategies that expose unexpected vulnerabilities. It highlights that high capability in one domain does not guarantee robustness against all possible inputs or strategies, with implications for AI safety and alignment.

Key Points

  • Superhuman Go AIs like KataGo can be defeated by human-crafted adversarial strategies that exploit systematic blind spots.
  • High performance on standard benchmarks does not imply robustness against out-of-distribution or adversarial inputs.
  • These failure modes are surprising because the AIs are far stronger than the humans exploiting them, suggesting capability ≠ robustness.
  • The findings raise concerns about over-trusting AI systems in high-stakes domains based solely on benchmark performance.
  • This serves as a concrete, empirical case study relevant to broader AI safety concerns about misaligned or brittle AI behavior.

Cited by 1 page

Page | Type | Quality
Deep Learning Revolution Era | Historical | 44.0

Cached Content Preview

HTTP 200 · Fetched Feb 22, 2026 · 61 KB

 Even Superhuman Go AIs Have Surprising Failure Modes 

by AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, MichaelDennis · 20th Jul 2023 · AI Alignment Forum · Linkpost from far.ai · 12 min read · 22 comments · 130 karma · Ω 34

In March 2016, AlphaGo defeated the Go world champion Lee Sedol, winning four games to one. Machines had finally become superhuman at Go. Since then, Go-playing AI has only grown stronger. The supremacy of AI over humans seemed assured, with Lee Sedol commenting that it is an "entity that cannot be defeated". But in 2022, amateur Go player Kellin Pelrine defeated KataGo, a Go program that is even stronger than AlphaGo. How?

It turns out that even superhuman AIs have blind spots and can be tripped up by surprisingly simple tricks. In our new paper, we developed a way to automatically find vulnerabilities in a "victim" AI system by training an adversary AI system to beat the victim. With this approach, we found that KataGo systematically misevaluates large cyclically connected groups of stones. We also found that other superhuman Go bots, including ELF OpenGo, Leela Zero and Fine Art, suffer from a similar blind spot. Although such positions rarely occur in human games, they can be reliably created by executing a straightforward strategy. Indeed, the strategy is simple enough that you can teach it to a human, who can then defeat these Go bots unaided.

 

 The victim and adversary take turns playing a game of Go. The adversary is able to sample moves the victim is likely to take, but otherwise has no special powers, and can only play legal Go moves. 

Our AI system (that we call the adversary) can beat a superhuman version of KataGo in 94 out of 100 games, despite requiring only 8% of the computational power used to train that version of KataGo. We found two separate exploits: one where the adversary tricks KataGo into passing prematurely, and another that involves coaxing KataGo into confidently building an unsafe circular group that can be captured. Go enthusiasts can read an analysis of these games on the project website.
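The attack setup pictured above is easy to describe in code. The following is a minimal sketch, assuming hypothetical `env`, `victim`, and `adversary` objects; it is not the paper's implementation, but it captures the rule that the adversary's only special power is being able to sample the frozen victim's likely replies.

```python
# Minimal sketch (assumed interfaces, not the paper's actual code) of the
# victim-vs-adversary game loop described above: a frozen victim policy and
# a trainable adversary alternate moves, and the adversary may query the
# victim for its likely replies when choosing its own move.

def play_exploit_game(env, victim, adversary, adversary_plays_black=True):
    """Play one game of Go; return the winner, used as the adversary's reward."""
    state = env.reset()
    adversary_to_move = adversary_plays_black
    while not env.is_terminal(state):
        if adversary_to_move:
            # e.g. inside a search over candidate moves, model the opponent's
            # turns by sampling moves from the frozen victim network.
            move = adversary.select_move(state, opponent_model=victim)
        else:
            # The victim plays exactly as it normally would; it has no
            # knowledge that it is being attacked.
            move = victim.select_move(state)
        state = env.step(state, move)  # only legal Go moves are allowed
        adversary_to_move = not adversary_to_move
    return env.winner(state)
```

Training then amounts to playing many such games against the frozen victim and reinforcing the adversary toward moves that lead to wins, which is how blind spots like the cyclic-group misjudgment can be surfaced automatically.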

 Our results also give some general lessons about AI outside of Go. Many AI systems, from image classifiers to natural language processing systems, are vulnerable to adversarial inputs: seemingly innocuous changes such as adding imperceptible static to an image or a distractor sentence to a paragraph can crater the performance of AI systems while not affecting humans. Some have assumed that these vulnerabilities will go away when AI systems get capable enough—and that superhuman AIs will always be wise to such attacks. We’ve shown that this isn’t necessarily the case: systems can simultaneously surpass

... (truncated, 61 KB total)
Resource ID: dd0f6ff1b958bd49 | Stable ID: YTQ1MzQ2NG