Scaling Laws For Scalable Oversight
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Proposes a quantitative framework for understanding scalable oversight, a key AI safety strategy where weaker systems supervise stronger ones, modeling oversight success as a function of capability differences.
Paper Details
Metadata
Abstract
Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability. We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. We also apply our theory to our four oversight games, where we find that NSO success rates at a general Elo gap of 400 are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
Summary
This paper addresses a critical gap in AI safety by developing a framework to quantify how well scalable oversight—where weaker AI systems supervise stronger ones—actually scales. The authors model oversight as a game between capability-mismatched players with Elo-based scoring functions, validate their framework on Nim and four oversight games (Mafia, Debate, Backdoor Code, Wargames), and derive scaling laws for oversight success. They then analyze Nested Scalable Oversight (NSO), where trusted models progressively oversee stronger untrusted models, identifying conditions for success and optimal oversight levels. Their results show NSO success rates vary significantly by task (13.5% for Mafia to 51.7% for Debate at a 400-point Elo gap) and decline substantially when overseeing even stronger systems.
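The framework's core quantitative pieces are standard Elo win probabilities combined with a piecewise-linear map from general capability to oversight-specific capability. A minimal sketch of both is below; the cutoffs, slope, and plateau values are illustrative placeholders, not the paper's fitted parameters.

```python
def elo_win_prob(r_overseer, r_overseen):
    """Standard Elo expected score: probability the overseer wins the
    oversight game against the overseen system."""
    return 1.0 / (1.0 + 10 ** ((r_overseen - r_overseer) / 400.0))

def domain_elo(general_elo, low=800.0, high=1400.0, slope=1.5,
               floor=0.0, ceiling=900.0):
    """Piecewise-linear map from general Elo to an oversight-specific Elo:
    flat below `low` (task incompetence plateau), flat above `high`
    (task saturation plateau), linear in between. All parameter values
    here are hypothetical, chosen only to illustrate the shape."""
    if general_elo <= low:
        return floor
    if general_elo >= high:
        return ceiling
    return floor + slope * (general_elo - low)

# Example: an overseer at general Elo 1200 overseeing a system at 1300.
p = elo_win_prob(domain_elo(1200.0), domain_elo(1300.0))
```

The two plateaus matter because they break the usual Elo symmetry: once both players sit on a plateau, further gains in general capability stop translating into domain advantage, which is what makes oversight games behave differently from ordinary rating curves.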
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Alignment Robustness Trajectory Model | Analysis | 64.0 |
Cached Content Preview
# Scaling Laws For Scalable Oversight
Joshua Engels* (MIT, jengels@mit.edu)
David D. Baek* (MIT, dbaek@mit.edu)
Subhash Kantamneni* (MIT, subhashk@mit.edu)
Max Tegmark (MIT, tegmark@mit.edu)

*Equal contribution
###### Abstract
Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales.
To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen.
Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific and deception-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code, and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability (using Chatbot Arena Elo as a proxy for general capability). We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success.
In our numerical examples, the NSO success rate is below 52% when overseeing systems that are 400 Elo points stronger than the baseline overseer, and it declines further for overseeing even stronger systems.
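The NSO analysis asks how to split a total capability gap into bootstrapping steps, where every step must succeed. A minimal sketch under a simplifying assumption: each step's win probability follows the plain Elo curve on that step's share of the gap (the paper instead uses game-specific fitted curves, which can change the optimum).

```python
def step_win_prob(elo_gap):
    """Overseer win probability against a system `elo_gap` points stronger,
    under the plain Elo curve (a simplification of the paper's fitted,
    game-specific curves)."""
    return 1.0 / (1.0 + 10 ** (elo_gap / 400.0))

def nso_success(total_gap, n_steps):
    """Probability that all steps succeed when the total Elo gap is split
    evenly across n_steps nested oversight steps."""
    return step_win_prob(total_gap / n_steps) ** n_steps

def optimal_steps(total_gap, max_steps=50):
    """Numerically pick the step count maximizing overall success."""
    return max(range(1, max_steps + 1),
               key=lambda n: nso_success(total_gap, n))
```

Note that under the plain Elo curve a single step is always optimal (splitting the gap multiplies sub-unity probabilities faster than it raises them); the paper's plateaued, game-specific curves are what can make multiple oversight levels worthwhile.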
## 1 Introduction
Many frontier AI companies are rapidly advancing toward their stated goal of building artificial general intelligence (AGI) and beyond. This has intensified interest in techniques for ensuring that such systems remain controllable and behave in beneficial ways.
One major cluster of such techniques includes Recursive Reward Modeling (Leike et al., [2018](https://ar5iv.labs.arxiv.org/html/2504.18530#bib.bib35 "")), Iterated Amplification (Christiano et al., [2018](https://ar5iv.labs.arxiv.org/html/2504.18530#bib.bib12 "")), Scalable Oversight (Bowman et al., [2022](https://ar5iv.labs.arxiv.org/html/2504.18530#bib.bib5 "")), Weak-to-Strong Generalization (Burns et al., [2023](https://ar5iv.labs.arxiv.org/html/2504.18530#bib.bib7 "")), Hierarchical Supervision (Shah et al., [2025](https://ar5iv.labs.arxiv.org/html/2504.18530#bib.bib43 "")), and Recursive Oversight (Anthropic Alignment Science Team, [2025](https://ar5iv.labs.arxiv.org/html/2504.18530#bib.bib3 "")). These methods share a central goal: enabling weaker systems to oversee st
... (truncated, 98 KB total)