Longterm Wiki

Scaling Laws For Scalable Oversight

paper

Authors

Joshua Engels·David D. Baek·Subhash Kantamneni·Max Tegmark

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A technical paper directly relevant to the scalable oversight research agenda; useful for researchers evaluating which alignment techniques are likely to remain effective as AI capabilities scale.

Paper Details

Citations: 5 (0 influential)
Year: 2025

Metadata

Importance: 78/100 · arXiv preprint · primary source

Abstract

Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability. We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. We also apply our theory to our four oversight games, where we find that NSO success rates at a general Elo gap of 400 are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
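
The Elo framing in the abstract can be illustrated with a minimal sketch. It assumes the standard Elo expected-score formula and a simple clamped linear map from general to oversight-specific Elo; the function names, the knee and plateau parameters, and the rule that a nested chain succeeds only if every step succeeds independently are illustrative assumptions, not the paper's fitted model.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


def domain_elo(general_elo: float, low_knee: float, high_knee: float,
               floor_elo: float, ceiling_elo: float) -> float:
    """Piecewise-linear map from general Elo to task-specific Elo.

    Hypothetical parameterization: below `low_knee` the player sits on the
    task-incompetence plateau (`floor_elo`); above `high_knee` it sits on the
    task-saturation plateau (`ceiling_elo`); in between it rises linearly.
    """
    if general_elo <= low_knee:
        return floor_elo
    if general_elo >= high_knee:
        return ceiling_elo
    slope = (ceiling_elo - floor_elo) / (high_knee - low_knee)
    return floor_elo + slope * (general_elo - low_knee)


def chained_success(step_probabilities: list[float]) -> float:
    """Probability that every oversight step succeeds.

    Assumption for illustration only: steps are independent and the
    whole chain fails if any single step fails.
    """
    result = 1.0
    for p in step_probabilities:
        result *= p
    return result


if __name__ == "__main__":
    # Made-up numbers: an overseer and a stronger overseen system.
    overseer = domain_elo(1000, low_knee=800, high_knee=1600,
                          floor_elo=900, ceiling_elo=1300)
    overseen = domain_elo(1400, low_knee=800, high_knee=1600,
                          floor_elo=900, ceiling_elo=1300)
    print(elo_win_probability(overseer, overseen))
```

With these made-up parameters the 400-point general Elo gap shrinks to a 200-point task-specific gap, which shows how the slope of the linear region controls how quickly the overseer falls behind.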

Summary

This paper proposes a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and of the system being overseen. It models oversight as a game between capability-mismatched players with oversight-specific Elo scores, validates the model on a modified version of Nim, fits scaling laws for four oversight games (Mafia, Debate, Backdoor Code, and Wargames), and uses them in a theoretical study of Nested Scalable Oversight (NSO).

Key Points

  • Models oversight as a game between capability-mismatched players whose oversight-specific Elo scores are a piecewise-linear function of general intelligence, with two plateaus corresponding to task incompetence and task saturation.
  • Validates the framework on a modified version of Nim, then fits scaling laws for four oversight games: Mafia, Debate, Backdoor Code, and Wargames.
  • Studies Nested Scalable Oversight (NSO), in which trusted models oversee stronger untrusted models that then become the trusted models in the next step, and identifies conditions under which NSO succeeds.
  • Derives numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success.
  • Reports NSO success rates at a general Elo gap of 400 of 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.

Cited by 2 pages

Page | Type | Quality
Why Alignment Might Be Hard | Argument | 69.0
Scalable Oversight | Research Area | 68.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
[License: CC BY 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)

arXiv:2504.18530v2 \[cs.AI\] 09 May 2025

# Scaling Laws For Scalable Oversight

Joshua Engels

MIT

jengels@mit.edu

David D. Baek¹

MIT

dbaek@mit.edu

Subhash Kantamneni¹

MIT

subhashk@mit.edu

Max Tegmark

MIT

tegmark@mit.edu

¹ Equal contribution

###### Abstract

Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales.
To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen.
Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability. We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. We also apply our theory to our four oversight games, where we find that NSO success rates at a general Elo gap of 400 are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.

## 1 Introduction

Many frontier AI companies are rapidly advancing toward their stated goal of building artificial general intelligence (AGI) and beyond. This has intensified interest in techniques for ensuring that such systems remain contro

... (truncated, 98 KB total)
Resource ID: 08c92819cc0fc2dd | Stable ID: NmZhMjNiMT