Longterm Wiki

Scalable Oversight via Recursive Self-Critiquing

paper

Authors

Xueru Wen·Jie Lou·Xinyu Lu·Junjie Yang·Yanjiang Liu·Yaojie Lu·Debing Zhang·Xing Yu

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to the scalable oversight problem: how to supervise AI systems whose outputs humans cannot reliably evaluate directly; this paper offers a recursive critique mechanism as a concrete technical proposal, published at ICML.

Paper Details

Citations: 4 (0 influential)
Year: 2025

Metadata

Importance: 72/100 · arXiv preprint · primary source

Abstract

As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques, including SFT and RLHF, face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become untenable when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) *Critique of critique can be easier than critique itself*, extending the widely accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation; (2) *This difficulty relationship holds recursively*, suggesting that when direct evaluation is infeasible, performing higher-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. We further conduct Human-AI and AI-AI experiments to investigate the potential of utilizing recursive self-critiquing for AI supervision. Our results highlight recursive critique as a promising approach for scalable AI oversight.

Summary

This paper proposes recursive self-critiquing as a scalable oversight mechanism for superhuman AI, arguing that critiquing a critique is easier than direct evaluation—analogous to verification being easier than generation. Human-Human, Human-AI, and AI-AI experiments support the hypothesis that higher-order critiques provide progressively more tractable supervision pathways when direct human oversight becomes infeasible.

Key Points

  • Extends the verification-easier-than-generation principle to the critique domain: critiquing a critique is more tractable than direct evaluation of complex outputs.
  • Proposes two core hypotheses: (1) critique of critique is easier than critique itself, and (2) this difficulty reduction holds recursively for higher-order critiques.
  • Conducts Human-Human, Human-AI, and AI-AI experiments to empirically validate recursive self-critiquing as a supervision mechanism.
  • Addresses a fundamental limitation of SFT and RLHF: both fail when AI outputs exceed human cognitive thresholds for reliable assessment.
  • Presents recursive self-critiquing as a concrete pathway toward scalable alignment for systems operating beyond human-level competence.
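
The recursive mechanism summarized in the points above can be sketched in a few lines of Python. `query_model`, the prompt format, and the loop structure are illustrative assumptions of ours, not the paper's implementation; a real setup would call an actual LLM at each level.

```python
def query_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; stubbed so the
    # control flow below is runnable as-is.
    return f"[critique of: {prompt[:40]}...]"

def recursive_critique(task: str, answer: str, depth: int) -> list[str]:
    """Build a critique chain: level 1 critiques the answer,
    and level k critiques the level k-1 critique."""
    chain = []
    target = answer
    for level in range(1, depth + 1):
        prompt = f"Task: {task}\nCritique this (level {level}):\n{target}"
        chain.append(query_model(prompt))
        target = chain[-1]  # the next level critiques this critique
    return chain

chain = recursive_critique("Evaluate the proof", "Candidate proof ...", depth=3)
# Under the paper's hypothesis, an overseer only needs to judge
# chain[-1], the highest-order critique, which is more tractable
# than judging the answer directly.
```

The property the paper tests is that judging each successive level is easier than the one below it, so the overseer's burden shrinks as the critique order grows.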

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI Alignment | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
[License: CC BY-SA 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)

arXiv:2502.04675v4 \[cs.AI\] 15 Jan 2026

# Scalable Oversight for Superhuman AI via Recursive Self-Critiquing

Xueru Wen
Jie Lou
Xinyu Lu
Junjie Yang
Yanjiang Liu
Yaojie Lu
Debing Zhang
Xing Yu

###### Abstract

As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques, including SFT and RLHF, face fundamental challenges in ensuring reliable oversight.
These methods rely on direct human assessment and become impractical when AI outputs exceed human cognitive thresholds.
In response to this challenge, we explore two hypotheses:
(1) Critique of critique can be easier than critique itself, extending the widely-accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation;
(2) This difficulty relationship holds recursively, suggesting that when direct evaluation is infeasible, performing higher-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway.
We conduct Human-Human, Human-AI, and AI-AI experiments to investigate the potential of recursive self-critiquing for AI supervision.
Our results highlight recursive critique as a promising approach for scalable AI oversight.

Machine Learning, ICML

## 1 Introduction

Supervision signals are fundamental to AI alignment (Bowman et al., [2022](https://arxiv.org/html/2502.04675v4#bib.bib31 "Measuring progress on scalable oversight for large language models")), providing the ground truth or preference data necessary to train models that behave in accordance with human expectations.
The nature and accessibility of these supervision signals, however, vary substantially across different application domains.
From a supervision acquisition perspective, tasks can be categorized into two types:
(1) tasks with well-defined criteria, where ground truth can be deterministically obtained with low computational overhead, e.g., Go games and mathematical problems (Silver et al., [2017](https://arxiv.org/html/2502.04675v4#bib.bib1 "Mastering chess and shogi by self-play with a general reinforcement learning algorithm"); Lightman et al., [2023](https://arxiv.org/html/2502.04675v4#bib.bib2 "Let’s verify step by step"));
(2) tasks involving subjectivity or complex evaluation frameworks, such as business strategy and product design (Ouyang et al., [2022](https://arxiv.org/html/2502.04675v4#bib.bib3 "Training language models to follow instructions with human feedback")).
The latter type is more prevalent in real-world applications and predominantly relies on human assessment, presenting a fundamental challenge.
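
The distinction between the two task types can be made concrete with a toy checker. The example below (integer factorization) is ours, not the paper's; it illustrates a well-defined criterion where a candidate answer is cheap to verify even when producing it is hard.

```python
def is_valid_factorization(n: int, factors: list[int]) -> bool:
    # Well-defined criterion: a proposed answer is checked
    # deterministically and cheaply, even though *finding* the
    # factors may be hard for large n.
    product = 1
    for f in factors:
        product *= f
    return product == n and all(f > 1 for f in factors)

assert is_valid_factorization(91, [7, 13])       # verification is trivial
assert not is_valid_factorization(91, [3, 31])
```

Tasks of the second type (e.g., judging a business strategy) admit no such deterministic checker, which is why they fall back on human assessment.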

Current alignment techniques, particularly

... (truncated, 98 KB total)