Longterm Wiki

Scalable Oversight | AI Alignment


Part of an alignment survey learning curriculum; a useful introductory overview of scalable oversight for those new to the technical AI safety field, connecting foundational concepts like debate and amplification.

Metadata

Importance: 65/100 · wiki page · educational

Summary

This educational resource covers scalable oversight as a key approach to AI alignment, addressing how humans can effectively supervise AI systems that may surpass human capabilities in certain domains. It explores techniques like debate, amplification, and recursive reward modeling to maintain meaningful human control as AI systems scale.

Key Points

  • Scalable oversight addresses the core challenge of supervising AI systems whose outputs or reasoning humans cannot directly evaluate with confidence.
  • Key techniques include debate (AI systems argue positions for human judgment), iterated amplification, and recursive reward modeling.
  • The problem becomes critical as AI systems grow more capable, since naive human feedback may be insufficient for complex or superhuman tasks.
  • Scalable oversight research aims to ensure training signals remain aligned with human values even when humans cannot directly verify correctness.
  • Part of a broader AI alignment survey curriculum, situating scalable oversight within the landscape of technical safety research.
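The debate technique listed above can be sketched as a simple protocol. This is a minimal illustration, not the actual training setup from the debate literature; the `debater` and `judge` callables are hypothetical stand-ins for trained models and a human judge.

```python
# Minimal sketch of the debate protocol: two AI debaters argue opposing
# answers over several rounds, and a judge picks the more convincing side.
# All components here are toy stand-ins, not a real training setup.

def debate(question, answer_a, answer_b, debater, judge, rounds=3):
    """Run a fixed number of argument rounds, then ask the judge for a verdict."""
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append(f"A ({answer_a}): " + debater(question, answer_a, transcript))
        transcript.append(f"B ({answer_b}): " + debater(question, answer_b, transcript))
    return judge(transcript)  # returns "A" or "B"

# Toy stand-ins: debaters emit canned arguments; the judge prefers
# whichever side produced the longer (more detailed) case.
def toy_debater(question, answer, transcript):
    return f"evidence for {answer}" * (len(answer) % 3 + 1)

def toy_judge(transcript):
    a_len = sum(len(line) for line in transcript if line.startswith("A"))
    b_len = sum(len(line) for line in transcript if line.startswith("B"))
    return "A" if a_len >= b_len else "B"

winner = debate("Is 7 prime?", "yes", "no", toy_debater, toy_judge)
```

The point of the protocol is that judging a debate transcript is meant to be easier for the human than evaluating the answer directly.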

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| RLHF | Research Area | 63.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 32 KB
Learning from Feedback


# Scalable Oversight

Scalable oversight seeks to ensure that AI systems, even those surpassing human expertise, remain aligned with human intent.

We are here in the curriculum's map of this topic:

- [Reinforcement Learning from Feedback](https://alignmentsurvey.com/materials/learning/scalable/#reinforcement-learning-from-feedback)
- [Scalable Oversight](https://alignmentsurvey.com/materials/learning/scalable)
- [Iterated Distillation and Amplification](https://alignmentsurvey.com/materials/learning/scalable/#iterated-distillation-and-amplification)
- [Recursive Reward Modeling](https://alignmentsurvey.com/materials/learning/scalable/#recursive-reward-modeling)
- [Debate](https://alignmentsurvey.com/materials/learning/scalable/#debate)
- [CIRL: Cooperative Inverse Reinforcement Learning](https://alignmentsurvey.com/materials/learning/scalable/#cirl-cooperative-inverse-reinforcement-learning)
- [Feedback](https://alignmentsurvey.com/materials/learning/feedback)

Related: Policy Learning · [Preference Modeling](https://alignmentsurvey.com/materials/learning/preference)

## Reinforcement Learning from Feedback

We propose RLxF as a naive form of scalable oversight: extending RLHF with AI components to improve the efficiency and quality of alignment between humans and AI systems.

### Reinforcement Learning from AI Feedback

Reinforcement Learning from AI Feedback (RLAIF) extends [Reinforcement Learning from Human Feedback (RLHF)](https://alignmentsurvey.com/materials/learning/policy/#reinforcement-learning-from-human-feedback) by using an AI model, rather than human annotators, to provide the feedback signal.
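The core substitution can be sketched as follows. This is a toy illustration of the idea, assuming the rest of the RLHF pipeline is unchanged; `ai_score` is a hypothetical stand-in for a real labeler model's judgment, not an API from any library.

```python
# Hedged sketch of the RLAIF idea: an AI "labeler", rather than a human,
# compares pairs of responses, and the resulting preference labels feed
# the same reward-model training step used in RLHF.

def ai_preference_label(prompt, response_a, response_b, ai_score):
    """Return 'A' if the AI labeler prefers response_a for this prompt, else 'B'."""
    return "A" if ai_score(prompt, response_a) >= ai_score(prompt, response_b) else "B"

# Toy labeler: penalizes a banned word, otherwise mildly rewards detail.
def toy_score(prompt, response):
    return -1.0 if "unsafe" in response else 0.01 * len(response)

label = ai_preference_label(
    "How do I secure my account?",
    "Use a strong unique password and enable 2FA.",
    "Do something unsafe.",
    toy_score,
)
# label == "A": the safer, more detailed response is preferred
```

In practice the labeler's judgments are elicited by prompting a language model, and the labels train a reward model exactly as human comparisons would in RLHF.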

![](https://github.com/PKU-Alignment/omnisafe/assets/108712610/9cad02fa-bd2a-4477-baf2-17e5a48714f6)**We show the basic steps of our Constitutional AI (CAI) process, which consists of both a supervised learning (SL) stage, consisting of the steps at the top, and a Reinforcement Learning (RL) stage, shown as the sequence of steps at the bottom of the figure. Both the critiques and the AI feedback are steered by a small set of principles drawn from a ‘constitution’. The supervised stage significantly improves the initial model, and gives some control over the initial behavior at the start of the RL phase, addressing potential exploration problems. The RL stage significantly improves performance and reliability.**

Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
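The supervised critique-and-revision stage described in the caption above can be sketched as a loop over constitutional principles. This is a simplified illustration of the process, not the paper's implementation; the `model` callable is a hypothetical stand-in, whereas in CAI each step is a prompt to a large language model.

```python
# Minimal sketch of the CAI supervised stage: sample a response, critique
# it against each constitutional principle, then revise it accordingly.

CONSTITUTION = [
    "Identify ways the response is harmful or unethical.",
    "Identify ways the response is dishonest or misleading.",
]

def critique_and_revise(prompt, response, model, principles=CONSTITUTION):
    """Apply one critique/revision pass per principle; return the final revision."""
    for principle in principles:
        critique = model(f"{principle}\nResponse: {response}")
        response = model(
            f"Revise the response to address this critique:\n{critique}\nResponse: {response}"
        )
    return response

# Toy model: flags the word 'dangerous' when critiquing, strips it when revising.
def toy_model(text):
    if text.startswith("Revise"):
        return text.rsplit("Response: ", 1)[1].replace("dangerous ", "")
    return "contains 'dangerous'" if "dangerous" in text else "no issue"

out = critique_and_revise("Tell me a fact.", "Here is a dangerous idea.", toy_model)
```

The revised responses then serve as supervised fine-tuning targets, before the RL stage uses AI feedback in place of human preference labels.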

**Recommended Papers List**

- [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/pdf/2212.08073.pdf)

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for traini

... (truncated, 32 KB total)
Resource ID: be45866c43f2be82 | Stable ID: NjViZWNjNz