Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A research paper addressing superalignment through weak-to-strong generalization, proposing scalable oversight methods to ensure AI systems remain aligned with human values as they become superhuman.
Paper Details
Metadata
Abstract
This paper presents a follow-up study to OpenAI's recent superalignment work on Weak-to-Strong Generalization (W2SG). Superalignment focuses on ensuring that high-level AI systems remain consistent with human values and intentions when dealing with complex, high-risk tasks. The W2SG framework has opened new possibilities for empirical research in this evolving field. Our study simulates two phases of superalignment under the W2SG framework: the development of general superhuman models and the progression towards superintelligence. In the first phase, based on human supervision, the quality of weak supervision is enhanced through a combination of scalable oversight and ensemble learning, reducing the capability gap between weak teachers and strong students. In the second phase, an automatic alignment evaluator is employed as the weak supervisor. By recursively updating this auto aligner, the capabilities of the weak teacher models are synchronously enhanced, achieving weak-to-strong supervision over stronger student models. We also provide an initial validation of the proposed approach for the first phase. Using the SciQ task as an example, we explore ensemble learning for weak teacher models through bagging and boosting. Scalable oversight is explored through two auxiliary settings: human-AI interaction and AI-AI debate. Additionally, the paper discusses the impact of improved weak supervision on enhancing weak-to-strong generalization based on in-context learning. Experiment code and dataset will be released at https://github.com/ADaM-BJTU/W2SG.
Summary
This paper extends OpenAI's Weak-to-Strong Generalization (W2SG) framework for superalignment by proposing methods to improve weak supervision across two phases: developing superhuman models and progressing toward superintelligence. The authors enhance weak supervision quality through scalable oversight techniques (human-AI interaction and AI-AI debate) combined with ensemble learning, reducing the capability gap between weak teachers and strong students. In the second phase, they employ an automatic alignment evaluator as a weak supervisor that recursively updates to maintain alignment as student models become stronger. Initial validation on the SciQ task demonstrates the effectiveness of ensemble methods and scalable oversight approaches.
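The paper's exact bagging procedure is not reproduced on this page; as a minimal sketch of the general idea (assuming each weak teacher emits hard labels for the same set of questions), a majority vote can combine several weak teachers into a single, higher-quality weak supervision signal. The function name and the toy data below are illustrative, not from the paper:

```python
from collections import Counter

def bag_weak_labels(weak_predictions):
    """Majority-vote over labels from several weak teacher models.

    weak_predictions: a list of label lists, one per weak teacher,
    all aligned to the same examples.
    """
    ensembled = []
    for labels in zip(*weak_predictions):
        vote, _count = Counter(labels).most_common(1)[0]
        ensembled.append(vote)
    return ensembled

# Three hypothetical weak teachers labeling four SciQ-style binary questions:
teachers = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(bag_weak_labels(teachers))  # → [1, 0, 1, 1]
```

Boosting would instead reweight or resample examples the current ensemble handles poorly; that requires a held-out quality signal and is not sketched here.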
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Alignment | Approach | 91.0 |
Cached Content Preview
# Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning
Jitao Sang (these authors contributed equally to this work; corresponding author: jtsang@bjtu.edu.cn)
Department of Computer Science
Beijing Jiaotong University
Beijing, China 100044
Yuhang Wang (these authors contributed equally to this work)
Department of Computer Science
Beijing Jiaotong University
Beijing, China 100044
Jing Zhang Department of Computer Science
Beijing Jiaotong University
Beijing, China 100044
Yanxu Zhu Department of Computer Science
Beijing Jiaotong University
Beijing, China 100044
Chao Kong Department of Computer Science
Beijing Jiaotong University
Beijing, China 100044
Junhong Ye Department of Computer Science
Beijing Jiaotong University
Beijing, China 100044
Shuyu Wei Department of Computer Science
Beijing Jiaotong University
Beijing, China 100044
Jinlin Xiao Department of Computer Science
Beijing Jiaotong University
Beijing, China 100044
###### Abstract
This paper presents a follow-up study to OpenAI’s recent superalignment work on Weak-to-Strong Generalization (W2SG). Superalignment focuses on ensuring that high-level AI systems remain consistent with human values and intentions when dealing with complex, high-risk tasks. The W2SG framework has opened new possibilities for empirical research in this evolving field.
Our study simulates two phases of superalignment under the W2SG framework: the development of general superhuman models and the progression towards superintelligence. In the first phase, based on human supervision, the quality of weak supervision is enhanced through a combination of scalable oversight and ensemble learning, reducing the capability gap between weak teachers and strong students. In the second phase, an automatic alignment evaluator is employed as the weak supervisor. By recursively updating this auto aligner, the capabilities of the weak teacher models are synchronously enhanced, achieving weak-to-strong supervision over stronger student models.
We also provide an initial validation of the proposed approach for the first phase. Using the SciQ task as an example, we explore ensemble learning for weak teacher models through bagging and boosting. Scalable oversight is explored through two auxiliary settings: human-AI interaction and AI-AI debate. Additionally, the paper discusses the impact of improved weak supervision on enhancing weak-to-strong generalization based on in-context learning.
Experiment code and dataset will be released at [https://github.com/ADaM-BJTU/W2SG](https://github.com/ADaM-BJTU/W2SG).
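The core W2SG setup the abstract describes (a weak teacher labels data, and a stronger student is then trained on those possibly noisy labels) can be sketched schematically. This is not the authors' code; the model classes below are toy stand-ins chosen only to make the control flow concrete:

```python
class ThresholdTeacher:
    """Toy stand-in for a weak teacher: labels x as 1 if x >= threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return int(x >= self.threshold)


class MemorizingStudent:
    """Toy stand-in for a strong student: memorizes its training labels."""
    def __init__(self):
        self.table = {}

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))

    def predict(self, x):
        return self.table[x]


def weak_to_strong(weak_teacher, strong_student, unlabeled_inputs):
    """Train the strong student on labels produced by the weak teacher."""
    weak_labels = [weak_teacher.predict(x) for x in unlabeled_inputs]
    strong_student.fit(unlabeled_inputs, weak_labels)
    return strong_student


teacher = ThresholdTeacher(threshold=5)
student = weak_to_strong(teacher, MemorizingStudent(), [1, 4, 6, 9])
print([student.predict(x) for x in [1, 4, 6, 9]])  # → [0, 0, 1, 1]
```

In the paper's framing, the interesting question is whether the student generalizes beyond the teacher's errors; this toy student, by construction, only inherits them.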
## 1 Introduction
Reinforcement Learning from Human Feedback (RLHF) is today the pivotal approach to aligning powerful AI models with human intentions and values. It has demonstrated considerable success in aligning advanced language models such as GPT-3.5, GPT-4, and Llama2. RLHF is underpinned by the principle of rewarding behaviors highly regarded by humans while penalizing those deemed inferior, as outlined in se
... (truncated, 76 KB total)