[2407.04622] On scalable oversight with weak LLMs judging strong LLMs
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Empirical study of scalable oversight methods (debate and consultancy) using LLMs to evaluate whether weaker AI judges can effectively oversee stronger AI agents, directly addressing a key challenge in AI safety governance.
Paper Details
Metadata
Abstract
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
Summary
This paper evaluates debate and consultancy as scalable oversight protocols for supervising superhuman AI systems. Using LLMs as both AI agents and judges, the researchers benchmark these approaches across diverse tasks including extractive QA, mathematics, coding, logic, and multimodal reasoning. They find that debate generally outperforms consultancy when debaters are randomly assigned positions, and that debate improves judge accuracy in information-asymmetric tasks. However, results are mixed when comparing debate to direct question-answering in tasks without information asymmetry, and stronger debater models show only modest improvements in judge accuracy.
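To make the compared protocols concrete, here is a minimal sketch of the debate protocol in Python. This is not the paper's implementation: the `ask_llm` helper, the prompt wording, and the two-round structure are illustrative assumptions.

```python
# Minimal sketch of LLM debate with a weaker judge (illustrative only).
# `ask_llm(model, prompt)` is a hypothetical stand-in for any chat API.

def ask_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def run_debate(question: str, answers: tuple[str, str],
               debater_model: str, judge_model: str,
               num_rounds: int = 2) -> str:
    """Two debaters argue for opposite answers; a weaker judge picks one."""
    transcript: list[str] = []
    for rnd in range(num_rounds):
        for side, answer in enumerate(answers):
            argument = ask_llm(
                debater_model,
                f"Question: {question}\n"
                f"Argue that the answer is: {answer}\n"
                "Transcript so far:\n" + "\n".join(transcript),
            )
            transcript.append(f"Debater {side}, round {rnd}: {argument}")
    verdict = ask_llm(
        judge_model,
        f"Question: {question}\n"
        f"A) {answers[0]}\nB) {answers[1]}\n"
        "Debate transcript:\n" + "\n".join(transcript) +
        "\nAnswer with A or B only.",
    )
    return answers[0] if verdict.strip().upper().startswith("A") else answers[1]
```

Consultancy differs in that a single model argues for one answer while the judge asks follow-up questions, and direct question-answering drops the agent entirely, prompting the judge with the question alone.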
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI Accident Risk Cruxes | Crux | 67.0 |
| AI Safety via Debate | Approach | 70.0 |
Cached Content Preview
Corresponding author: zkenton@google.com
# On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton†, Noah Y. Siegel†, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah (all Google DeepMind; † equal contributions)
###### Abstract
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI.
In this paper we study _debate_, where two AIs compete to convince a judge; _consultancy_,
where a single AI tries to convince a judge that asks questions;
and compare to a baseline of _direct question-answering_, where the judge just answers outright without the AI.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries.
We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed.
Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy.
Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.
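As a rough illustration of the scoring implied by the assigned-consultancy condition above, the following sketch gives the consultant the correct or incorrect answer with probability 50/50 and averages judge accuracy over that coin flip. The `run_consultancy` callable and the data format are assumptions for illustration, not the paper's evaluation code.

```python
import random

def consultancy_accuracy(examples, run_consultancy, seed: int = 0) -> float:
    """Judge accuracy under randomly assigned consultancy.

    `examples` is an iterable of (question, (answer_a, answer_b), true_answer).
    `run_consultancy(question, options, assigned)` is a hypothetical callable
    that runs the consultant/judge dialogue and returns the judge's choice.
    """
    rng = random.Random(seed)
    correct = total = 0
    for question, options, true_answer in examples:
        wrong_answer = next(o for o in options if o != true_answer)
        assigned = rng.choice([true_answer, wrong_answer])  # 50/50 assignment
        judged = run_consultancy(question, options, assigned)
        correct += judged == true_answer
        total += 1
    return correct / total
```

In the open variants, the agent instead chooses which answer to argue for, so the relevant statistic becomes how often the judge is convinced of a wrong answer the agent elected to defend.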
## 1 Introduction
Figure 1: Our setup. We evaluate on three types of task (top row). _Extractive_, where there is a question, two answer options, and a source article to extract from, with information asymmetry, meaning that judges don't get to see the article. _Closed_, where there is just a question and two answer options. _Multimodal_, where the questions involve both text and images, and two answer options.
We consider six protocols (middle and bottom rows): _Consultancy_, where a single AI is assigned the correct/incorrect answer (with probability 50/50) and tries to convince a judge that asks questions; _Open consultancy_, which is similar except the AI chooses which answer to argue for. _Debate_, where two AIs compete to convince a judge. _Open debate_, which is identical except one debater, marked the protagonist, chooses which answer to argue for. _QA without article_, where
... (truncated, 98 KB total)