Recent research on adversarial debate

paper

2024·arXiv·arxiv.org/abs/2409.16636

Authors

Samuel Arnesen·David Rein·Julian Michael

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A key empirical validation of AI debate as a scalable oversight method; directly relevant to researchers working on supervision techniques for advanced AI systems where human evaluation is bottlenecked by expertise or time.

Paper Details

Citations

0 influential

Year

2024

arXiv:2409.16636 DOI:10.48550/arXiv.2409.16636 Semantic Scholar

Metadata

Importance: 72/100arxiv preprintprimary source

Abstract

We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that language model based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to directly evaluate.

Summary

This paper provides the first empirical demonstration that training language models to win debates via self-play and DPO improves evaluator accuracy on a long-context reading comprehension task. Debate-trained models produce stronger, more informative arguments compared to consultancy baselines, supporting debate as a viable scalable oversight mechanism for tasks difficult to evaluate directly.

Key Points

•First demonstration that training LMs to debate (via self-play + DPO) measurably improves judge accuracy, addressing a gap in prior training-based debate work.
•Debate-trained models outperform consultancy baselines (models trained to persuade without opposition), suggesting adversarial structure drives argument quality.
•Results are validated on a long-context reading comprehension task where direct evaluation is challenging, approximating real scalable oversight conditions.
•Develops a multi-turn variant of Direct Preference Optimization (DPO) specifically adapted for the debate training setup.
•Provides empirical evidence supporting Irving et al.'s (2018) theoretical debate framework, complementing prior human experiment and inference-time optimization results.

Cited by 1 page

Page	Type	Quality
Scalable Oversight	Research Area	68.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202698 KB

[2409.16636] Training Language Models to Win Debates With Self-Play Improves Judge Accuracy 
 
 
 
 
 
 
 
 
 
 
 

 
 

 
 
 
 
 
 
 Training Language Models to Win Debates With Self-Play Improves Judge Accuracy

 
 
 Samuel Arnesen, David Rein, and Julian Michael 
 Center for Data Science, New York University 
 {samuel.p.arnesen, idavidrein, julianjohnmichael}@gmail.com
 
 

 
 Abstract

 We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that language model based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to directly evaluate.

 
 
 
 1 Introduction

 
 As AI systems tackle increasingly challenging problems, it will become correspondingly more difficult for humans to verify their answers as safe, useful, and accurate. For example, confirming the solution to a graduate-level physics problem requires domain expertise, evaluating a literature review requires considerable time, and identifying a race condition in code requires careful reasoning, all of which a human may struggle with under practical time and resource constraints. As existing AI alignment and oversight approaches depend on reliable human supervision, we will need new interaction mechanisms and training protocols for scalable oversight   (Amodei et al., 2016 ; Bowman et al., 2022 ) , i.e., ones which scale with the increased complexity of the tasks being performed by state-of-the-art AI models.

 
 
 Debate, proposed as a scalable oversight method by Irving et al. ( 2018 ) , works by having two copies of a model argue against each other in defense of alternative responses to a question. A judge, who can be either a human or a weaker, trusted model, tries to discern which debater is defending the correct answer. In principle, debate should simplify evaluation by incentivizing the competing models to discover and explain the subtle flaws that a human or weaker model may not notice due to a lack of expertise, care, or time. As long as the refutational abilities of models scale alongside their general argumentation skills, we would expect that debates between more proficient models will yield more accurate judgments.

 
 
 Validating debate as an oversight paradigm requires showing this empirically. Existing work has produced promising results for debate in human experiments (Michael et al., 2023 ) and with inference-time optimization of frontier models (Khan et al., 2024 ; Kenton et al., 2024 ) , b

... (truncated, 98 KB total)

Resource ID: 5bf590d69438a2f2 | Stable ID: sid_VIvIyeu3Iv