
Confidence Escalation in Debates

paper

Authors

Pradyumna Shyama Prasad · Minh Nhat Nguyen

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Directly relevant to AI safety concerns about autonomous agents relying on self-reported confidence scores; demonstrates that LLMs cannot reliably self-assess or update beliefs in multi-turn adversarial settings, a critical failure mode for agentic deployments like coding agents.

Paper Details

Citations: 1 (0 influential)
Year: 2025
Methodology: experimental study

Metadata

Importance: 72/100 · arXiv preprint · primary source

Abstract

Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLMs are now increasingly deployed without careful review in assistant and agentic roles. Code for our experiments is available at https://github.com/pradyuprasad/llms_overconfidence

Summary

This paper demonstrates systematic confidence miscalibration in LLMs during adversarial debates: models start with 72.9% average win confidence (vs. rational 50%), escalate to 83% by the final round, and in 61.7% of debates both sides simultaneously claim ≥75% victory probability—a logical impossibility. The findings reveal fundamental failures in belief updating and self-assessment across ten state-of-the-art LLMs, with implications for agentic and assistant deployments.
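The experimental protocol is simple to outline. The sketch below is a minimal reconstruction of that setup, assuming a hypothetical `ask_model` helper in place of an actual API client and paraphrased prompts; it is not the authors' code (linked from the abstract), only an illustration of the three-round, privately scored format.

```python
# Minimal sketch of one three-round debate with per-round private confidence,
# following the protocol described in the paper. `ask_model` is a hypothetical
# stand-in for an LLM API call; it is not the authors' implementation.

def ask_model(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call to `model`."""
    raise NotImplementedError

def run_debate(motion: str, pro_model: str, con_model: str, rounds: int = 3):
    transcript = []                          # public speeches, visible to both sides
    confidences = {"pro": [], "con": []}     # private 0-100 win estimates per round

    for rnd in range(1, rounds + 1):
        for side, model in (("pro", pro_model), ("con", con_model)):
            speech = ask_model(
                model,
                f"Motion: {motion}\nDebate so far:\n" + "\n".join(transcript)
                + f"\nYou argue the {side.upper()} side. Give your round-{rnd} speech.",
            )
            transcript.append(f"[{side} round {rnd}] {speech}")

            # Private confidence elicitation: the opponent never sees this number.
            raw = ask_model(
                model,
                "Privately, on a 0-100 scale, how confident are you that you "
                "will win this debate? Reply with a number only.",
            )
            confidences[side].append(float(raw.strip().rstrip("%")))

    return transcript, confidences
```

Because the format is zero-sum, the two private estimates elicited in the same round should sum to roughly 100; the paper's headline numbers measure how far the observed estimates drift above that constraint.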

Key Points

  • LLMs exhibit systematic overconfidence from the start of debates (72.9% avg initial confidence vs. rational 50% baseline), suggesting poor prior calibration.
  • Confidence escalates rather than adjusts during debates, reaching 83% by round 3 despite receiving counterarguments—models fail to update beliefs under opposition.
  • In 61.7% of debates, both sides simultaneously claimed ≥75% win probability, a logical impossibility indicating mutual overestimation is widespread (see the sketch after this list).
  • Even when debating identical copies or explicitly told win probability is 50%, models still increased confidence, ruling out task uncertainty as an explanation.
  • Models' private chain-of-thought reasoning sometimes diverged from their stated confidence ratings, raising concerns about faithfulness of internal reasoning.
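As a rough illustration of how the escalation and mutual-overestimation figures above could be computed from per-round confidence logs, here is a minimal sketch assuming the `confidences` structure from the previous example; it is not the authors' analysis code.

```python
# Sketch: aggregate the statistics reported above from a list of debates, where
# each debate is the `confidences` dict from the previous sketch:
# {"pro": [r1, r2, r3], "con": [r1, r2, r3]}.

def summarize(debates: list[dict[str, list[float]]], threshold: float = 75.0) -> dict:
    initial = [d[side][0] for d in debates for side in ("pro", "con")]
    final = [d[side][-1] for d in debates for side in ("pro", "con")]

    # Mutual overestimation: some round in which BOTH sides claim >= threshold,
    # impossible if the numbers were genuine win probabilities in a zero-sum game.
    mutual = sum(
        1 for d in debates
        if any(p >= threshold and c >= threshold for p, c in zip(d["pro"], d["con"]))
    )

    return {
        "mean_initial_confidence": sum(initial) / len(initial),  # paper reports 72.9
        "mean_final_confidence": sum(final) / len(final),        # paper reports ~83
        "mutual_overestimation_rate": mutual / len(debates),     # paper reports 0.617
    }
```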

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Scalable Oversight | Research Area | 68.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
[License: CC BY 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)

arXiv:2505.19184v2 [cs.CL] 27 May 2025

# When Two LLMs Debate, Both Think They’ll Win


Pradyumna Shyama Prasad∗

School of Computing

National University of Singapore

pradyumna.prasad@u.nus.edu

Minh Nhat Nguyen∗

Independent

minh1228@gmail.com


###### Abstract


Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed ≥75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.


∗Equal contribution

## 1 Introduction


Large language models (LLMs) are increasingly deployed in complex domains requiring critical thinking and reasoning under uncertainty, such as coding and research (Handa et al., [2025](https://arxiv.org/html/2505.19184v2#bib.bib8 ""); Zheng et al., [2025](https://arxiv.org/html/2505.19184v2#bib.bib40 "")). A foundational requirement is calibration—aligning confidence with correctness. Poorly calibrated LLMs create risks: In assistant roles, users may accept incorrect but confidently-stated legal analysis without verification, especially in domains where they lack expertise, while in agentic settings, autonomous

... (truncated, 98 KB total)