
Confidence Escalation in Debates

paper

Authors

Pradyumna Shyama Prasad · Minh Nhat Nguyen

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Directly relevant to AI safety concerns about autonomous agents relying on self-reported confidence scores; demonstrates that LLMs cannot reliably self-assess or update beliefs in multi-turn adversarial settings, a critical failure mode for agentic deployments like coding agents.

Paper Details

Citations: 1 (0 influential)
Year: 2025
Methodology: experimental study

Metadata

Importance: 72/100 · arXiv preprint · primary source

Abstract

Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLMs are now increasingly deployed without careful review in assistant and agentic roles. Code for our experiments is available at https://github.com/pradyuprasad/llms_overconfidence

Summary

This paper demonstrates systematic confidence miscalibration in LLMs during adversarial debates: models start with 72.9% average win confidence (vs. rational 50%), escalate to 83% by the final round, and in 61.7% of debates both sides simultaneously claim ≥75% victory probability—a logical impossibility. The findings reveal fundamental failures in belief updating and self-assessment across ten state-of-the-art LLMs, with implications for agentic and assistant deployments.
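The experimental protocol is simple to outline. The sketch below is a minimal reconstruction of that setup, assuming a hypothetical `ask_model` helper in place of an actual API client and paraphrased prompts; it is not the authors' code (linked from the abstract), only an illustration of the three-round, privately scored format.

```python
# Minimal sketch of one three-round debate with per-round private confidence,
# following the protocol described in the paper. `ask_model` is a hypothetical
# stand-in for an LLM API call; it is not the authors' implementation.

def ask_model(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call to `model`."""
    raise NotImplementedError

def run_debate(motion: str, pro_model: str, con_model: str, rounds: int = 3):
    transcript = []                          # public speeches, visible to both sides
    confidences = {"pro": [], "con": []}     # private 0-100 win estimates per round

    for rnd in range(1, rounds + 1):
        for side, model in (("pro", pro_model), ("con", con_model)):
            speech = ask_model(
                model,
                f"Motion: {motion}\nDebate so far:\n" + "\n".join(transcript)
                + f"\nYou argue the {side.upper()} side. Give your round-{rnd} speech.",
            )
            transcript.append(f"[{side} round {rnd}] {speech}")

            # Private confidence elicitation: the opponent never sees this number.
            raw = ask_model(
                model,
                "Privately, on a 0-100 scale, how confident are you that you "
                "will win this debate? Reply with a number only.",
            )
            confidences[side].append(float(raw.strip().rstrip("%")))

    return transcript, confidences
```

Because the format is zero-sum, the two private estimates elicited in the same round should sum to roughly 100; the paper's headline numbers measure how far the observed estimates drift above that constraint.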

Key Points

  • LLMs exhibit systematic overconfidence from the start of debates (72.9% avg initial confidence vs. rational 50% baseline), suggesting poor prior calibration.
  • Confidence escalates rather than adjusts during debates, reaching 83% by round 3 despite receiving counterarguments—models fail to update beliefs under opposition.
  • In 61.7% of debates, both sides simultaneously claimed ≥75% win probability, a logical impossibility indicating mutual overestimation is widespread (see the sketch after this list).
  • Even when debating identical copies or explicitly told win probability is 50%, models still increased confidence, ruling out task uncertainty as an explanation.
  • Models' private chain-of-thought reasoning sometimes diverged from their stated confidence ratings, raising concerns about faithfulness of internal reasoning.
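As a rough illustration of how the escalation and mutual-overestimation figures above could be computed from per-round confidence logs, here is a minimal sketch assuming the `confidences` structure from the previous example; it is not the authors' analysis code.

```python
# Sketch: aggregate the statistics reported above from a list of debates, where
# each debate is the `confidences` dict from the previous sketch:
# {"pro": [r1, r2, r3], "con": [r1, r2, r3]}.

def summarize(debates: list[dict[str, list[float]]], threshold: float = 75.0) -> dict:
    initial = [d[side][0] for d in debates for side in ("pro", "con")]
    final = [d[side][-1] for d in debates for side in ("pro", "con")]

    # Mutual overestimation: some round in which BOTH sides claim >= threshold,
    # impossible if the numbers were genuine win probabilities in a zero-sum game.
    mutual = sum(
        1 for d in debates
        if any(p >= threshold and c >= threshold for p, c in zip(d["pro"], d["con"]))
    )

    return {
        "mean_initial_confidence": sum(initial) / len(initial),  # paper reports 72.9
        "mean_final_confidence": sum(final) / len(final),        # paper reports ~83
        "mutual_overestimation_rate": mutual / len(debates),     # paper reports 0.617
    }
```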

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Scalable Oversight | Research Area | 68.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
[License: CC BY 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)

arXiv:2505.19184v2 [cs.CL] 27 May 2025

# When Two LLMs Debate, Both Think They’ll Win


Pradyumna Shyama Prasad∗

School of Computing

National University of Singapore

pradyumna.prasad@u.nus.edu

Minh Nhat Nguyen∗

Independent

minh1228@gmail.com


###### Abstract


Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed ≥75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.


∗Equal contribution

## 1 Introduction


Large language models (LLMs) are increasingly deployed in complex domains requiring critical thinking and reasoning under uncertainty, such as coding and research (Handa et al., [2025](https://arxiv.org/html/2505.19184v2#bib.bib8 ""); Zheng et al., [2025](https://arxiv.org/html/2505.19184v2#bib.bib40 "")). A foundational requirement is calibration—aligning confidence with correctness. Poorly calibrated LLMs create risks: In assistant roles, users may accept incorrect but confidently-stated legal analysis without verification, especially in domains where they lack expertise, while in agentic settings, autonomous

... (truncated, 98 KB total)