Longterm Wiki

ForecastBench: Dynamic LLM Forecasting Benchmark

web
forecastbench.org

ForecastBench is a dynamic benchmark that measures LLM forecasting accuracy against human baselines. It is relevant to AI safety because forecasting ability serves as a proxy for general intelligence, so the benchmark helps track AI capability progress toward and beyond human-level performance.

Metadata

Importance: 62/100 · tool page · tool

Summary

ForecastBench is a contamination-free benchmark that evaluates LLM forecasting accuracy against human comparison groups, including superforecasters. It maintains both a baseline leaderboard (no tools) and a tournament leaderboard (with scaffolding/tools), and projects when LLMs will reach superforecaster-level performance.

Key Points

  • Dynamic, contamination-free benchmark preventing LLMs from training on benchmark questions, ensuring valid capability measurement.
  • Compares LLM forecasting performance against human baselines, including superforecasters, and treats forecasting accuracy as a proxy for general intelligence.
  • Dual leaderboards: baseline (raw model performance) and tournament (with tool use, fine-tuning, ensembling).
  • Tracks historical progress in LLM forecasting capabilities and projects the date of LLM-superforecaster parity.
  • Open to public submissions, enabling broad participation in capability evaluation.
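
The parity projection mentioned in the key points can be pictured as a trend extrapolation. The source does not say which method ForecastBench uses, so the following is only a minimal sketch that assumes an ordinary least-squares line through (date, score) points. The names `fit_line` and `projected_parity` and all observation values are hypothetical; only the 70.6 target is taken from the superforecaster row in the cached leaderboard preview.

```python
from datetime import date, timedelta


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need observations on at least two distinct dates")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def projected_parity(observations, target):
    """Date at which the fitted trend reaches `target`, or None if the trend is not rising.

    `observations` is a list of (date, score) pairs, earliest first.
    """
    origin = observations[0][0]
    xs = [(d - origin).days for d, _ in observations]
    ys = [score for _, score in observations]
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    days_to_target = (target - intercept) / slope
    return origin + timedelta(days=round(days_to_target))


# Hypothetical best-model scores, invented purely for illustration.
observations = [
    (date(2024, 9, 1), 62.0),
    (date(2025, 3, 1), 64.0),
    (date(2025, 9, 1), 66.0),
]
print(projected_parity(observations, target=70.6))
```

A straight line is the simplest possible choice here; a real projection would also need uncertainty bounds, which this sketch omits.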

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Forecasting Research Institute (FRI) | Organization | 55.0 |
| ForecastBench | Project | 53.0 |

1 FactBase fact citing this source

| Entity | Property | Value | As Of |
| --- | --- | --- | --- |
| ForecastBench | Founded Date | Sep 2024 | |

Cached Content Preview

HTTP 200 · Fetched May 17, 2026 · 4 KB
# ForecastBench

A dynamic, contamination-free benchmark of LLM forecasting accuracy with human comparison groups, serving as a valuable proxy for general intelligence.

[Featured: Scoring with the Brier Index (Mar 4, 2026)](https://forecastingresearch.substack.com/p/introducing-the-brier-index)
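
For readers unfamiliar with the underlying metric, the standard Brier score for binary questions is the mean squared difference between forecast probabilities and realized outcomes. The sketch below shows only that standard score. How the Brier Index rescales it is defined in the linked post and is not reproduced here, and the function name `brier_score` is an illustrative choice rather than anything from the ForecastBench codebase.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities in [0, 1] and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecast, and always answering 0.5
    scores 0.25 regardless of how the questions resolve.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Three binary questions: the first two resolved "yes", the third "no".
print(brier_score([0.9, 0.6, 0.2], [1, 1, 0]))
```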

# Tournament leaderboard

Tracks frontier accuracy by allowing tool use to improve LLM performance. Models can be scaffolded, fine-tuned, ensembled, and so on. Open to [public submissions](https://github.com/forecastingresearch/forecastbench/wiki/How-to-submit-to-ForecastBench).

[Tournament leaderboard](https://forecastbench.org/leaderboards/)

| Rank | Org | Model | Overall |
| --- | --- | --- | --- |
| 1 | ![ForecastBench](https://forecastbench.org/assets/images/org_logos/fri.png) | Superforecaster median forecast | 70.6 |
| 2 | ![Google DeepMind](https://forecastbench.org/assets/images/org_logos/deepmind.svg) | green tree | 68.2 |
| 3 | ![Google DeepMind](https://forecastbench.org/assets/images/org_logos/deepmind.svg) | yellow mouse | 67.8 |
| 4 | ![xAI](https://forecastbench.org/assets/images/org_logos/xai.svg) | Grok 4.20 (Preview) | 67.6 |
| 5 | ![Cassi-AI](https://forecastbench.org/assets/images/org_logos/cassi-ai.png) | Cassi ensemble\_2\_crowdadj | 67.5 |
| 6 | ![OpenAI](https://forecastbench.org/assets/images/org_logos/openai.svg) | GPT-5-2025-08-07 (zero shot with crowd forecast) | 67.0 |
| 6 | ![Google](https://forecastbench.org/assets/images/org_logos/deepmind.svg) | Gemini-3-Pro-Preview (zero shot with crowd forecast) | 67.0 |
| 6 | ![Lightning Rod Labs](https://forecastbench.org/assets/images/org_logos/lightningrod.jpg) | Foresight-32B | 67.0 |
| 9 | ![xAI](https://forecastbench.org/assets/images/org_logos/xai.svg) | Grok-4-0709 (zero shot with crowd forecast) | 66.8 |
| 10 | ![Anthropic](https://forecastbench.org/assets/images/org_logos/anthropic.svg) | Claude-3-7-Sonnet-20250219 (scratchpad with crowd forecast) | 66.7 |
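
The Rank column above follows standard competition ranking: the three entries tied at 67.0 share rank 6, and the next entry is rank 9. As a minimal sketch (the helper name `competition_ranks` is hypothetical, and this is not claimed to be how ForecastBench computes its ranks), the column can be reproduced from the Overall scores:

```python
def competition_ranks(scores):
    """Rank scores in descending order; ties share the lowest applicable rank."""
    first_position = {}
    for position, score in enumerate(sorted(scores, reverse=True), start=1):
        # Remember only the first (best) position at which each score appears.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Overall scores copied from the tournament leaderboard table above.
tournament_scores = [70.6, 68.2, 67.8, 67.6, 67.5, 67.0, 67.0, 67.0, 66.8, 66.7]
print(competition_ranks(tournament_scores))  # [1, 2, 3, 4, 5, 6, 6, 6, 9, 10]
```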

# Baseline leaderboard

| Rank | Org | Model | Overall |
| --- | --- | --- | --- |
| 1 | ![ForecastBench](https://forecastbench.org/assets/images/org_logos/fri.png) | Superforecaster median forecast | 70.6 |
| 2 | ![ForecastBench](https://forecastbench.org/assets/images/org_logos/fri.png) | Public median forecast | 65.0 |
| 3 | ![OpenAI](https://forecastbench.org/assets/images/org_logos/openai.svg) | O3-2025-04-16 (scratchpad) | 64.9 |
| 4 | ![Anthropic](https://forecastbench.org/assets/images/org_logos/anthropic.svg) | Claude-3-7-Sonnet-20250219 (scratchpad) | 64.4 |
| 5 | ![xAI](https://forecastbench.org/assets/images/org_logos/xai.svg) | Grok-beta (zero shot) | 64.0 |
| 6 | ![OpenAI](https://forecastbench.org/assets/images/org_logos/openai.svg) | GPT-4.5-Preview-2025-02-27 (zero shot) | 63.9 |
| 6 | ![Google](https://forecastbench.org/assets/images/org_logos/deepmind.svg) | Gemini-2.5-Pro-Preview-03-25 (zero shot) | 63.9 |
| 8 | ![OpenAI](https://forecastbench.org/assets/images/org_logos/openai.svg) | O4-Mini-2025-04-16 (zero sh

... (truncated, 4 KB total)
Resource ID: kb-c808dd961e2e3c1d | Stable ID: sid_WNgUtLr8jP