ForecastBench: Dynamic LLM Forecasting Benchmark
ForecastBench is a dynamic benchmark that measures LLM forecasting accuracy against human baselines. It is relevant to AI safety because forecasting ability serves as a proxy for general intelligence and helps track AI capability progress toward, and beyond, human-level performance.
Metadata
Importance: 62/100 | tool page | tool
Summary
ForecastBench is a contamination-free benchmark that evaluates LLM forecasting accuracy against human comparison groups, including superforecasters. It maintains both a baseline leaderboard (no tools) and a tournament leaderboard (with scaffolding/tools), and projects when LLMs will reach superforecaster-level performance.
Key Points
- Dynamic, contamination-free benchmark that prevents LLMs from training on benchmark questions, ensuring valid capability measurement.
- Compares LLM forecasting performance against human baselines, including superforecasters; forecasting ability is treated as a proxy for general intelligence.
- Dual leaderboards: baseline (raw model performance) and tournament (with tool use, fine-tuning, ensembling).
- Tracks historical progress in LLM forecasting capabilities and projects the date of LLM-superforecaster parity.
- Open to public submissions, enabling broad participation in capability evaluation.
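Forecast accuracy on resolved yes/no questions is commonly measured with the Brier score: the mean squared difference between the forecast probability and the 0/1 outcome, where lower is better. The sketch below illustrates that standard score only. It is not the Brier Index shown on the ForecastBench leaderboards (where higher is better), whose formula is not given on this page. The function name and the sample forecasts are hypothetical.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Standard Brier score for binary questions (0.0 is a perfect score)."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("at least one forecast is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical forecasts and resolutions, for illustration only.
forecasts = [0.9, 0.2, 0.5]
resolved = [1, 0, 1]
score = brier_score(forecasts, resolved)

# An uninformative forecaster who always answers 0.5 scores exactly 0.25.
always_half = brier_score([0.5, 0.5, 0.5], resolved)
```

A forecaster beats the uninformative 0.5 baseline when their score falls below 0.25.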
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Forecasting Research Institute (FRI) | Organization | 55.0 |
| ForecastBench | Project | 53.0 |
1 FactBase fact citing this source
| Entity | Property | Value | As Of |
|---|---|---|---|
| ForecastBench | Founded Date | Sep 2024 | — |
Cached Content Preview
HTTP 200 | Fetched May 17, 2026 | 4 KB
# ForecastBench
A dynamic, contamination-free benchmark of LLM forecasting accuracy with human comparison groups, serving as a valuable proxy for general intelligence.
[Featured: Scoring with the Brier Index (Mar 4, 2026)](https://forecastingresearch.substack.com/p/introducing-the-brier-index)
# Tournament leaderboard
Tracks frontier accuracy by allowing tool use to improve LLM performance. Models can be scaffolded, fine-tuned, ensembled, and so on. Open to [public submissions](https://github.com/forecastingresearch/forecastbench/wiki/How-to-submit-to-ForecastBench).
[Tournament leaderboard](https://forecastbench.org/leaderboards/)
| Rank | Org | Model | Overall |
| --- | --- | --- | --- |
| 1 |  | Superforecaster median forecast | 70.6 |
| 2 |  | green tree | 68.2 |
| 3 |  | yellow mouse | 67.8 |
| 4 |  | Grok 4.20 (Preview) | 67.6 |
| 5 |  | Cassi ensemble\_2\_crowdadj | 67.5 |
| 6 |  | GPT-5-2025-08-07 (zero shot with crowd forecast) | 67.0 |
| 6 |  | Gemini-3-Pro-Preview (zero shot with crowd forecast) | 67.0 |
| 6 |  | Foresight-32B | 67.0 |
| 9 |  | Grok-4-0709 (zero shot with crowd forecast) | 66.8 |
| 10 |  | Claude-3-7-Sonnet-20250219 (scratchpad with crowd forecast) | 66.7 |
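The distance between each tournament entry and the superforecaster median can be read off the table above. A minimal sketch of that arithmetic follows, using only the first four rows as they appear in this preview. The variable names are illustrative, and the snapshot reflects the fetch date rather than the live leaderboard.

```python
# Rows copied from the tournament leaderboard preview: (model, overall score).
tournament = [
    ("Superforecaster median forecast", 70.6),
    ("green tree", 68.2),
    ("yellow mouse", 67.8),
    ("Grok 4.20 (Preview)", 67.6),
]

reference_name = "Superforecaster median forecast"
reference_score = dict(tournament)[reference_name]

# Points by which each entry trails the superforecaster median,
# rounded to the one-decimal precision of the leaderboard.
gaps = {
    name: round(reference_score - score, 1)
    for name, score in tournament
    if name != reference_name
}

for name, gap in gaps.items():
    print(f"{name}: {gap} points behind")
```

In this snapshot the best-placed model entry trails the superforecaster median by 2.4 points.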
| Rank | Org | Model | Overall |
| --- | --- | --- | --- |
| 1 |  | Superforecaster median forecast | 70.6 |
| 2 |  | Public median forecast | 65.0 |
| 3 |  | O3-2025-04-16 (scratchpad) | 64.9 |
| 4 |  | Claude-3-7-Sonnet-20250219 (scratchpad) | 64.4 |
| 5 |  | Grok-beta (zero shot) | 64.0 |
| 6 |  | GPT-4.5-Preview-2025-02-27 (zero shot) | 63.9 |
| 6 |  | Gemini-2.5-Pro-Preview-03-25 (zero shot) | 63.9 |
| 8 |  | O4-Mini-2025-04-16 (zero sh
... (truncated, 4 KB total)
Resource ID: kb-c808dd961e2e3c1d | Stable ID: sid_WNgUtLr8jP