AI Forecasting Benchmark Tournament
Quarterly competition (Q2 2025: 348 questions, 54 bot-makers, $30K prizes) comparing human Pro Forecasters against AI bots. Statistical testing shows humans maintain a significant lead (p = 0.00001), though AI improved ~24% from Q3 to Q4 2024. The best AI baseline is OpenAI's o3, the top bot-makers are students and hobbyists, and ensemble methods significantly improve performance.
Quick Assessment
| Dimension | Assessment | Evidence |
|---|---|---|
| Scale | Large | 348 questions (Q2 2025), 54 bot-makers participating |
| Rigor | High | Statistical significance testing, standardized scoring (Peer score) |
| Competitive | Strong | $30K quarterly prizes, API credits, public leaderboard |
| Key Finding | Clear | Pro Forecasters significantly outperform AI (p = 0.00001), though gap narrowing |
| Industry Support | Robust | OpenAI and Anthropic provide API credits |
| Practical Impact | Growing | Demonstrates current AI forecasting limitations and progress rate |
Tournament Details
| Attribute | Details |
|---|---|
| Name | AI Forecasting Benchmark Tournament |
| Abbreviation | AIB (or AI Benchmark) |
| Organization | Metaculus |
| Launched | 2024 |
| Structure | 4-month seasonal tournament + bi-weekly MiniBench |
| Website | metaculus.com/aib/ |
| Prize Pool | $30,000 per quarter |
| Industry Partners | OpenAI (API credits), Anthropic (API credits) |
Overview
The AI Forecasting Benchmark Tournament is Metaculus's flagship initiative for comparing human and AI forecasting capabilities. Launched in 2024, the tournament runs in two parallel series:
- Primary Seasonal Tournament: 4-month competitions with ~300-400 questions
- MiniBench: Bi-weekly fast-paced tournaments for rapid iteration
Participants can compete using API credits provided by OpenAI and Anthropic, encouraging experimentation with frontier LLMs. The tournament has become the premier benchmark for tracking AI progress on forecasting—a domain that requires integrating diverse information sources, reasoning under uncertainty, and calibrating confidence to match reality.
Structure
| Component | Duration | Question Count | Prize Pool |
|---|---|---|---|
| Seasonal Tournament | 4 months | ≈300-400 | $30,000 |
| MiniBench | 2 weeks | ≈20-30 | Varies |
Both components use Metaculus's Peer score metric, which compares forecasters to each other and equalizes for question difficulty, making performance comparison fair across different question sets.
Historical Results
Quarterly Performance Trajectory
| Quarter | Best Bot Performance | Gap to Pro Forecasters | Key Development |
|---|---|---|---|
| Q3 2024 | -11.3 | Large negative gap | Initial baseline |
| Q4 2024 | -8.6 | Moderate negative gap | ≈24% improvement |
| Q1 2025 | First place (metac-o1) | Narrowing gap | First bot to lead leaderboard |
| Q2 2025 | OpenAI o3 (baseline) | Statistical gap remains (p = 0.00001) | Humans maintain clear lead |
Note: Score of 0 = equal to comparison group. Negative scores mean underperformance relative to Pro Forecasters.
Q2 2025 Detailed Results
Q2 2025 tournament results provided key insights:
| Metric | Finding |
|---|---|
| Questions | 348 |
| Bot-makers | 54 |
| Statistical significance | Pro Forecasters lead at p = 0.00001 |
| Top bot-makers | Top 3 (excluding Metaculus in-house) were students or hobbyists |
| Aggregation effect | Taking median or mean of multiple forecasts improved scores significantly |
| Best baseline bot | OpenAI's o3 |
Key Findings
Pro Forecasters Maintain Lead
Despite rapid AI improvement, human expert forecasters remain statistically significantly better:
| Evidence | Interpretation |
|---|---|
| p = 0.00001 | Extremely strong statistical significance |
| Consistent across quarters | Not a fluke; reproducible result |
| Even best bots trail | Top AI systems still below human expert level |
Students and Hobbyists Competitive
The top 3 bot-makers (excluding Metaculus's in-house bots) in Q2 2025 were students or hobbyists, not professional AI researchers:
| Implication | Explanation |
|---|---|
| Low barrier to entry | API access + creativity > credentials |
| Forecasting as craft | Domain knowledge + prompt engineering matters more than ML expertise |
| Innovation from edges | Some of the best approaches come from non-traditional participants |
Aggregation Helps Significantly
Taking the median or mean of multiple LLM forecasts rather than single calls substantially improved scores:
| Method | Performance |
|---|---|
| Single LLM call | Baseline |
| Median of multiple calls | Significantly better |
| Mean of multiple calls | Significantly better |
This suggests that ensemble methods are critical for AI forecasting, similar to how aggregating multiple human forecasters improves accuracy.
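A minimal sketch of this kind of aggregation is shown below. It assumes the per-call probabilities have already been obtained from an LLM; the example numbers, clamp bounds, and function names are illustrative, not the tournament's actual bot code.

```python
import statistics

def aggregate_forecasts(probabilities, method="median"):
    """Combine several independent probability forecasts for one binary question.

    probabilities: list of floats in (0, 1), e.g. one per LLM call.
    Clamping keeps the aggregate away from 0 and 1, where log scores blow up.
    """
    if method == "median":
        p = statistics.median(probabilities)
    elif method == "mean":
        p = statistics.fmean(probabilities)
    else:
        raise ValueError(f"unknown method: {method}")
    return min(max(p, 0.01), 0.99)

# Five hypothetical single-call forecasts for the same question.
single_calls = [0.62, 0.55, 0.70, 0.58, 0.64]
print(aggregate_forecasts(single_calls, "median"))  # 0.62
print(aggregate_forecasts(single_calls, "mean"))    # ~0.618
```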
Metaculus Community Prediction Remains Strong
The Metaculus Community Prediction has an average Peer score of 12.9, ranking in the top 10 of the global leaderboard for every two-year period since 2016. Aggregated human forecasts therefore remain world-class and set a high bar for AI systems to match.
Technical Implementation
Bot Development Process
Participants develop forecasting bots using:
| Component | Description |
|---|---|
| API Access | OpenAI and Anthropic provide credits |
| Metaculus API | Fetch questions, submit forecasts |
| Prompt Engineering | Craft prompts that produce well-calibrated forecasts |
| Aggregation Logic | Combine multiple model calls or different models |
| Continuous Learning | Iterate based on quarterly feedback |
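A hedged sketch of that loop is below. The endpoint paths, authentication header, query parameters, and response fields are assumptions about the general shape of the Metaculus API, not verified details, and the placeholder forecast_question is where prompt engineering and aggregation (as sketched above) would go. A real bot should follow the current API documentation and tournament rules.

```python
import os
import requests

API_BASE = "https://www.metaculus.com/api2"  # assumed base path
HEADERS = {"Authorization": f"Token {os.environ['METACULUS_TOKEN']}"}  # assumed auth scheme
TOURNAMENT_ID = 0  # placeholder: the current AIB tournament id

def fetch_open_questions():
    """Fetch open questions in the tournament (parameters and fields assumed)."""
    resp = requests.get(
        f"{API_BASE}/questions/",
        params={"tournaments": TOURNAMENT_ID, "status": "open"},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["results"]

def forecast_question(question_title):
    """Stand-in for the actual forecasting logic: prompt one or more LLMs,
    parse their probabilities, and aggregate them (see the sketch above)."""
    return 0.5  # placeholder probability

def submit_forecast(question_id, probability):
    """Submit a probability for a binary question (endpoint and payload assumed)."""
    resp = requests.post(
        f"{API_BASE}/questions/{question_id}/predict/",
        json={"prediction": probability},
        headers=HEADERS,
    )
    resp.raise_for_status()

for q in fetch_open_questions():
    submit_forecast(q["id"], forecast_question(q["title"]))
```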
Scoring: Peer Score
Metaculus uses Peer score for fair comparison:
| Feature | Benefit |
|---|---|
| Relative comparison | Compares forecasters to each other, not absolute truth |
| Difficulty adjustment | Accounts for question hardness |
| Time-averaged | Rewards updating when new information emerges |
| Equalizes participation | Forecasters with different time constraints comparable |
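As a concrete illustration, the sketch below computes a simplified, single-snapshot Peer score: each forecaster's log score minus the average log score of the other forecasters on the same question, scaled by 100. The live metric is also averaged over the time each forecast stood, so treat this as an approximation of the official scoring rather than its implementation.

```python
import math

def log_score(p, outcome):
    """Log score of a binary forecast: ln of the probability assigned to what happened."""
    return math.log(p if outcome else 1.0 - p)

def peer_scores(forecasts, outcome):
    """Simplified Peer score: 100 x (own log score - mean of the others' log scores).

    forecasts: dict mapping forecaster name -> probability of YES.
    outcome: True if the question resolved YES, else False.
    """
    scores = {name: log_score(p, outcome) for name, p in forecasts.items()}
    result = {}
    for name, own in scores.items():
        others = [s for other, s in scores.items() if other != name]
        result[name] = 100.0 * (own - sum(others) / len(others))
    return result

# Worked example: three forecasters on a question that resolved YES.
print(peer_scores({"pro": 0.80, "bot": 0.60, "community": 0.70}, outcome=True))
# pro ~ +21, bot ~ -22, community ~ +1: better-than-average forecasts score positive.
```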
Baseline Bots
Metaculus provides baseline bots for comparison:
| Bot | Method | Purpose |
|---|---|---|
| GPT-4o | Vanilla frontier LLM | Standard baseline |
| o3 | OpenAI's reasoning model | Best performance (Q2 2025) |
| Claude variants | Anthropic frontier models | Alternative baseline |
| Metaculus in-house | Custom implementations | Metaculus's own research |
Comparison with Other Projects
| Project | Organization | Focus | Structure | Scale |
|---|---|---|---|---|
| AI Forecasting Benchmark | Metaculus | Human vs AI | Quarterly tournaments | ≈350 questions/quarter |
| ForecastBench | FRI | AI benchmarking | Continuous evaluation | 1,000 questions |
| XPT (Existential Risk Persuasion Tournament) | FRI | Expert collaboration | One-time tournament | ≈100 questions |
| Good Judgment | Good Judgment Inc | Superforecaster panels | Ongoing operations | Client-specific |
The AI Forecasting Benchmark's quarterly structure balances rapid iteration (faster than XPT) with sufficient time for meaningful comparison (longer than weekly competitions).
Industry Partnerships
OpenAI and Anthropic Support
Both frontier AI labs provide API credits to tournament participants:
| Benefit | Impact |
|---|---|
| Free experimentation | Lowers cost barrier for participants |
| Frontier model access | Ensures latest capabilities are tested |
| Corporate validation | Labs view forecasting as important benchmark |
| Data for research | Labs learn from bot performance patterns |
Implications for AI Development
Forecasting as Intelligence Proxy
The tournament provides empirical data on AI's ability to:
| Capability | Forecasting Relevance |
|---|---|
| Information integration | Combine diverse sources to estimate probabilities |
| Calibration | Match confidence to actual frequency of outcomes (see the sketch below) |
| Temporal reasoning | Project trends forward in time |
| Uncertainty quantification | Express degrees of belief numerically |
| Continuous learning | Update beliefs as new information emerges |
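Calibration in this sense can be checked mechanically on resolved questions: bucket forecasts by stated probability and compare the average confidence in each bucket with how often the event actually happened. The sketch below is illustrative; the bin count and sample data are invented.

```python
def calibration_table(resolved_forecasts, n_bins=10):
    """Compare stated confidence with observed frequency, per probability bin.

    resolved_forecasts: list of (probability, resolved_yes) pairs.
    A well-calibrated forecaster has mean forecast ~= observed frequency in each bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, resolved in resolved_forecasts:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, resolved))
    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue
        mean_forecast = sum(p for p, _ in items) / len(items)
        observed = sum(1 for _, r in items if r) / len(items)
        rows.append((idx / n_bins, mean_forecast, observed, len(items)))
    return rows

# Invented history of resolved binary forecasts: (stated probability, did it happen?)
history = [(0.9, True), (0.85, True), (0.8, False), (0.3, False), (0.25, True), (0.1, False)]
for low_edge, mean_forecast, observed, n in calibration_table(history, n_bins=5):
    print(f"bin >= {low_edge:.1f}: mean forecast {mean_forecast:.2f}, observed {observed:.2f} (n={n})")
```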
Near-Term Milestones
Based on current trajectory:
| Milestone | Estimated Timing | Significance |
|---|---|---|
| Bot equals median human | Achieved (Q1 2025) | AI matches casual forecasters |
| Bot equals Pro Forecaster | 2026-2027? | AI matches human experts |
| Bot exceeds Community Prediction | 2027-2028? | AI exceeds aggregated human wisdom |
These milestones serve as empirical indicators of AI reasoning progress.
Relationship to Metaculus Ecosystem
The AI Forecasting Benchmark integrates with Metaculus's broader platform:
| Component | Relationship to AI Benchmark |
|---|---|
| Pro Forecasters | Human comparison group |
| Community Prediction | Aggregated human baseline |
| AI 2027 Tournament | AI-specific questions for human forecasters |
| Track Record Page | Historical calibration data |
Use Cases
AI Research
Researchers use the tournament to:
- Benchmark new model architectures
- Test prompt engineering strategies
- Validate aggregation methods
- Track capability progress over time
Forecasting Methodology
The tournament informs:
- When to trust AI vs human forecasts
- How to combine AI and human forecasts (see the sketch after this list)
- Optimal ensemble strategies
- Calibration techniques
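One simple way to combine an AI forecast with a human one, for example, is a weighted average in log-odds space. The sketch below uses an arbitrary 70/30 weighting toward the human forecast purely for illustration; the tournament's scoring data is what would actually inform the weights (and whether to extremize the result).

```python
import math

def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))

def combine_human_ai(p_human, p_ai, w_human=0.7):
    """Weighted log-odds pool of one human and one AI forecast.

    w_human is an illustrative assumption, not a tournament-derived value.
    """
    z = w_human * logit(p_human) + (1.0 - w_human) * logit(p_ai)
    return 1.0 / (1.0 + math.exp(-z))

print(combine_human_ai(p_human=0.75, p_ai=0.60))  # ~0.71
```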
AI Safety
The tournament provides evidence for:
- Current AI reasoning capabilities
- Rate of AI capability progress
- Domains where AI still trails humans
- Potential for AI-assisted forecasting on x-risk questions
Strengths and Limitations
Strengths
| Strength | Evidence |
|---|---|
| Large scale | 300-400 questions per quarter |
| Real-time competition | Ongoing rather than one-time |
| Industry support | OpenAI and Anthropic API credits |
| Public leaderboard | Transparent comparison |
| Statistical rigor | Significance testing, controlled scoring |
| Accessible | Students/hobbyists competitive with professionals |
Limitations
| Limitation | Impact |
|---|---|
| Quarterly lag | Results only every 3-4 months |
| API cost dependency | Limits experimentation for some participants |
| Question selection | May not cover all important domains |
| Bot sophistication ceiling | Diminishing returns to complexity? |
| Human baseline variability | Pro Forecaster performance may change over time |
Funding
The tournament is supported by:
| Source | Type | Amount |
|---|---|---|
| Coefficient Giving (formerly Open Philanthropy) | Grant funding to Metaculus | $1.5M+ (2022-2023) |
| OpenAI | API credit sponsorship | Not disclosed |
| Anthropic | API credit sponsorship | Not disclosed |
| Prize Pool | Per quarter | $10,000 |
Total annual prize commitment: $120,000 (4 quarters × $30K).
Future Directions
Potential enhancements based on current trajectory:
| Enhancement | Benefit | Challenge |
|---|---|---|
| Increase question diversity | Test broader capabilities | Curation effort |
| Add multi-turn forecasting | Test updating based on new info | More complex protocol |
| Reason evaluation | Assess whether bots "understand" | Subjective judgment |
| Cross-tournament comparison | Link to ForecastBench, Good Judgment | Standardization |
| Adversarial questions | Test robustness | Question design |