Is Scaling All You Need?

Comprehensive survey of the 2024-2025 scaling debate, documenting the shift from pure pretraining to 'scaling-plus' approaches after o3 achieved 87.5% on ARC-AGI-1 but GPT-5 faced 2-year delays. Expert consensus has moved to ~45% favoring hybrid approaches, with data wall projected 2026-2030 and AGI timelines spanning 5-30+ years depending on paradigm.


The Scaling Debate

Question: Can we reach AGI through scaling alone, or do we need new paradigms?
Stakes: Determines AI timeline predictions and research priorities
Expert Consensus: Strong disagreement between scaling optimists and skeptics

Quick Assessment

| Dimension | Assessment | Evidence |
| --- | --- | --- |
| Resolution Status | Partially resolved toward scaling-plus | Reasoning models (o1, o3) demonstrate new scaling regimes; pure pretraining scaling stalling |
| Expert Consensus | ~25% favor pure scaling, ~30% favor new paradigms, ~45% favor hybrid | Stanford AI Index 2025 surveys; lab behavior |
| Key Milestone (Pro-Scaling) | o3 achieves 87.5% on ARC-AGI-1 | ARC Prize Technical Report: $3,460/task at maximum compute |
| Key Milestone (Anti-Scaling) | GPT-5 delayed 2 years; pure pretraining hits ceiling | Fortune (Feb 2025): industry pivots to reasoning |
| Data Wall Timeline | 2026-2030 for human-generated text | Epoch AI (2022): stock exhausted depending on overtraining |
| Investment Level | $500B+ committed through 2029 | Stargate Project: OpenAI, SoftBank, Oracle joint venture |
| Stakes | Determines timeline predictions (5-15 vs 15-30+ years to AGI) | Affects safety research priorities, resource allocation, policy |

One of the most consequential debates in AI is whether we can achieve AGI simply by making current approaches (transformers, neural networks) bigger, or whether we need fundamental breakthroughs in architecture and methodology.


The Question

The debate centers on whether the remarkable progress of AI from 2019-2024 will continue along the same trajectory, or whether we're approaching fundamental limits that require new approaches.


Scaling hypothesis: Current deep learning approaches will reach human-level and superhuman intelligence through:

  • More compute (bigger models, longer training)
  • More data (larger, higher-quality datasets)
  • Better engineering (efficiency improvements)

New paradigms hypothesis: We need fundamentally different approaches because current methods hit fundamental limits.
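The scaling hypothesis leans on empirically fitted power laws relating loss to model size. Below is a minimal sketch of what such a law predicts, with constants roughly in line with those reported by Kaplan et al. (2020) but treated here as illustrative rather than authoritative:

```python
# Kaplan-style parameter scaling law: L(N) = (N_c / N) ** alpha.
# Constants roughly follow Kaplan et al. (2020); treat them as illustrative.
N_C = 8.8e13   # normalizing constant, in parameters
ALPHA = 0.076  # scaling exponent for parameter count

def predicted_loss(n_params: float) -> float:
    """Predicted pretraining cross-entropy loss for a model of n_params."""
    return (N_C / n_params) ** ALPHA

for n in (1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

Each 10x in parameters multiplies predicted loss by 10^-0.076 ≈ 0.84: steady but diminishing absolute gains, which is part of why optimists and skeptics can read the same curves differently.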

The Evidence Landscape

| Evidence Type | Favors Scaling | Favors New Paradigms | Interpretation |
| --- | --- | --- | --- |
| GPT-3 → GPT-4 gains | Strong: major capability jumps | | Pretraining scaling worked through 2023 |
| GPT-4 → GPT-5 delays | | Strong: 2-year development time | Fortune: pure pretraining ceiling hit |
| o1/o3 reasoning models | Strong: new scaling regime found | Moderate: required paradigm shift | Reinforcement learning unlocked gains |
| ARC-AGI-1 scores | Strong: o3 achieves 87.5% | Moderate: $3,460/task cost | Brute force, not generalization |
| ARC-AGI-2 benchmark | | Strong: under 5% for all models | Humans still solve 100% |
| Model convergence | | Moderate: top-10 Elo gap shrunk 11.9% → 5.4% | Stanford AI Index: diminishing differentiation |
| Parameter efficiency | Strong: 142x reduction for MMLU 60% | | 540B (2022) → 3.8B (2024) |

Key Positions

Where different researchers and organizations stand:

Ilya Sutskever (formerly OpenAI; now SSI): strong-scaling

Has consistently predicted that scaling will be sufficient; OpenAI's strategy was built on this.

Evidence: GPT-2/3/4 trajectory; scaling law predictions
Quote: "Unsupervised learning + scaling is all you need"
Confidence: high

Dario Amodei (Anthropic): scaling-plus

Believes scaling is the primary driver, with important safety additions (Constitutional AI, etc.).

Evidence: Anthropic's research strategy
Quote: "Scaling works, but we need to scale safely"
Confidence: high

Yann LeCun (Meta, until Nov 2025): new-paradigm

Argues LLMs are missing crucial components such as world models and planning.

Evidence: JEPA proposal; critique of autoregressive models
Quote: "Auto-regressive LLMs are a dead end for AGI"
Confidence: high

Gary Marcus: strong-skeptic

Argues deep learning is fundamentally limited: scaling just makes bigger versions of the same limitations.

Evidence: Persistent reasoning failures; lack of compositionality
Quote: "Scaling just gives you more of the same mistakes"
Confidence: high

DeepMind: scaling-plus

Combines scaling with algorithmic innovations (AlphaGo, AlphaFold, Gemini).

Evidence: Hybrid approaches
Quote: "Scale and innovation together"
Confidence: medium

François Chollet: new-paradigm

Created the ARC benchmark to show LLMs can't generalize; argues we need fundamentally different approaches.

Evidence: ARC benchmark results; "On the Measure of Intelligence"
Quote: "LLMs memorize, they don't generalize"
Confidence: high

Key Cruxes

  • Will scaling unlock planning and reasoning?

    Yes (emergent capabilities): Many capabilities emerged unpredictably; planning and reasoning may too at sufficient scale. Implication: continue scaling, AGI within years. Confidence: medium.

    No (requires architectural changes): These capabilities require different computational structures than next-token prediction. Implication: new paradigms needed, AGI more distant. Confidence: medium.

  • Is the data wall real?

    Yes (quality data runs out soon): The internet is finite and synthetic data degrades, a fundamental limit on scaling. Implication: scaling hits a wall by ~2026. Confidence: medium.

    No (many ways around it): Synthetic data, multimodal data, data efficiency, and curriculum learning all help. Implication: scaling can continue for a decade or more. Confidence: medium.

  • Do reasoning failures indicate fundamental limits?

    Yes (architectural gap): The same types of failures persist across scales, with no improvement on these dimensions. Implication: scaling is insufficient. Confidence: high.

    No (just need more scale): Performance is improving and may cross a threshold with more scale. Implication: keep scaling. Confidence: low.

  • What would disprove the scaling hypothesis?

    Scaling 100x with no qualitative improvement: If we scale 100x from GPT-4 and see only incremental gains, that suggests limits and would validate the skeptics. Confidence: medium.

    Running out of data or compute: If practical limits prevent further scaling, the question becomes moot; new approaches would be required by necessity. Confidence: medium.

What Would Change Minds?

For scaling optimists to update toward skepticism:

  • Scaling 100x with only marginal capability improvements
  • Hitting hard data or compute walls
  • Proof that key capabilities (planning, causality) can't emerge from current architectures
  • Persistent failures on simple reasoning despite increasing scale

For skeptics to update toward scaling:

  • GPT-5/6 showing qualitatively new reasoning capabilities
  • Solving ARC or other generalization benchmarks via pure scaling
  • Continued emergent abilities at each scale-up
  • Clear path around data limitations

The Data Wall

A critical constraint on scaling is the availability of training data. Epoch AI research projects that high-quality human-generated text will be exhausted between 2026 and 2030, depending on training efficiency.
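The arithmetic behind such projections is simple: a fixed stock of usable tokens meets a training demand that grows multiplicatively each year. A back-of-envelope sketch, with all numbers illustrative assumptions rather than Epoch AI's published estimates:

```python
# Back-of-envelope data-wall projection. All values below are illustrative
# assumptions, not Epoch AI's published estimates.
STOCK_TOKENS = 300e12    # assumed stock of high-quality human text, in tokens
dataset_tokens = 15e12   # assumed size of the largest 2024 training set
GROWTH_PER_YEAR = 2.0    # assumed yearly growth factor in training-set size

year = 2024
while dataset_tokens < STOCK_TOKENS:
    year += 1
    dataset_tokens *= GROWTH_PER_YEAR
print(f"Frontier datasets reach the stock around {year}")  # 2029 here
```

Shifting the assumed stock, starting size, or growth rate moves the crossover by a few years in either direction, which is why published projections span 2026-2030.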

Data Availability Projections

| Data Source | Current Usage | Exhaustion Timeline | Mitigation |
| --- | --- | --- | --- |
| High-quality web text | ~300B tokens/year | 2026-2028 | Quality filtering, multimodal |
| Books and academic papers | ~10% utilized | 2028-2030 | OCR improvements, licensing |
| Code repositories | ~50B tokens/year | 2027-2029 | Synthetic generation |
| Multimodal (video, audio) | Under 5% utilized | 2030+ | Epoch AI: could 3x available data |
| Synthetic data | Nascent | Unlimited potential | Microsoft SynthLLM: performance plateaus at ~300B tokens |

Elon Musk stated in 2024 that AI has "already exhausted all human-generated publicly available data." However, Anthropic's position is that "data quality and quantity challenges are a solvable problem rather than a fundamental limitation," with synthetic data remaining "highly promising."

The Synthetic Data Question

A key uncertainty is whether synthetic data can substitute for human-generated data. Research shows mixed results (a toy illustration of the collapse mechanism follows this list):

  • Positive: Microsoft's SynthLLM demonstrates scaling laws hold for synthetic data
  • Negative: A Nature study found that "abusing" synthetic data leads to "irreversible defects" and "model collapse" after a few generations
  • Nuanced: Performance improvements plateau at approximately 300B synthetic tokens
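The collapse mechanism has a simple intuition: each generation is fit to a finite sample of the previous generation's output, so rare behaviors that draw zero samples vanish and can never return. A toy sketch with a token histogram standing in for a language model (purely illustrative; the Nature study concerns neural generative models, not histograms):

```python
import numpy as np

# Toy model collapse: fit a token histogram to samples drawn from the
# previous generation's distribution. Tokens that draw zero samples get
# zero probability and can never reappear, so diversity only shrinks.
rng = np.random.default_rng(0)
VOCAB = 1000
probs = np.full(VOCAB, 1.0 / VOCAB)  # generation 0: uniform "model"

for gen in range(1, 6):
    corpus = rng.choice(VOCAB, size=2000, p=probs)  # "synthetic corpus"
    counts = np.bincount(corpus, minlength=VOCAB)
    probs = counts / counts.sum()                   # next generation's fit
    print(f"generation {gen}: {np.count_nonzero(probs)}/{VOCAB} tokens survive")
```

Mixing in fresh human data each generation breaks this ratchet, which is one reading of why curated synthetic pipelines can still show scaling gains.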

Implications for AI Safety

This debate has major implications for AI safety strategy, resource allocation, and policy priorities.

Timeline and Strategy Implications

| Scenario | AGI Timeline | Safety Research Priority | Policy Urgency |
| --- | --- | --- | --- |
| Scaling works | 5-10 years | LLM alignment, RLHF improvements | Critical: must act now |
| Scaling-plus | 8-15 years | Reasoning model safety, scalable oversight | High: 5-10 year window |
| New paradigms | 15-30+ years | Broader alignment theory, unknown architectures | Moderate: time to prepare |
| Hybrid | 10-20 years | Both LLM and novel approaches | High: uncertainty requires robustness |

If scaling works:

  • Short timelines (AGI within 5-10 years)
  • Predictable capability trajectory
  • Safety research can focus on aligning scaled-up LLMs
  • Winner-take-all dynamics (whoever scales most wins)

If new paradigms needed:

  • Longer timelines (10-30+ years)
  • More uncertainty about capability trajectory
  • Safety research needs to consider unknown architectures
  • More opportunity for safety-by-default designs

Hybrid scenario (emerging consensus):

  • Medium timelines (5-15 years)
  • Some predictability, some surprises
  • Safety research should cover both scaled LLMs and new architectures
  • The o1/o3 reasoning paradigm suggests this is the most likely path

Resource Allocation Implications

The debate affects billions of dollars in investment decisions:

  • Stargate Project: $500B committed through 2029 by OpenAI, SoftBank, Oracle—implicitly betting on scaling
  • Meta's LLM focus: Yann LeCun's November 2025 departure to found Advanced Machine Intelligence Labs signals internal disagreement
  • DeepMind's approach: Combines scaling with algorithmic innovation (AlphaFold, Gemini)—hedging both sides

Historical Parallels

Cases where scaling worked:

  • ImageNet → Deep learning revolution (2012)
  • GPT-2 → GPT-3 → GPT-4 trajectory
  • AlphaGo scaling to AlphaZero
  • Transformer scaling unlocking new capabilities

Cases where new paradigms were needed:

  • Perceptrons → Neural networks (needed backprop + hidden layers)
  • RNNs → Transformers (needed attention mechanism)
  • Expert systems → Statistical learning (needed paradigm shift)

The question: Which pattern are we in now?

2024-2025: The Scaling Debate Intensifies

The past two years have provided significant new evidence, though interpretation remains contested.

Key Developments

| Date | Event | Implications |
| --- | --- | --- |
| Sep 2024 | OpenAI releases o1 reasoning model | New scaling paradigm: test-time compute |
| Dec 2024 | o3 achieves 87.5% on ARC-AGI-1 | ARC Prize: "surprising step-function increase" |
| Dec 2024 | Ilya Sutskever NeurIPS speech | "Pretraining as we know it will end" |
| Feb 2025 | GPT-5 pivot revealed | 2-year delay; pure pretraining ceiling hit |
| May 2025 | ARC-AGI-2 benchmark launched | All frontier models score under 5%; humans 100% |
| Aug 2025 | GPT-5 released | Performance gains mainly from inference-time reasoning |
| Nov 2025 | Yann LeCun leaves Meta | Founds AMI Labs to pursue world models |
| Jan 2026 | Davos AI debates | Hassabis vs LeCun on AGI timelines |

The Reasoning Revolution

The emergence of "reasoning models" in 2024-2025 partially resolved the debate by introducing a new scaling paradigm:

  • Test-time compute scaling: OpenAI observed that reinforcement learning exhibits "more compute = better performance" trends similar to pretraining
  • o3 benchmark results: 96.7% on AIME 2024, 87.7% on GPQA Diamond, 71.7% on SWE-bench Verified (vs o1's 48.9%)
  • Key insight: Rather than scaling model parameters, scale inference-time reasoning through reinforcement learning

This suggests a "scaling-plus" resolution: pure pretraining scaling has diminishing returns, but new scaling regimes (reasoning, test-time compute) can unlock continued progress.
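A toy way to see why test-time compute can act as a scaling axis: if a model solves a task with probability p per independent attempt and a verifier can select a correct attempt, best-of-n accuracy is 1 - (1 - p)^n. This is only a sketch with an assumed solve rate; o1/o3 use RL-trained chain-of-thought rather than naive resampling:

```python
# Toy model of test-time compute: best-of-n sampling with a reliable
# verifier. Real reasoning models use RL-trained chains of thought;
# this only illustrates the general compute-accuracy trend.
P_SINGLE = 0.2  # assumed per-attempt solve rate on a hard task

def best_of_n(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n

for n in (1, 4, 16, 64, 256):
    print(f"n={n:>3} attempts -> solve rate {best_of_n(P_SINGLE, n):.1%}")
```

Accuracy rises with inference compute but with diminishing returns per extra sample, echoing the "more compute = better performance" trend noted above.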

Expert Positions Have Shifted

Around 75% of AI experts don't believe scaling LLMs alone will lead to AGI—but many now believe scaling reasoning could work:

| Expert | 2023 Position | 2025 Position | Key Quote |
| --- | --- | --- | --- |
| Sam Altman | Pure scaling works | Scaling + reasoning | "There is no wall" (disputed) |
| Dario Amodei | Scaling is primary | Scaling "probably will continue" | Synthetic data "highly promising" |
| Yann LeCun | Skeptic | Strong skeptic | "LLMs are a dead end for AGI" |
| Ilya Sutskever | Strong scaling optimist | Nuanced | "Pretraining as we know it will end" |
| François Chollet | Skeptic | Skeptic, validated | Predicts human-level AI 2038-2048 |
| Demis Hassabis | Hybrid approach | AGI by 2030 possible | Scaling + algorithmic innovation |


Related Pages

  • Concepts: Meta AI (FAIR), Sam Altman, Ilya Sutskever, Large Language Models, AGI Timeline, Dense Transformers
  • Key Debates: When Will AGI Arrive?, Open vs Closed Source AI, Government Regulation vs Industry Self-Governance, Is Interpretability Sufficient for Safety?
  • Safety Research: Scalable Oversight
  • Models: Safety-Capability Tradeoff Model, Alignment Robustness Trajectory Model
  • Transition Model: AI Capabilities
  • Risks: Emergent Capabilities
  • Labs: Safe Superintelligence Inc.
  • Organizations: Alignment Research Center, Epoch AI
  • Historical: Deep Learning Revolution Era