
performance gap between US and Chinese models

web

A personal blog post by Jonas Vetterle offering a practitioner-level commentary on AI scaling trends; useful as a snapshot of mainstream discourse around scaling law debates in late 2024, but not a primary research source.

Metadata

Importance: 28/100 · blog post · commentary

Summary

A blog post analyzing the state of LLM scaling laws as of late 2024/early 2025, examining whether pre-training scaling has stalled and how post-training techniques and test-time compute scaling have driven recent progress. It contextualizes OpenAI's o3 breakthrough against a backdrop of pessimism about AI advancement and discusses the competitive landscape between US and Chinese AI labs.

Key Points

  • OpenAI's o3 achieved major breakthroughs on ARC and FrontierMath benchmarks, suggesting scaling has not fully stalled.
  • 2024 was characterized as a consolidation year where post-training and test-time compute scaling drove most model improvements rather than pre-training.
  • Pessimism about scaling laws 'hitting a wall' was widespread in media coverage, but recent releases challenge that narrative.
  • Workhorse models like GPT-4o and Sonnet 3.5 saw significant capability improvements, particularly in coding and math.
  • The piece frames the US-China model performance gap as a key dimension of the competitive AI landscape in 2025.

Cited by 2 pages

| Page | Type | Quality |
|------|------|---------|
| Dense Transformers | Concept | 58.0 |
| AI Scaling Laws | Concept | 92.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 23 KB

Authors

- Jonas Vetterle ([@jvetterle](https://twitter.com/jvetterle))

OpenAI just unveiled their new reasoning model, o3, which breaks previous SOTA on the [ARC dataset](https://arcprize.org/blog/oai-o3-pub-breakthrough) by a large margin and scored a breathtaking result on the challenging [FrontierMath](https://epoch.ai/frontiermath) dataset. While we're still updating our priors on what this means for the trajectory of AI progress, it's clear that the model is a significant step forward in terms of reasoning capabilities.

However, if you've read recent news coverage (i.e. up until last week) about [stalling AI progress](https://www.ft.com/content/f24ba8d5-4c33-47ef-a91e-8f76340b08c4), including [anonymous leaks](https://www.bloomberg.com/news/articles/2024-11-13/openai-google-and-anthropic-are-struggling-to-build-more-advanced-ai) and the occasional Gary Marcus [rant](https://garymarcus.substack.com/p/a-new-ai-scaling-law-shell-game), you probably noticed a certain degree of pessimism about the speed of advancement. Many were, and probably still are, wondering whether LLM Scaling Laws, which predict that increases in compute, data and model size lead to ever better models, have "hit a wall". Have we reached a limit in terms of how much we can scale the current paradigm: transformer-based LLMs?
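For reference, the scaling laws in question are usually stated as a power-law fit of training loss against parameter count and token count. Below is a minimal sketch, assuming the Chinchilla-style form and roughly the constants reported by Hoffmann et al. (2022); the example model/data sizes are purely illustrative and not taken from this post.

```python
# Chinchilla-style pre-training scaling law sketch:
#   L(N, D) = E + A / N**alpha + B / D**beta
# with constants roughly as fitted by Hoffmann et al. (2022).

def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted training loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling model size and data only shaves a little off the predicted loss:
# diminishing returns rather than a sudden wall.
for n, d in [(70e9, 1.4e12), (140e9, 2.8e12), (280e9, 5.6e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss ~ {chinchilla_loss(n, d):.3f}")
```

The point of the sketch is simply that, under such a fit, each extra order of magnitude of compute buys a smaller absolute improvement in loss, which is part of why "hitting a wall" and "still scaling" can both look plausible from the same curve.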

Apart from the releases of the first publicly available reasoning models (OpenAI's o1, Google's Gemini 2.0 Flash, and now also o3, which will be released to the public in 2025), most model providers have been focussing on what on the surface looked like incremental improvements to their existing models. In that sense, for the most part, 2024 has been a year of consolidation - many models have essentially caught up with what used to be the go-to model at the beginning of the year, GPT-4.

But that masks the progress that's actually been made to the "workhorse" models like GPT-4o, Sonnet 3.5, Llama 3 etc. (i.e. everything that's not a reasoning model), which are most commonly used in AI applications. The big labs have continued to ship new versions of these models that pushed SOTA performance across the board, and which came with huge improvements on tasks like coding and solving math problems.

One cannot help but notice that 2024 has been the year in which improvements in model performance were primarily driven by [post-training](https://www.jonvet.com/blog/llm-synthetic-data) and [scaling test-time compute](https://www.jonvet.com/blog/llm-test-time-compute). In terms of pre-training there hasn't been as much news. This has led to some speculation that the (pre-training) scaling laws are breaking down, and that we are reaching the limits of what is possible with current models, data and compute.
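To make "scaling test-time compute" concrete, one of the simplest versions of the idea is to sample the model several times and majority-vote the final answers (self-consistency). A minimal sketch follows; `generate` is a hypothetical stand-in for any LLM sampling call, and this is not a description of how o1/o3 work internally, which is considerably more elaborate.

```python
from collections import Counter
from typing import Callable

def majority_vote_answer(generate: Callable[[str], str],
                         prompt: str, n_samples: int = 16) -> str:
    """Spend more inference compute by drawing n_samples answers
    and returning the most common one (self-consistency voting)."""
    answers = [generate(prompt) for _ in range(n_samples)]
    # More samples -> more test-time compute -> (often) higher accuracy
    # on tasks like math, where wrong answers tend to disagree with
    # each other while correct ones coincide.
    return Counter(answers).most_common(1)[0][0]
```

The knob here is `n_samples`: accuracy improves with more samples at the cost of proportionally more inference compute, which is the basic trade-off behind the test-time scaling curves discussed in the linked post.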

Embedded tweet: [Visit this post on X](https://twitter.com/sama/status/1856941766915641580)

... (truncated, 23 KB total)
Resource ID: 7226d362130b23f8 | Stable ID: NDAyYTJjNj