Longterm Wiki

OpenAI, "Learning to Reason with LLMs" (2024).

web

Credibility Rating

4/5
High (4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

The o1 research blog post is a landmark capabilities announcement relevant to AI safety: it introduces deliberative reasoning at scale, which changes assumptions about model behavior, evaluation difficulty, and the pace of capability gains.

Metadata

Importance: 82/100 | blog post | primary source

Summary

OpenAI introduces the o1 model series, which uses reinforcement learning to train large language models to reason through complex problems via extended chain-of-thought before responding. The models demonstrate significantly improved performance on challenging benchmarks in mathematics, coding, and scientific reasoning. This represents a major capability advance with implications for both AI applications and AI safety evaluation.

Key Points

  • o1 uses reinforcement learning to develop internal chain-of-thought reasoning, allowing the model to 'think' longer on harder problems before answering.
  • Ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 US students on a USA Math Olympiad qualifier (AIME), and exceeds human PhD-level accuracy on science questions (GPQA), significantly outperforming GPT-4o.
  • The extended reasoning process can be partially inspected but the internal chain-of-thought is distinct from the final visible response, raising interpretability questions.
  • Improved reasoning capability correlates with improved safety behavior in some evaluations, but also introduces new risks around more capable autonomous action.
  • Represents a step toward models that can engage in extended deliberative reasoning, a key threshold in AI capability development.

Cited by 1 page

Page | Type | Quality
Large Language Models | Capability | 60.0

Cached Content Preview

HTTP 200 | Fetched Mar 20, 2026 | 35 KB
Learning to reason with LLMs | OpenAI

September 12, 2024

[Release](https://openai.com/research/index/release/)

# Learning to reason with LLMs

[Contributions](https://openai.com/openai-o1-contributions/) [Use o1](https://chatgpt.com/?openaicom-did=a80dcbed-8ac8-4187-a334-b7a4de60199d&openaicom_referred=true)

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1‑preview, for immediate use in ChatGPT and to [trusted API users](https://platform.openai.com/docs/guides/rate-limits/usage-tiers).

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.

![The image shows two scatter plots comparing "o1 AIME accuracy" during training and at test time. Both charts have "pass@1 accuracy" on the y-axis and compute (log scale) on the x-axis. The dots indicate increasing accuracy with more compute time.](https://images.ctfassets.net/kftzwdyauwt9/3OO9wpK8pjcdemjd7g50xk/5ec2cc9d11f008cd754e8cefbc1c99f5/compute.png?w=3840&q=90&fm=webp)

o1 performance smoothly improves with both train-time and test-time compute
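
The y-axis in the figure above is pass@1 accuracy. The excerpt does not say how it was computed, so the following is only a minimal sketch under one common assumption: pass@1 is estimated as the fraction of independently sampled answers that are graded correct, averaged over problems. The function name `pass_at_1` and the example data are hypothetical and are not taken from the post.

```python
from typing import Sequence


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Estimate pass@1 from graded samples.

    results[i][j] is True when the j-th sampled answer to problem i
    was graded correct. Assumption: pass@1 is the per-problem fraction
    of correct samples, averaged uniformly over problems.
    """
    if not results:
        raise ValueError("need at least one problem")
    per_problem = []
    for samples in results:
        if not samples:
            raise ValueError("each problem needs at least one sample")
        # bool is a subclass of int, so sum() counts the True entries.
        per_problem.append(sum(samples) / len(samples))
    return sum(per_problem) / len(per_problem)


# Hypothetical grading outcomes: 3 problems, 4 sampled answers each.
example = [
    [True, True, False, True],      # 3 of 4 correct
    [False, False, False, False],   # 0 of 4 correct
    [True, True, True, True],       # 4 of 4 correct
]
print(pass_at_1(example))
```

Plotting this quantity against compute on a log-scaled x-axis, once per training checkpoint or per test-time budget, would give curves of the kind the caption describes.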

## Evals

To highlight the reasoning improvement over GPT‑4o, we tested our models on a diverse set of human exams and ML benchmarks. We show that o1 significantly outperforms GPT‑4o on the vast majority of these reasoning-heavy tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.

![Competition math eval (AIME 2024)](https://cdn.openai.com/reasoning-evals/v3/headline-mobile-0.png?w=3840&q=90&fm=webp)

![Competition code (CodeForces)](https://cdn.openai.com/reasoning-evals/v3/headline-mobile-1.png?w=3840&q=90&fm=webp)

![PhD-Level Science Questions (GPQA Diamond)](https://cdn.openai.com/reasoning-evals/v3/headline-mobile-2.png?w=3840&q=90&fm=webp)

o1 greatly improves over GPT-4o on

... (truncated, 35 KB total)
Resource ID: 87fa21e4250398f4 | Stable ID: NjNhZWFhOD