Longterm Wiki

Chain-of-thought analysis

paper

Authors

Jason Wei·Xuezhi Wang·Dale Schuurmans·Maarten Bosma·Brian Ichter·Fei Xia·Ed Chi·Quoc Le·Denny Zhou

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Foundational research demonstrating that chain-of-thought prompting significantly improves large language model reasoning capabilities, which is relevant to understanding AI capabilities, limitations, and potential safety implications of advanced reasoning in LLMs.

Paper Details

Citations
1205 influential
Year
2022
Methodology
preprint
Categories
Natural Language Processing; Machine Learning

Metadata

arXiv preprint · primary source

Abstract

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

Summary

This paper demonstrates that chain-of-thought (CoT) prompting—providing intermediate reasoning steps as examples—significantly enhances large language models' complex reasoning capabilities. By prompting models with just a few CoT demonstrations, the authors show substantial performance improvements across arithmetic, commonsense, and symbolic reasoning tasks. Notably, a 540B-parameter model with eight CoT exemplars achieves state-of-the-art results on GSM8K math word problems, outperforming finetuned GPT-3 with a verifier, suggesting that reasoning abilities emerge naturally in sufficiently large models through this simple prompting technique.
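The prompting technique is simple enough to sketch directly. The following is an illustrative sketch, not the authors' code: it prepends a worked exemplar (taken from the paper's Figure 1, the "Roger's tennis balls" problem) to a new question, producing the few-shot chain-of-thought prompt that would be sent to a language model. The function and variable names are hypothetical.

```python
# Illustrative sketch of chain-of-thought prompting (not the authors' code).
# The exemplar is the worked example from Figure 1 of the paper; the
# reasoning steps before "The answer is 11." are what distinguish
# chain-of-thought prompting from standard few-shot prompting.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str, exemplars=(COT_EXEMPLAR,)) -> str:
    """Prepend worked examples (with reasoning steps) to a new question."""
    return "\n".join(exemplars) + f"\nQ: {question}\nA:"

# Query drawn from the paper's Figure 1.
prompt = build_cot_prompt(
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)
print(prompt)
```

Given this prompt, a sufficiently large model tends to imitate the exemplar's step-by-step style before stating its final answer, which is the emergent behavior the paper measures.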

Cited by 5 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei · Xuezhi Wang · Dale Schuurmans · Maarten Bosma

Brian Ichter · Fei Xia · Ed H. Chi · Quoc V. Le · Denny Zhou

Google Research, Brain Team

{jasonwei,dennyzhou}@google.com

###### Abstract

We explore how generating a chain of thought—a series of intermediate reasoning steps—significantly improves the ability of large language models to perform complex reasoning.
In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting.

Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.
The empirical gains can be striking.
For instance, prompting a PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

![Refer to caption](https://ar5iv.labs.arxiv.org/html/2201.11903/assets/x1.png)

Figure 1:
Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks.
Chain-of-thought reasoning processes are highlighted.

[Figure 2 bar chart — Math Word Problems (GSM8K), solve rate (%): Finetuned GPT-3 175B: 33; prior best: 55; PaLM 540B, standard prompting: 18; PaLM 540B, chain-of-thought prompting: 57.]

Figure 2:
PaLM 540B uses chain-of-thought prompting to achieve new state-of-the-art performance on the GSM8K benchmark of math word problems.
Finetuned GPT-3 and prior best are from Cobbe et al. ( [2021](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib16 "")).

## 1 Introduction

The NLP landscape has recently been revolutionized by language models (Peters et al., [2018](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib47 ""); Devlin et al., [2019](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib17 ""); Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib8 ""), inter alia).
Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency (Kaplan et al., [2020](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib27 ""); Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib8 ""), inter alia).
However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning (Rae et al., [2021](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib50 "")).

This work explores how the reasoning ability of large language models can be unlocked by a simple method motivated by two ideas. First, techniques for arithmetic reasoning can benefit from generating natura

... (truncated, 98 KB total)