Chain-of-thought analysis
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Foundational research demonstrating that chain-of-thought prompting significantly improves the reasoning capabilities of large language models, which is relevant to understanding the capabilities, limitations, and potential safety implications of advanced reasoning in LLMs.
Paper Details
Metadata
Abstract
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
Summary
This paper demonstrates that chain-of-thought (CoT) prompting—providing intermediate reasoning steps as examples—significantly enhances large language models' complex reasoning capabilities. By prompting models with just a few CoT demonstrations, the authors show substantial performance improvements across arithmetic, commonsense, and symbolic reasoning tasks. Notably, a 540B-parameter model with eight CoT exemplars achieves state-of-the-art results on GSM8K math word problems, outperforming finetuned GPT-3 with a verifier, suggesting that reasoning abilities emerge naturally in sufficiently large models through this simple prompting technique.
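To make the technique concrete, here is a minimal sketch of how a chain-of-thought few-shot prompt differs from a standard one. The exemplar text below is illustrative (it echoes the paper's well-known tennis-ball example, but is not copied from the paper's exact prompts), and any model-call function you would pair this with is assumed, not shown:

```python
# Sketch of standard vs. chain-of-thought (CoT) few-shot prompting.
# A standard exemplar maps a question directly to its answer:
STANDARD_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls "
    "each. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n"
)

# A CoT exemplar inserts intermediate reasoning steps before the final
# answer; at test time the model imitates this step-by-step format:
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls "
    "each. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_prompt(exemplars: list[str], question: str) -> str:
    """Concatenate few-shot exemplars and append the new question,
    leaving 'A:' open for the model to complete."""
    return "\n".join(exemplars) + f"\nQ: {question}\nA:"

prompt = build_prompt(
    [COT_EXEMPLAR],
    "A juggler has 16 balls. Half of the balls are golf balls. "
    "How many golf balls are there?",
)
```

The only difference between the two conditions is the demonstration text; no model weights change, which is why the paper describes the reasoning ability as being elicited rather than trained in.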
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| Reasoning and Planning | Capability | 65.0 |
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
| Capability Elicitation | Approach | 91.0 |
| Emergent Capabilities | Risk | 61.0 |
| Instrumental Convergence | Risk | 64.0 |
Cached Content Preview
# Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei Xuezhi Wang Dale Schuurmans Maarten Bosma
Brian Ichter Fei Xia Ed H. Chi Quoc V. Le Denny Zhou
Google Research, Brain Team
{jasonwei,dennyzhou}@google.com
###### Abstract
We explore how generating a chain of thought—a series of intermediate reasoning steps—significantly improves the ability of large language models to perform complex reasoning.
In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting.
Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.
The empirical gains can be striking.
For instance, prompting a PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
Figure 1:
Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks.
Chain-of-thought reasoning processes are highlighted.
[Figure: bar chart of GSM8K (math word problems) solve rates — Finetuned GPT-3 175B: 33%; Prior best: 55%; PaLM 540B, standard prompting: 18%; PaLM 540B, chain-of-thought prompting: 57%]
Figure 2:
PaLM 540B uses chain-of-thought prompting to achieve new state-of-the-art performance on the GSM8K benchmark of math word problems.
Finetuned GPT-3 and prior best are from Cobbe et al. ([2021](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib16 "")).
## 1 Introduction
The NLP landscape has recently been revolutionized by language models (Peters et al., [2018](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib47 ""); Devlin et al., [2019](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib17 ""); Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib8 ""), inter alia).
Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency (Kaplan et al., [2020](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib27 ""); Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib8 ""), inter alia).
However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning (Rae et al., [2021](https://ar5iv.labs.arxiv.org/html/2201.11903#bib.bib50 "")).
This work explores how the reasoning ability of large language models can be unlocked by a simple method motivated by two ideas. First, techniques for arithmetic reasoning can benefit from generating natura
... (truncated, 98 KB total)