Longterm Wiki

GSM8K

Math
Grade School Math 8K — a dataset of 8,500 linguistically diverse grade-school math word problems requiring multi-step reasoning with basic arithmetic operations.
Models Tested: 13
Best Score: 98.4
Median Score: 92.3
Scoring: accuracy
Introduced: 2021-10
Maintainer: OpenAI
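
As a rough illustration of what "Scoring: accuracy" means here, the sketch below counts a problem as correct when the final number in a model's output matches the final number in the reference solution. It relies on GSM8K reference solutions ending with a `#### <answer>` line. The function names `extract_final_answer` and `accuracy`, the fallback of taking the last number in the model output, and the two toy problems are assumptions made for illustration; this is not the harness used to produce the scores on this page.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_answer(text):
    """Return the number after the last '####' marker if present,
    otherwise the last number in the text (an assumed fallback)."""
    marker = text.rfind("####")
    tail = text[marker + 4:] if marker != -1 else text
    numbers = NUMBER.findall(tail)
    if not numbers:
        return None
    pick = numbers[0] if marker != -1 else numbers[-1]
    return float(pick.replace(",", ""))

def accuracy(predictions, references):
    """Percentage of problems whose extracted final answers match exactly."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty lists")
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = extract_final_answer(pred), extract_final_answer(ref)
        if p is not None and p == r:
            correct += 1
    return 100.0 * correct / len(references)

# Toy data, made up for this example (not real GSM8K items).
references = ["3 + 4 = 7 apples.\n#### 7", "2 * 1,250 = 2,500.\n#### 2,500"]
predictions = ["She has 3 + 4 = 7 apples. The answer is 7", "The total is 2400"]
print(accuracy(predictions, references))
```

Real evaluation setups differ in prompting (few-shot, chain-of-thought) and in how answers are parsed, which is one reason scores reported by different sources are not always directly comparable.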

Leaderboard (13 models)

#    Model               Developer         Score (%)
🥇   Claude Opus 4.6     Anthropic         98.4
🥈   Llama               Meta AI (FAIR)    96.8
🥉   Llama 3.1           Meta AI (FAIR)    96.8
4    Claude 3.7 Sonnet   Anthropic         96.4
5    Claude 3.5 Sonnet   Anthropic         96.4
6    Claude Opus 4.5     Anthropic         95
7    Claude 3 Sonnet     Anthropic         92.3
8    GPT                 OpenAI            92
9    Grok                xAI               89.3
10   DeepSeek Models     DeepSeek          89.3
11   DeepSeek V3         DeepSeek          89.3
12   GPT-3.5 Turbo       OpenAI            57.1
13   Mistral             Mistral AI        40.3
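
The summary figures at the top of the page (13 models, best score 98.4, median score 92.3) can be re-derived from the leaderboard scores. The short check below copies the scores from the leaderboard as listed; the variable name `scores` is illustrative only.

```python
from statistics import median

# Scores copied from the leaderboard above, in rank order (accuracy, percent).
scores = [98.4, 96.8, 96.8, 96.4, 96.4, 95, 92.3,
          92, 89.3, 89.3, 89.3, 57.1, 40.3]

print(len(scores))     # number of models tested
print(max(scores))     # best score
print(median(scores))  # median score (the 7th of 13 sorted values)
```

With 13 entries the median is simply the 7th value in sorted order, so no averaging of two middle values is involved.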