Longterm Wiki
Multilingual Grade School Math (MGSM): a benchmark of 250 grade-school math problems translated into 10 typologically diverse languages. It tests multilingual mathematical reasoning.
Models Tested: 10
Best Score: 92.5%
Median Score: 91.05%
Scoring: accuracy
Introduced: 2022-10
Maintainer: Google Research

Leaderboard (10 models)

#    Model               Developer          Score
1    Claude Opus 4.5     Anthropic          92.5%
2    Claude 3.7 Sonnet   Anthropic          92.4%
3    Gemini 2.5 Pro      Google DeepMind    92.2%
4    Claude 3.5 Sonnet   Anthropic          91.6%
5    Llama 3.1           Meta AI (FAIR)     91.6%
6    GPT                 OpenAI             90.5%
7    GPT-4o              OpenAI             90.5%
8    Claude 3.5 Haiku    Anthropic          85.6%
9    Claude 3 Sonnet     Anthropic          83.5%
10   Gemini 1.5 Flash    Google DeepMind    82.6%
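
The summary figures at the top of the page (models tested, best score, median score) can be checked against the leaderboard. The following is a minimal sketch, assuming the ten listed scores are the complete set and that the median is taken as the mean of the two middle values. The variable names are illustrative and not part of any published tooling.

```python
from statistics import median

# Accuracy scores (percent) copied from the leaderboard on this page.
# Model names are reproduced as listed, including the entry shown only as "GPT".
leaderboard = [
    ("Claude Opus 4.5", 92.5),
    ("Claude 3.7 Sonnet", 92.4),
    ("Gemini 2.5 Pro", 92.2),
    ("Claude 3.5 Sonnet", 91.6),
    ("Llama 3.1", 91.6),
    ("GPT", 90.5),
    ("GPT-4o", 90.5),
    ("Claude 3.5 Haiku", 85.6),
    ("Claude 3 Sonnet", 83.5),
    ("Gemini 1.5 Flash", 82.6),
]

scores = [score for _, score in leaderboard]

models_tested = len(scores)
best_score = max(scores)
# With an even number of entries, median() averages the two middle scores.
median_score = round(median(scores), 2)

print(f"Models tested: {models_tested}")
print(f"Best score: {best_score}%")
print(f"Median score: {median_score}%")
```

With ten entries the median falls between ranks 5 and 6, so it does not need to match any single model's score.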