Longterm Wiki

SimpleQA

Knowledge
A factual question-answering benchmark from OpenAI that poses short, fact-seeking questions with verifiable answers. It evaluates factual accuracy and calibration.
Models Tested: 4
Best Score: 52.9%
Median Score: 44.55%
Scoring: accuracy
Introduced: 2024-10
Maintainer: OpenAI

Leaderboard (4 models)

# | Model | Developer | Score
🥇 | Gemini 2.5 Pro | Google DeepMind | 52.9%
🥈 | o3 | OpenAI | 47.6%
🥉 | GPT-4.1 | OpenAI | 41.5%
4 | Claude Opus 4.5 | Anthropic | 36
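
The Best Score and Median Score shown on this page can be recomputed from the four leaderboard entries. The following is a minimal sketch, not the wiki's own code: the `scores` dictionary is an illustrative structure, and the last entry is taken as 36.0 exactly as displayed, which is an assumption since the source value may be truncated. With four entries, the median is the mean of the two middle scores, so it does not depend on the lowest value.

```python
from statistics import median

# Accuracy scores (percent) copied from the leaderboard.
# Assumption: the last entry is listed as "36" and is used as-is.
scores = {
    "Gemini 2.5 Pro": 52.9,
    "o3": 47.6,
    "GPT-4.1": 41.5,
    "Claude Opus 4.5": 36.0,
}

# Best score: the highest accuracy among the tested models.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# Median of an even-sized list: mean of the two middle values,
# here (41.5 + 47.6) / 2.
median_score = median(scores.values())

print(f"Models tested: {len(scores)}")
print(f"Best: {best_model} at {best_score}%")
print(f"Median: {median_score:.2f}%")
```

The results agree with the summary figures above: a best score of 52.9% and a median of 44.55%.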