Longterm Wiki

SimpleQA

Knowledge

A factual question-answering benchmark from OpenAI that tests models on short, fact-seeking questions with verifiable answers. It evaluates factual accuracy and calibration.

Models Tested: 3
Best Score: 52.9%
Median Score: 47.6%
Scoring: accuracy
Introduced: 2024-10
Maintainer: OpenAI

Leaderboard (3 models)

# | Model | Developer | Score
🥇 | Gemini 2.5 Pro | Google DeepMind | 52.9%
🥈 | o3 | OpenAI | 47.6%
🥉 | GPT-4.1 | OpenAI | 41.5%
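
The summary figures at the top of the page can be reproduced from the three leaderboard rows. The following is a minimal sketch, assuming the median is the ordinary statistical median of the listed scores (the page does not state how it is computed); the variable names are illustrative only.

```python
from statistics import median

# Scores copied from the leaderboard above (percent accuracy).
scores = {
    "Gemini 2.5 Pro": 52.9,
    "o3": 47.6,
    "GPT-4.1": 41.5,
}

models_tested = len(scores)
best_model = max(scores, key=scores.get)  # model with the highest score
best_score = scores[best_model]
# Assumption: "Median Score" is the plain median of the listed scores.
median_score = median(scores.values())

print(f"Models Tested: {models_tested}")
print(f"Best Score: {best_score}% ({best_model})")
print(f"Median Score: {median_score}%")
```

With an odd number of entries the median is simply the middle score, which here is o3's result.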