SimpleQA: A Benchmark for Evaluating Factual Accuracy in Language Models
Web Credibility Rating: 4/5 (High)
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
SimpleQA is relevant to AI safety as factual accuracy and calibrated uncertainty are components of honest, trustworthy AI behavior; hallucination is a known deployment risk.
Metadata
Importance: 55/100 · Type: tool page, dataset
Summary
SimpleQA is an OpenAI benchmark designed to evaluate the factual accuracy of large language models on short, unambiguous questions with single correct answers. It aims to measure 'calibrated uncertainty' and honest factual recall, providing a clean signal for whether models know what they claim to know. The benchmark is intended to track improvements in model honesty and factuality over time.
Key Points
- SimpleQA consists of short factual questions with definitive, verifiable single correct answers to reduce ambiguity in evaluation.
- The benchmark measures whether models answer correctly, abstain appropriately, or hallucinate, helping assess calibration and honesty.
- Designed to be difficult enough that current frontier models do not saturate it, leaving room to measure progress.
- Supports OpenAI's broader focus on reducing hallucinations and improving model truthfulness as a safety-relevant capability.
- Results reveal that even top models frequently hallucinate on factual queries, highlighting ongoing alignment challenges.
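The grading scheme described above (correct answer, appropriate abstention, or hallucination) maps naturally onto the correct / not-attempted / incorrect grades that SimpleQA assigns per question, from which aggregate accuracy and calibration-style metrics can be derived. As a minimal sketch (the function name and exact metric set here are illustrative assumptions, not OpenAI's reference implementation):

```python
from collections import Counter

def simpleqa_metrics(grades):
    """Aggregate per-question grades into SimpleQA-style metrics.

    `grades` is a list of strings, one per question:
    "correct", "incorrect", or "not_attempted".
    """
    counts = Counter(grades)
    n = len(grades)
    correct = counts["correct"] / n
    incorrect = counts["incorrect"] / n
    not_attempted = counts["not_attempted"] / n
    attempted = correct + incorrect
    # Accuracy restricted to questions the model chose to answer:
    # rewards abstaining over hallucinating.
    correct_given_attempted = correct / attempted if attempted else 0.0
    # Harmonic mean of overall accuracy and accuracy-when-attempting,
    # so a model cannot score well by abstaining on everything.
    if correct and correct_given_attempted:
        f_score = (2 * correct * correct_given_attempted
                   / (correct + correct_given_attempted))
    else:
        f_score = 0.0
    return {
        "correct": correct,
        "not_attempted": not_attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }

# Example: 5 correct, 3 incorrect, 2 abstentions out of 10 questions
print(simpleqa_metrics(["correct"] * 5 + ["incorrect"] * 3 + ["not_attempted"] * 2))
```

A model that hallucinates instead of abstaining lowers `correct_given_attempted`, which is why the benchmark can separate honest uncertainty from confident error.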
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 15, 2026 · 0 KB
OpenAI 404 page: "Link fell off the grid / Window fog draws kinder maps / Turn with open hands"
Resource ID: 064e5d8266218028 | Stable ID: OTU0ZjhmYj