Longterm Wiki

SimpleQA: A Benchmark for Evaluating Factual Accuracy in Language Models


Credibility Rating: 4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

SimpleQA is relevant to AI safety because factual accuracy and calibrated uncertainty are components of honest, trustworthy AI behavior, and hallucination is a known deployment risk.

Metadata

Importance: 55/100 · tool page · dataset

Summary

SimpleQA is an OpenAI benchmark designed to evaluate the factual accuracy of large language models on short, unambiguous questions with single correct answers. It aims to measure 'calibrated uncertainty' and honest factual recall, providing a clean signal for whether models know what they claim to know. The benchmark is intended to track improvements in model honesty and factuality over time.

Key Points

  • SimpleQA consists of short factual questions with definitive, verifiable single correct answers to reduce ambiguity in evaluation.
  • The benchmark measures whether models answer correctly, abstain appropriately, or hallucinate, helping assess calibration and honesty.
  • Designed to be difficult enough that current frontier models do not saturate it, leaving room to measure progress.
  • Supports OpenAI's broader focus on reducing hallucinations and improving model truthfulness as a safety-relevant capability.
  • Results reveal that even top models frequently hallucinate on factual queries, highlighting ongoing alignment challenges.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Large Language Models | Capability | 60.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 0 KB