TruthfulQA
Safety
A benchmark of 817 questions designed to test whether language models generate truthful answers, specifically targeting common misconceptions and falsehoods that models tend to reproduce.
Models Tested: 1
Best Score: 47
Median Score: 47
Scoring: accuracy
Introduced: 2021-09
Maintainer: Oxford
Leaderboard (1 model)

#    Model            Developer    Score
🥇   GPT-3.5 Turbo    OpenAI       47
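
The summary figures on this page (models tested, best score, median score) can be derived directly from the leaderboard rows. The following is a minimal sketch, assuming the median is the ordinary statistical median of the listed scores; the names `leaderboard` and `summarize` are hypothetical and are not taken from the wiki's own tooling.

```python
from statistics import median

# Leaderboard rows as listed on this page: (model, developer, score).
leaderboard = [
    ("GPT-3.5 Turbo", "OpenAI", 47),
]


def summarize(rows):
    """Return the summary statistics for a list of leaderboard rows.

    Assumption: every row carries a numeric accuracy score in its
    third field, and the list is non-empty.
    """
    scores = [score for _model, _developer, score in rows]
    return {
        "models_tested": len(scores),
        "best_score": max(scores),
        "median_score": median(scores),
    }


summary = summarize(leaderboard)
print(summary)
```

With a single entry, the best and median scores necessarily coincide, which is why both are reported as 47.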