Longterm Wiki

TruthfulQA

Safety
A benchmark of 817 questions designed to test whether language models generate truthful answers, specifically targeting common misconceptions and falsehoods that models tend to reproduce.
Models Tested: 1
Best Score: 47
Median Score: 47
Scoring: accuracy
Introduced: 2021-09
Maintainer: Oxford

Leaderboard (1 model)

# | Model | Developer | Score
🥇 | GPT-3.5 Turbo | OpenAI | 47
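
The summary figures on this page (models tested, best score, median score) follow directly from the leaderboard rows. A minimal sketch of that derivation, assuming the rows are held as simple tuples; the `leaderboard` list and the `summarize` helper are hypothetical names used only for illustration and are not part of the wiki:

```python
from statistics import median

# Leaderboard rows as (model, developer, score), copied from this page.
leaderboard = [("GPT-3.5 Turbo", "OpenAI", 47)]


def summarize(rows):
    """Return the count, best score, and median score for the given rows."""
    scores = [score for _, _, score in rows]
    return {
        "models_tested": len(scores),
        "best_score": max(scores),
        "median_score": median(scores),
    }


summary = summarize(leaderboard)
print(summary)
```

With a single entry, the best and median scores coincide, which is why both are listed as 47.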