Longterm Wiki

HellaSwag

Reasoning
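A commonsense natural language inference benchmark that tests whether a model can pick the most plausible continuation of a described scenario. The incorrect continuations were selected by adversarial filtering against language models, so they are hard for models to rule out while remaining easy for humans.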
Models Tested: 8
Best Score: 95.4%
Median Score: 90.75%
Scoring: accuracy
Introduced: 2019-05
Maintainer: AI2 / University of Washington
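
Because the benchmark is scored by accuracy over multiple-choice items, the scoring can be sketched as below. This is a minimal illustration, not the official evaluation code: it assumes each item has four candidate endings and that a model has already assigned one score per ending (for example a log-likelihood). The names `pick_ending` and `accuracy`, and the toy scores, are hypothetical.

```python
from typing import Sequence


def pick_ending(ending_scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate ending."""
    return max(range(len(ending_scores)), key=lambda i: ending_scores[i])


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of items where the predicted ending matches the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one item is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical per-ending scores for three items (assumed, not real model output).
toy_scores = [
    [-4.1, -2.3, -5.0, -3.8],  # highest score at index 1
    [-1.9, -2.2, -2.8, -3.0],  # highest score at index 0
    [-3.3, -3.1, -2.9, -3.5],  # highest score at index 2
]
toy_labels = [1, 0, 3]  # hypothetical gold endings

toy_predictions = [pick_ending(scores) for scores in toy_scores]
toy_accuracy = accuracy(toy_predictions, toy_labels)
print(f"accuracy = {toy_accuracy:.3f}")
```

In this toy run two of the three predictions match the gold labels. Real evaluation setups may differ in how the per-ending scores are computed, for instance by normalizing for ending length.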

Leaderboard (8 models)

#   Model               Developer         Score
1   Claude 3 Opus       Anthropic         95.4%
2   GPT-4               OpenAI            95.3%
3   GPT                 OpenAI            95%
4   Gemini 1.5 Pro      Google DeepMind   92.5%
5   Claude 3.5 Sonnet   Anthropic         89%
6   Gemini 1.0 Ultra    Google DeepMind   87.8%
7   GPT-3.5 Turbo       OpenAI            85.5%
8   Mistral             Mistral AI        84%
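
The summary figures at the top of the page (8 models, best score 95.4%, median 90.75%) can be re-derived from the listed scores. The sketch below does so with Python's standard library; the variable names are illustrative, and the model labels are copied as they appear in the leaderboard.

```python
from statistics import median

# Scores in percent, in leaderboard order, labeled as listed on the page.
leaderboard = [
    ("Claude 3 Opus", 95.4),
    ("GPT-4", 95.3),
    ("GPT", 95.0),
    ("Gemini 1.5 Pro", 92.5),
    ("Claude 3.5 Sonnet", 89.0),
    ("Gemini 1.0 Ultra", 87.8),
    ("GPT-3.5 Turbo", 85.5),
    ("Mistral", 84.0),
]

scores = [score for _, score in leaderboard]

models_tested = len(scores)
best_score = max(scores)
# With an even number of entries, the median is the mean of the two middle
# values: here (92.5 + 89.0) / 2.
median_score = median(scores)

print(f"Models tested: {models_tested}")
print(f"Best score:    {best_score}%")
print(f"Median score:  {median_score}%")
```

With eight entries the median falls between the fourth and fifth rows, which is why it does not match any single model's score.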