
HellaSwag

Category: Reasoning

A commonsense natural language inference benchmark that tests whether models can pick the most plausible continuation of an everyday scenario from four candidate endings. The incorrect endings are generated with Adversarial Filtering, which selects machine-written continuations that fool language models while remaining easy for humans to reject.
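As a rough illustration of how such a benchmark is scored, the sketch below evaluates 4-way multiple-choice items: a scoring function rates each candidate ending, the highest-scoring ending is the prediction, and accuracy is the fraction of items predicted correctly. The item and the word-overlap scorer are invented stand-ins; real HellaSwag evaluations typically rank endings by the language model's likelihood.

```python
# Minimal sketch of HellaSwag-style evaluation, under assumptions:
# each item has a context, four candidate endings, and a gold label.
# The toy item and scorer below are illustrative, not from the dataset.

def evaluate(items, score_ending):
    """Return accuracy over 4-way multiple-choice items."""
    correct = 0
    for item in items:
        scores = [score_ending(item["ctx"], e) for e in item["endings"]]
        # Predict the ending the scorer rates highest.
        prediction = max(range(len(scores)), key=scores.__getitem__)
        correct += prediction == item["label"]
    return correct / len(items)

def overlap_score(ctx, ending):
    # Toy stand-in scorer: counts words shared between context and ending.
    # A real evaluation would use the LM's log-likelihood of the ending.
    return len(set(ctx.lower().split()) & set(ending.lower().split()))

items = [
    {
        "ctx": "A man pours pancake batter into a hot pan. He",
        "endings": [
            "flips the pancake when bubbles form on top.",
            "throws the pan out the window.",
            "paints the wall blue.",
            "reads a book about sharks.",
        ],
        "label": 0,
    },
]
print(evaluate(items, overlap_score))  # → 1.0
```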

Models Tested: 5
Best Score: 95.4%
Median Score: 92.5%
Scoring: accuracy
Introduced: 2019-05
Maintainer: AI2 / University of Washington

Leaderboard (5 models)

| # | Model             | Developer       | Score |
|---|-------------------|-----------------|-------|
| 1 | Claude 3 Opus     | Anthropic       | 95.4% |
| 2 | GPT-4             | OpenAI          | 95.3% |
| 3 | Gemini 1.5 Pro    | Google DeepMind | 92.5% |
| 4 | Claude 3.5 Sonnet | Anthropic       | 89.0% |
| 5 | Gemini 1.0 Ultra  | Google DeepMind | 87.8% |