HellaSwag

Reasoning

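A commonsense natural language inference benchmark that tests whether a model can pick the most plausible continuation of a described scenario. Each item pairs a context with four candidate endings. The incorrect endings were selected by adversarial filtering: machine-generated endings were kept only when language-model discriminators had trouble telling them apart from the real one.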
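As a rough illustration of how a multiple-choice item like this is typically scored, the sketch below picks the highest-scoring candidate ending for each item and reports accuracy. The function names (`predict`, `accuracy`) and the per-ending scores are hypothetical and made up for the example. In practice the scores would come from a model, for instance as log-likelihoods of each ending given the context.

```python
from typing import Sequence


def predict(ending_scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate ending."""
    return max(range(len(ending_scores)), key=lambda i: ending_scores[i])


def accuracy(all_scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of items whose top-scoring ending matches the gold label."""
    if len(all_scores) != len(gold):
        raise ValueError("need exactly one gold label per item")
    if not gold:
        raise ValueError("need at least one item")
    correct = sum(predict(s) == g for s, g in zip(all_scores, gold))
    return correct / len(gold)


# Hypothetical scores for three items, four candidate endings each.
# These numbers are illustrative only; they are not model outputs.
scores = [
    [-4.1, -2.3, -5.0, -3.7],
    [-1.2, -3.3, -2.8, -4.0],
    [-2.9, -2.5, -3.1, -2.0],
]
gold_labels = [1, 0, 2]

print(f"{accuracy(scores, gold_labels):.1%}")  # prints 66.7%
```

The third item is counted wrong because the top-scoring ending (index 3) differs from the gold label (index 2), so two of three items are correct.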

Models Tested

8

Best Score

95.4%

Median Score

90.75%

Scoring: accuracy

Introduced: 2019-05

Maintainer: AI2 / University of Washington

Leaderboard (8 models)

#	Model	Developer	Score
🥇	Claude 3 Opus	Anthropic	95.4%
🥈	GPT-4	OpenAI	95.3%
🥉	GPT	OpenAI	95%
4	Gemini 1.5 Pro	Google DeepMind	92.5%
5	Claude 3.5 Sonnet	Anthropic	89%
6	Gemini 1.0 Ultra	Google DeepMind	87.8%
7	GPT-3.5 Turbo	OpenAI	85.5%
8	Mistral	Mistral AI	84%
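
The summary figures at the top of the page can be recomputed from the leaderboard rows. The sketch below is a minimal check that assumes every score in the table is a percentage; the variable names are illustrative.

```python
from statistics import median

# Scores copied from the leaderboard rows, in rank order (percent).
leaderboard_scores = [95.4, 95.3, 95.0, 92.5, 89.0, 87.8, 85.5, 84.0]

best = max(leaderboard_scores)
mid = median(leaderboard_scores)  # even count: mean of the two middle values

print(f"models: {len(leaderboard_scores)}, best: {best}%, median: {mid}%")
# prints models: 8, best: 95.4%, median: 90.75%
```

With eight entries, the median is the mean of the fourth and fifth scores, (92.5 + 89.0) / 2 = 90.75, which matches the reported value.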