HellaSwag
Category: Reasoning
A commonsense natural language inference benchmark testing whether models can predict the most plausible continuation of a scenario. Incorrect endings are produced by adversarial filtering, so they fool language models while remaining easy for humans.
Models Tested: 5
Best Score: 95.4%
Median Score: 92.5%
Scoring: accuracy
Introduced: 2019-05
Maintainer: AI2 / University of Washington
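Since scoring is plain accuracy over four candidate endings, the evaluation loop is simple: the model assigns a plausibility score to each ending, and an example counts as correct when the top-scoring ending matches the gold label. A minimal sketch (the data and scoring function below are hypothetical stand-ins, not the official harness):

```python
# Sketch of HellaSwag-style multiple-choice accuracy scoring.
# Each example: a context, four candidate endings, and a gold label index.

def pick_ending(scores):
    """Return the index of the highest-scoring ending."""
    return max(range(len(scores)), key=lambda i: scores[i])

def accuracy(examples, score_fn):
    """Fraction of examples where the top-scoring ending is the gold one."""
    correct = sum(
        pick_ending(score_fn(ex["ctx"], ex["endings"])) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)

# Toy stand-in for a model: longer endings score higher (illustrative only;
# a real harness would use per-ending log-likelihood under the LM).
toy_score = lambda ctx, endings: [len(e) for e in endings]

examples = [
    {"ctx": "She cracked the eggs",
     "endings": ["and whisked them together", "x", "y", "z"], "label": 0},
    {"ctx": "He laced up his shoes",
     "endings": ["a", "and went for a run", "b", "c"], "label": 1},
]
print(accuracy(examples, toy_score))  # 1.0 on this toy data
```

In practice, harnesses score each ending by the model's (length-normalized) log-likelihood of the ending given the context, then take the argmax exactly as above.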
Leaderboard (5 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude 3 Opus | Anthropic | 95.4% |
| 🥈 | GPT-4 | OpenAI | 95.3% |
| 🥉 | Gemini 1.5 Pro | Google DeepMind | 92.5% |
| 4 | Claude 3.5 Sonnet | Anthropic | 89.0% |
| 5 | Gemini 1.0 Ultra | Google DeepMind | 87.8% |