Models Tested: 8
Best Score: 95.4%
Median Score: 90.75%
Scoring: accuracy
Introduced: 2019-05
Maintainer: AI2 / University of Washington
Leaderboard (8 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Claude 3 Opus | Anthropic | 95.4% |
| 🥈 | GPT-4 | OpenAI | 95.3% |
| 🥉 | GPT | OpenAI | 95.0% |
| 4 | Gemini 1.5 Pro | Google DeepMind | 92.5% |
| 5 | Claude 3.5 Sonnet | Anthropic | 89% |
| 6 | Gemini 1.0 Ultra | Google DeepMind | 87.8% |
| 7 | GPT-3.5 Turbo | OpenAI | 85.5% |
| 8 | Mistral | Mistral AI | 84.0% |
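
The summary figures can be checked against the table rows. The sketch below is only an illustration: it assumes the median is the ordinary statistical median of the eight listed accuracy scores (with an even count, the mean of the two middle values), and the `scores` dictionary is a hand-copied transcription of the table, not an official data export.

```python
from statistics import median

# Accuracy scores in percent, transcribed by hand from the leaderboard table.
# Model names are kept exactly as the table lists them.
scores = {
    "Claude 3 Opus": 95.4,
    "GPT-4": 95.3,
    "GPT": 95.0,
    "Gemini 1.5 Pro": 92.5,
    "Claude 3.5 Sonnet": 89.0,
    "Gemini 1.0 Ultra": 87.8,
    "GPT-3.5 Turbo": 85.5,
    "Mistral": 84.0,
}

# Best score: the highest accuracy among the listed models.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# Median score: with eight entries this averages the 4th and 5th ranked scores.
median_score = median(scores.values())

print(f"Models tested: {len(scores)}")
print(f"Best score: {best_score}% ({best_model})")
print(f"Median score: {median_score}%")
```

Under that assumption the two middle scores are 92.5% and 89%, whose mean is 90.75%, which matches the reported median. The best score of 95.4% matches the top row.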