Models Tested: 4
Best Score: 92.2%
Median Score: 89.35%
Scoring: accuracy
Introduced: 2019-03
Maintainer: AI2
Leaderboard (4 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | DeepSeek R1 | DeepSeek | 92.2% |
| 🥈 | DeepSeek Models | DeepSeek | 91.6% |
| 🥉 | Claude 3.5 Sonnet | Anthropic | 87.1% |
| 4 | GPT-3.5 Turbo | OpenAI | 61.4% |
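
The summary figures can be rechecked from the table. The sketch below assumes the median is the ordinary statistical median, which for four entries is the mean of the two middle scores; the variable names are illustrative and not part of the leaderboard.

```python
from statistics import median

# Accuracy scores (percent) copied from the leaderboard table.
scores = {
    "DeepSeek R1": 92.2,
    "DeepSeek Models": 91.6,
    "Claude 3.5 Sonnet": 87.1,
    "GPT-3.5 Turbo": 61.4,
}

# Best score: the highest accuracy among the listed models.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# Median: with an even number of entries, statistics.median
# averages the two middle values (here 87.1 and 91.6).
median_score = median(scores.values())

print(f"Models tested: {len(scores)}")
print(f"Best: {best_model} at {best_score:.1f}%")
print(f"Median: {median_score:.2f}%")
```

Under that assumption the median is (87.1 + 91.6) / 2 = 89.35, which matches the reported value.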