Models Tested: 4
Best Score: 73.9%
Median Score: 70.4%
Scoring: accuracy
Introduced: 2023-10
Maintainer: UCLA / Microsoft Research
Leaderboard (4 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | o1 | OpenAI | 73.9% |
| 🥈 | GPT-4.1 mini | OpenAI | 73.1% |
| 🥉 | Claude 3.5 Sonnet | Anthropic | 67.7% |
| 4 | GPT-4o | OpenAI | 63.8% |
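
The summary figures (models tested, best score, median score) appear to follow directly from the four table rows. As a minimal sketch, assuming the median is the plain average of the two middle scores and is rounded to one decimal place, they can be recomputed as below. The variable names are illustrative and not part of any official tooling for this leaderboard.

```python
from statistics import median

# Scores (accuracy, in percent) copied from the leaderboard table above.
scores = {
    "o1": 73.9,
    "GPT-4.1 mini": 73.1,
    "Claude 3.5 Sonnet": 67.7,
    "GPT-4o": 63.8,
}

models_tested = len(scores)
best_score = max(scores.values())

# With an even number of entries, median() averages the two middle values
# (73.1 and 67.7 here). Rounding to one decimal is an assumption about how
# the page displays the figure.
median_score = round(median(scores.values()), 1)

print(models_tested, best_score, median_score)
```

Under these assumptions the recomputed values match the reported figures of 4 models, a best score of 73.9%, and a median of 70.4%.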