Models Tested
9
Best Score
44.4
Median Score
11
Scoring: accuracy
Introduced: 2025-01
Maintainer: Scale AI / Center for AI Safety
Leaderboard (9 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | Gemini | Google DeepMind | 44.4 |
| 🥈 | Claude Opus 4.5 | Anthropic | 43.2 |
| 🥉 | o3 | OpenAI | 19.8% |
| 4 | Gemini 2.5 Pro | Google DeepMind | 18.8% |
| 5 | Gemini 2.5 Flash | Google DeepMind | 11% |
| 6 | DeepSeek R1 | DeepSeek | 8.6% |
| 7 | o1 | OpenAI | 8% |
| 8 | Claude 3.5 Sonnet | Anthropic | 4.1% |
| 9 | GPT-4o | OpenAI | 3.3% |