| Metric | Value | Note |
|---|---|---|
| Verified Correct | 65 | 97% of checked |
| Can't Verify | 2 | 3% of checked |
| Not Yet Checked | 0 | of 67 total |
| Accuracy Rate | 100% | confirmed / (confirmed + wrong + outdated) |
| Needs Recheck | 0 | All up to date |
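
The Accuracy Rate formula, confirmed / (confirmed + wrong + outdated), can be sketched in Python as below. The function and variable names are hypothetical, and the zero counts for wrong and outdated are an inference from the totals (65 + 2 + 0 = 67), not values stated on the page.

```python
from typing import Optional


def accuracy_rate(confirmed: int, wrong: int, outdated: int) -> Optional[float]:
    """confirmed / (confirmed + wrong + outdated); None if nothing was decided."""
    decided = confirmed + wrong + outdated
    if decided == 0:
        return None  # avoid dividing by zero when no verdicts exist
    return confirmed / decided


def percent_of(count: int, total: int) -> int:
    """Share of `total` as a whole-number percentage."""
    return round(100 * count / total)


confirmed, cannot_verify = 65, 2
wrong = outdated = 0  # assumption: inferred from 65 + 2 + 0 = 67 total
checked = confirmed + cannot_verify + wrong + outdated

print(accuracy_rate(confirmed, wrong, outdated))  # 1.0
print(percent_of(confirmed, checked), percent_of(cannot_verify, checked))  # 97 3
```

Under this reading, "Can't Verify" claims count toward the checked total but not toward the accuracy denominator, which would explain how 97% verified and a 100% accuracy rate appear side by side.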
| Type | Entity | Claim | Verdict | Confidence | Sources | Last Checked |
|---|---|---|---|---|---|---|
| Benchmark Result | Gemini | sid_PaKhQQNPkg / SWE-bench Verified: 80.6 | confirmed | 98% | 1 | Apr 24, 2026 |
| Benchmark Result | Gemini | sid_PaKhQQNPkg / ARC-AGI-2: 77.1 | confirmed | 98% | 1 | Apr 24, 2026 |
| Benchmark Result | Gemini | sid_PaKhQQNPkg / GPQA Diamond: 94.3 | confirmed | 98% | 1 | Apr 24, 2026 |
| Benchmark Result | Gemini | sid_PaKhQQNPkg / MMLU-Pro: 90.99 | confirmed | 98% | 1 | Apr 24, 2026 |
| Benchmark Result | DeepSeek Models | sid_svlbcrT5oQ / BBH: 87.5 | confirmed | 99% | 1 | Apr 24, 2026 |
| Benchmark Result | DeepSeek Models | sid_svlbcrT5oQ / GPQA Diamond: 59.1 | confirmed | 99% | 1 | Apr 24, 2026 |
| Benchmark Result | DeepSeek Models | sid_svlbcrT5oQ / DROP: 91.6 | confirmed | 99% | 1 | Apr 24, 2026 |
| Benchmark Result | DeepSeek Models | sid_svlbcrT5oQ / MMLU-Pro: 75.9 | confirmed | 95% | 1 | Apr 24, 2026 |
| Benchmark Result | DeepSeek Models | sid_svlbcrT5oQ / HumanEval: 65.2 | confirmed | 99% | 1 | Apr 24, 2026 |
| Benchmark Result | DeepSeek Models | sid_svlbcrT5oQ / GSM8K: 89.3 | confirmed | 99% | 1 | Apr 24, 2026 |
| Benchmark Result | DeepSeek Models | sid_svlbcrT5oQ / MATH: 61.6 | confirmed | 99% | 1 | Apr 24, 2026 |
| Benchmark Result | DeepSeek Models | sid_svlbcrT5oQ / MMLU: 88.5 | confirmed | 95% | 1 | Apr 24, 2026 |
| Benchmark Result | Claude Opus 4.1 | sid_dHgSM46fMw / SWE-bench Verified: 74.5 | confirmed | 99% | 1 | Apr 24, 2026 |
| Benchmark Result | Claude Opus 4.1 | sid_dHgSM46fMw / GPQA Diamond: 80.9 | confirmed | 95% | 1 | Apr 24, 2026 |
| Benchmark Result | Claude Haiku 4.5 | sid_y87VxEBBIA / SWE-bench Verified: 73.3 | confirmed | 98% | 1 | Apr 24, 2026 |
Data from the source_check_verdicts table.