Longterm Wiki

All Source Checks

Automated source checking of wiki data against original sources. Each record is checked against one or more external sources to confirm accuracy.


Verified Correct: 65 (97% of checked)

Has Issues: 0 (0% of checked)

Can't Verify: 2 (3% of checked)

Not Yet Checked: 0 of 67 total

Contradicted: 0 (none found)

Outdated: 0 (all current)

Accuracy Rate: 100%, computed as confirmed / (confirmed + wrong + outdated)

Needs Recheck: 0 (all up to date)
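
The summary figures above can be reproduced with a little arithmetic. The following is a minimal sketch, not the dashboard's actual implementation: the function names `accuracy_rate` and `percent_of`, the whole-percent rounding, the handling of an empty denominator, and the assumption that "checked" means confirmed + wrong + outdated + can't verify are all illustrative choices. Only the accuracy formula itself comes from the page.

```python
from typing import Optional


def accuracy_rate(confirmed: int, wrong: int, outdated: int) -> Optional[float]:
    """Return confirmed / (confirmed + wrong + outdated) as a fraction.

    Records that could not be verified are excluded from the denominator,
    as in the formula shown on the page. Returning None when no record has
    a decided verdict is an assumption of this sketch.
    """
    decided = confirmed + wrong + outdated
    if decided == 0:
        return None
    return confirmed / decided


def percent_of(count: int, total: int) -> int:
    """Share of `count` in `total`, rounded to a whole percent (assumed rounding)."""
    return round(100 * count / total) if total else 0


# Counts taken from the summary cards on this page.
confirmed, wrong, outdated, cant_verify = 65, 0, 0, 2

# Assumption: "checked" covers every record that received any verdict.
checked = confirmed + wrong + outdated + cant_verify

print(percent_of(confirmed, checked))            # share shown as "Verified Correct"
print(percent_of(cant_verify, checked))          # share shown as "Can't Verify"
print(accuracy_rate(confirmed, wrong, outdated)) # fraction behind "Accuracy Rate"
```

Because unverifiable records are left out of the denominator, the accuracy rate can read 100% even though only 97% of checked records were confirmed.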

65 results
Benchmark Result · confirmed · sid_bFjrDfX8rQ / GSM8K: 57.1 · Apr 29, 2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / DROP: 61.4 · Apr 29, 2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / HellaSwag: 85.5 · Apr 29, 2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / TruthfulQA: 47 · Apr 29, 2026
Benchmark Result · confirmed · sid_bFjrDfX8rQ / WinoGrande: 81.6 · Apr 29, 2026
Benchmark Result · confirmed · sid_oSG59ppF7g / MMLU: 80.1 · Apr 29, 2026
Benchmark Result · confirmed · sid_oSG59ppF7g / Aider Polyglot: 9.8 · Apr 29, 2026
Benchmark Result · confirmed · sid_kWPQCvjKSg / MMLU: 87.3 · Apr 29, 2026
Benchmark Result · confirmed · sid_kWPQCvjKSg / HumanEval: 89 · Apr 29, 2026
Benchmark Result · confirmed · sid_kWPQCvjKSg / MATH: 73.8 · Apr 29, 2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / GSM8K: 95 · Apr 29, 2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / HumanEval: 92 · Apr 29, 2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / MMLU-Pro: 89.5 · Apr 29, 2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / SimpleQA: 36 · Apr 29, 2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / LiveCodeBench: 70.3 · Apr 29, 2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / MGSM: 92.5 · Apr 29, 2026
Benchmark Result · confirmed · sid_tppPAkJqjQ / Humanity's Last Exam: 43.2 · Apr 29, 2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / MMLU: 92.1 · Apr 29, 2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / HumanEval: 95.4 · Apr 29, 2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / BrowseComp: 84 · Apr 29, 2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / MMMU: 76.5 · Apr 29, 2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / GSM8K: 98.4 · Apr 29, 2026
Benchmark Result · confirmed · sid_Ac7c55KtVw / IFEval: 91.2 · Apr 29, 2026
Benchmark Result · confirmed · sid_ePVee3jidQ / MMMU: 69.1 · Claude 3.7 Sonnet · Apr 24, 2026
Benchmark Result · confirmed · sid_ePVee3jidQ / LiveCodeBench: 65.4 · Claude 3.7 Sonnet · Apr 24, 2026
Benchmark Result · confirmed · sid_ePVee3jidQ / GSM8K: 96.4 · Claude 3.7 Sonnet · Apr 24, 2026
Benchmark Result · confirmed · sid_ePVee3jidQ / MMLU-Pro: 78.4 · Claude 3.7 Sonnet · Apr 24, 2026
Benchmark Result · confirmed · sid_ePVee3jidQ / HumanEval: 94 · Claude 3.7 Sonnet · Apr 24, 2026
Benchmark Result · confirmed · sid_ISfAiImMYg / SWE-bench Verified: 49 · Claude 3.5 Sonnet · Apr 24, 2026
Benchmark Result · confirmed · sid_ISfAiImMYg / GSM8K: 96.4 · Claude 3.5 Sonnet · Apr 24, 2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / HumanEval: 30.5 · Mistral · Apr 24, 2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / GSM8K: 40.3 · Mistral · Apr 24, 2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / HellaSwag: 84 · Mistral · Apr 24, 2026
Benchmark Result · confirmed · sid_v1e1ZwDwoA / MMLU: 60.1 · Mistral · Apr 24, 2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / LiveCodeBench: 79.4 · Grok · Apr 24, 2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / Chatbot Arena Elo: 1402 · Grok · Apr 24, 2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / HumanEval: 86.5 · Grok · Apr 24, 2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / GSM8K: 89.3 · Grok · Apr 24, 2026
Benchmark Result · confirmed · sid_nnv09Wl5OQ / MMLU-Pro: 79.9 · Grok · Apr 24, 2026
Benchmark Result · confirmed · sid_nywmt9QdsA / MMLU: 80.1 · GPT-4.1 mini · Apr 24, 2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / HellaSwag: 95 · GPT · Apr 24, 2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / GSM8K: 92 · GPT · Apr 24, 2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / MATH: 76.6 · GPT · Apr 24, 2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / MGSM: 90.5 · GPT · Apr 24, 2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / HumanEval: 90.2 · GPT · Apr 24, 2026
Benchmark Result · confirmed · sid_Gqv7h9oEwA / MMLU: 88.7 · GPT · Apr 24, 2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / MATH: 78.3 · Gemini · Apr 24, 2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / HumanEval: 89.7 · Gemini · Apr 24, 2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / MMLU: 92.4 · Gemini · Apr 24, 2026
Benchmark Result · confirmed · sid_PaKhQQNPkg / Humanity's Last Exam: 44.4 · Gemini · Apr 24, 2026
Showing 1–50 of 65 (page 1 of 2)

Data from the source_check_verdicts table.