Rating System
LongtermWiki uses a multi-dimensional rating system combining LLM-graded subscores with automated metrics to produce a derived quality score (0-100).
Quick Reference
# Grade a single page
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --page scheming
# Grade all pages (with cost estimate)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --dry-run
# Grade and apply to frontmatter
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --apply
# Parallel processing (faster, higher API cost)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --parallel 5 --apply
Score Components
1. Importance (0-100)
How significant is this page for AI risk prioritization work?
| Range | Description | Expected Count |
|---|---|---|
| 90-100 | Essential for prioritization decisions | 5-10 pages |
| 70-89 | High value for practitioners | 30-50 pages |
| 50-69 | Useful context | 80-100 pages |
| 30-49 | Reference material | 60-80 pages |
| 0-29 | Peripheral | 30-50 pages |
Category adjustments are applied to the base assessment (a sketch of the adjustment follows this list):
- Responses/interventions: +10
- Capabilities: +5
- Core risks: +5
- Risk factors: 0
- Models/analysis: -5
- Arguments/debates: -10
- People/organizations: -15
- Internal/infrastructure: -30
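A minimal sketch of how this adjustment could be applied; the category keys and function below are hypothetical and not taken from the script:

```js
// Hypothetical sketch: add the category adjustment to the LLM's base
// importance assessment and clamp the result to the 0-100 scale.
// Adjustment values mirror the list above; key names are illustrative.
const CATEGORY_ADJUSTMENTS = {
  responses: 10,
  capabilities: 5,
  risks: 5,
  'risk-factors': 0,
  models: -5,
  arguments: -10,
  people: -15,
  internal: -30,
};

function adjustImportance(baseImportance, category) {
  const adjustment = CATEGORY_ADJUSTMENTS[category] ?? 0;
  return Math.min(100, Math.max(0, baseImportance + adjustment));
}

// Example: a responses/interventions page assessed at 72 becomes 82.
adjustImportance(72, 'responses'); // 82
```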
2. Quality Subscores (0-10 each)
Scoring is harsh: a 7 is exceptional, 8+ is world-class. Most content should score 3-5.
Novelty
How original is the content beyond its sources?
| Score | Meaning |
|---|---|
| 9-10 | Groundbreaking original research (academic publication level) |
| 7-8 | Significant original synthesis not found elsewhere |
| 5-6 | Some original framing, modest value beyond sources |
| 3-4 | Accurate summary with minimal original perspective |
| 1-2 | Mostly restates common knowledge |
Rigor
How well-evidenced and precise are the claims?
| Score | Meaning |
|---|---|
| 9-10 | Every claim sourced to primary sources, quantified with uncertainty |
| 7-8 | Nearly all claims well-sourced and quantified |
| 5-6 | Most major claims sourced, some quantification |
| 3-4 | Mix of sourced and unsourced, vague claims common |
| 1-2 | Few sources, mostly assertions |
Actionability
How useful is this for making decisions?
| Score | Meaning |
|---|---|
| 9-10 | Specific decision procedures with quantified tradeoffs |
| 7-8 | Clear concrete recommendations with supporting analysis |
| 5-6 | Some actionable takeaways, general guidance |
| 3-4 | Mostly abstract, implications unclear |
| 1-2 | Purely descriptive, no practical application |
Completeness
How comprehensive is the coverage?
| Score | Meaning |
|---|---|
| 9-10 | Exhaustive authoritative reference (textbook-level) |
| 7-8 | Covers all major aspects thoroughly with depth |
| 5-6 | Covers main points, some gaps |
| 3-4 | Notable gaps, missing important aspects |
| 1-2 | Very incomplete, barely started |
Objectivity
Epistemic honesty, language neutrality, and analytical (not prescriptive) tone. See Common Writing Principles for full guidance.
| Score | Meaning |
|---|---|
| 9-10 | Every uncertain claim hedged with ranges; fully accessible to outsiders; presents tradeoffs without advocating |
| 7-8 | Nearly all estimates include ranges; no insider jargon; analytical throughout; honest counter-arguments |
| 5-6 | Mostly neutral language; some uncertainty acknowledged; mostly analytical |
| 3-4 | Uses insider jargon ("EA money"); presents estimates as facts ("True Cost"); one-sided framing |
| 1-2 | Heavy insider language; false certainty; reads as advocacy not analysis |
Concreteness
How specific vs. abstract?
| Score | Meaning |
|---|---|
| 9-10 | Specific numbers, examples, recommendations throughout |
| 7-8 | Mostly concrete with specific details |
| 5-6 | Mix of concrete and abstract |
| 3-4 | Mostly abstract, vague generalities |
| 1-2 | Almost entirely abstract hand-waving |
3. Automated Metrics
These are computed directly from content, not LLM-graded:
| Metric | What It Measures | How Computed |
|---|---|---|
| wordCount | Prose words (excluding tables) | Strip tables, code blocks, imports, components |
| citations | External sources | Count `<R id=...>` + markdown links `[](https://...)` |
| tables | Data tables | Count `\|---\|` patterns |
| diagrams | Visual elements | Count `<Mermaid>` + images |
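A minimal, self-contained sketch of how these counts could be derived from a page's raw MDX; the regexes below are illustrative assumptions, not the script's actual implementation:

```js
// Approximate sketch of the automated metrics. grade-content.mjs may
// tokenize differently; these regexes are assumptions for illustration.
function computeMetrics(markdown) {
  // Strip frontmatter, code blocks, imports, components, and table rows
  // before counting prose words.
  const prose = markdown
    .replace(/^---[\s\S]*?---/, '')   // frontmatter
    .replace(/```[\s\S]*?```/g, '')   // fenced code blocks
    .replace(/^import .*$/gm, '')     // MDX imports
    .replace(/<[^>]+>/g, '')          // JSX/MDX components
    .replace(/^\|.*\|$/gm, '');       // table rows

  const wordCount = prose.split(/\s+/).filter(Boolean).length;
  const citations =
    (markdown.match(/<R id=/g) || []).length +
    (markdown.match(/\]\(https?:\/\//g) || []).length;
  const tables = (markdown.match(/\|---\|/g) || []).length;
  const diagrams =
    (markdown.match(/<Mermaid/g) || []).length +
    (markdown.match(/!\[/g) || []).length;

  return { wordCount, citations, tables, diagrams };
}
```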
4. Derived Quality Score (0-100)
quality = (avgSubscore × 8) + lengthBonus + evidenceBonus
Where:
- avgSubscore = weighted average of all 7 dimensions (focus, novelty, rigor, completeness, objectivity, concreteness, actionability) → contributes 0-80. Weights vary by content type: analysis pages weight focus, novelty, objectivity, and concreteness higher; reference pages weight rigor and completeness higher.
- lengthBonus = min(8, wordCount / 600) → contributes 0-8
- evidenceBonus = min(7, citations × 0.35) → contributes 0-7
Subscores are the primary driver (~80% of score). Bonuses reward depth but can't compensate for weak content.
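As a concrete sketch, the combination can be expressed as a small function. The equal weights in the example are illustrative placeholders; the real per-dimension weights vary by content type as described above.

```js
// Sketch of the derived quality formula. The weights object is a
// placeholder; grade-content.mjs applies different weights per content type.
function deriveQuality(subscores, weights, { wordCount, citations }) {
  const dims = Object.keys(subscores);
  const totalWeight = dims.reduce((sum, d) => sum + weights[d], 0);
  const avgSubscore =
    dims.reduce((sum, d) => sum + subscores[d] * weights[d], 0) / totalWeight;

  const lengthBonus = Math.min(8, wordCount / 600);    // contributes 0-8
  const evidenceBonus = Math.min(7, citations * 0.35); // contributes 0-7

  return Math.round(avgSubscore * 8 + lengthBonus + evidenceBonus);
}

// Illustrative example with equal weights:
const subscores = { focus: 7, novelty: 5, rigor: 6, completeness: 6,
                    objectivity: 6, concreteness: 5, actionability: 5 };
const weights = Object.fromEntries(Object.keys(subscores).map((d) => [d, 1]));
deriveQuality(subscores, weights, { wordCount: 3000, citations: 20 });
// → 58 (avg 5.71 → 45.7, lengthBonus 5, evidenceBonus 7)
```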
| Quality Range | Label | Meaning |
|---|---|---|
| 80-100 | Comprehensive | Fully developed, authoritative |
| 60-79 | Good | Solid content, minor gaps |
| 40-59 | Adequate | Useful but needs work |
| 20-39 | Draft | Early stage, significant gaps |
| 0-19 | Stub | Placeholder only |
Frontmatter Schema
After grading, pages have this frontmatter structure:
---
title: "Page Title"
description: "Executive summary with methodology AND conclusions"
quality: 65 # Derived 0-100
importance: 75 # LLM-assessed 0-100
lastEdited: "2026-01-28"
ratings:
  focus: 6.0 # 0-10 scale
  novelty: 4.5
  rigor: 5.2
  completeness: 5.0
  objectivity: 5.5
  concreteness: 4.8
  actionability: 4.8
llmSummary: "This page analyzes X using Y methodology. It finds that Z with N% probability."
---
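For downstream tooling, these graded fields can be read back out with any frontmatter parser. The snippet below assumes gray-matter and a hypothetical file path:

```js
// Sketch only: reads graded frontmatter with gray-matter (an assumption;
// the actual scripts may use a different parser). The path is hypothetical.
import fs from 'node:fs';
import matter from 'gray-matter';

const file = fs.readFileSync('content/pages/scheming.mdx', 'utf8');
const { data } = matter(file);

console.log(data.quality, data.importance);   // e.g. 65 75
const lowest = Object.entries(data.ratings ?? {})
  .sort((a, b) => a[1] - b[1])[0];
console.log('weakest dimension:', lowest);     // e.g. ['novelty', 4.5]
```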
Script Options
node scripts/content/grade-content.mjs [options]
Options:
--page ID Grade single page by ID or partial match
--dry-run Preview without API calls
--limit N Process only N pages
--parallel N Concurrent API requests (default: 1)
--category X Filter by category (models, risks, responses)
--skip-graded Skip pages with existing importance
--output FILE JSON output path (default: .claude/temp/grades-output.json)
--apply Write grades directly to frontmatter
Cost Estimates
| Scenario | Input Tokens | Output Tokens | Cost |
|---|---|---|---|
| Single page | ≈4K | ≈200 | ≈$1.05 |
| All 300 pages | ≈1.2M | ≈60K | ≈$15 |
| 10 pages parallel | ≈40K | ≈2K | ≈$1.50 |
Validation
Pages are validated against quality criteria based on their type:
npm run crux -- validate templates # Template structure
npm run crux -- validate unified --rules=placeholders # Incomplete content
See Page Types for which pages are validated.
Related Documentation
- Common Writing Principles - Cross-cutting writing standards (objectivity dimension)
- Models Style Guide - Requirements for analytical model pages
- Risk Style Guide - Requirements for risk analysis pages
- Response Style Guide - Requirements for intervention pages
- Page Types - How page types affect validation