Rating System
LongtermWiki uses a multi-dimensional rating system combining LLM-graded subscores with automated metrics to produce a derived quality score (0-100).
Quick Reference
# Grade a single page
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --page scheming
# Grade all pages (with cost estimate)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --dry-run
# Grade and apply to frontmatter
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --apply
# Parallel processing (faster, higher API cost)
ANTHROPIC_API_KEY=sk-... node scripts/content/grade-content.mjs --parallel 5 --apply
Score Components
1. Importance (0-100)
How significant is this page for AI risk prioritization work?
| Range | Description | Expected Count |
|---|---|---|
| 90-100 | Essential for prioritization decisions | 5-10 pages |
| 70-89 | High value for practitioners | 30-50 pages |
| 50-69 | Useful context | 80-100 pages |
| 30-49 | Reference material | 60-80 pages |
| 0-29 | Peripheral | 30-50 pages |
Category adjustments are applied to the base assessment (a sketch of the adjustment follows this list):
- Responses/interventions: +10
- Capabilities: +5
- Core risks: +5
- Risk factors: 0
- Models/analysis: -5
- Arguments/debates: -10
- People/organizations: -15
- Internal/infrastructure: -30
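A minimal sketch of how this adjustment could be applied; the category keys and function below are hypothetical and not taken from the script:

```js
// Hypothetical sketch: add the category adjustment to the LLM's base
// importance assessment and clamp the result to the 0-100 scale.
// Adjustment values mirror the list above; key names are illustrative.
const CATEGORY_ADJUSTMENTS = {
  responses: 10,
  capabilities: 5,
  risks: 5,
  'risk-factors': 0,
  models: -5,
  arguments: -10,
  people: -15,
  internal: -30,
};

function adjustImportance(baseImportance, category) {
  const adjustment = CATEGORY_ADJUSTMENTS[category] ?? 0;
  return Math.min(100, Math.max(0, baseImportance + adjustment));
}

// Example: a responses/interventions page assessed at 72 becomes 82.
adjustImportance(72, 'responses'); // 82
```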
2. Quality Subscores (0-10 each)
Scoring is harsh: a 7 is exceptional, 8+ is world-class. Most content should score 3-5.
Novelty
How original is the content beyond its sources?
| Score | Meaning |
|---|---|
| 9-10 | Groundbreaking original research (academic publication level) |
| 7-8 | Significant original synthesis not found elsewhere |
| 5-6 | Some original framing, modest value beyond sources |
| 3-4 | Accurate summary with minimal original perspective |
| 1-2 | Mostly restates common knowledge |
Rigor
How well-evidenced and precise are the claims?
| Score | Meaning |
|---|---|
| 9-10 | Every claim sourced to primary sources, quantified with uncertainty |
| 7-8 | Nearly all claims well-sourced and quantified |
| 5-6 | Most major claims sourced, some quantification |
| 3-4 | Mix of sourced and unsourced, vague claims common |
| 1-2 | Few sources, mostly assertions |
Actionability
How useful is this for making decisions?
| Score | Meaning |
|---|---|
| 9-10 | Specific decision procedures with quantified tradeoffs |
| 7-8 | Clear concrete recommendations with supporting analysis |
| 5-6 | Some actionable takeaways, general guidance |
| 3-4 | Mostly abstract, implications unclear |
| 1-2 | Purely descriptive, no practical application |
Completeness
How comprehensive is the coverage?
| Score | Meaning |
|---|---|
| 9-10 | Exhaustive authoritative reference (textbook-level) |
| 7-8 | Covers all major aspects thoroughly with depth |
| 5-6 | Covers main points, some gaps |
| 3-4 | Notable gaps, missing important aspects |
| 1-2 | Very incomplete, barely started |
Objectivity
Epistemic honesty, language neutrality, and analytical (not prescriptive) tone. See Common Writing Principles for full guidance.
| Score | Meaning |
|---|---|
| 9-10 | Every uncertain claim hedged with ranges; fully accessible to outsiders; presents tradeoffs without advocating |
| 7-8 | Nearly all estimates include ranges; no insider jargon; analytical throughout; honest counter-arguments |
| 5-6 | Mostly neutral language; some uncertainty acknowledged; mostly analytical |
| 3-4 | Uses insider jargon ("EA money"); presents estimates as facts ("True Cost"); one-sided framing |
| 1-2 | Heavy insider language; false certainty; reads as advocacy not analysis |
Concreteness
How specific vs. abstract?
| Score | Meaning |
|---|---|
| 9-10 | Specific numbers, examples, recommendations throughout |
| 7-8 | Mostly concrete with specific details |
| 5-6 | Mix of concrete and abstract |
| 3-4 | Mostly abstract, vague generalities |
| 1-2 | Almost entirely abstract hand-waving |
3. Automated Metrics
These are computed directly from content, not LLM-graded:
| Metric | What It Measures | How Computed |
|---|---|---|
| wordCount | Prose words (excluding tables) | Strip tables, code blocks, imports, components |
| citations | External sources | Count `<R id=...>` + markdown links `[](https://...)` |
| tables | Data tables | Count `\|---\|` patterns |
| diagrams | Visual elements | Count `<Mermaid>` + images |
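A minimal, self-contained sketch of how these counts could be derived from a page's raw MDX; the regexes below are illustrative assumptions, not the script's actual implementation:

```js
// Approximate sketch of the automated metrics. grade-content.mjs may
// tokenize differently; these regexes are assumptions for illustration.
function computeMetrics(markdown) {
  // Strip frontmatter, code blocks, imports, components, and table rows
  // before counting prose words.
  const prose = markdown
    .replace(/^---[\s\S]*?---/, '')   // frontmatter
    .replace(/```[\s\S]*?```/g, '')   // fenced code blocks
    .replace(/^import .*$/gm, '')     // MDX imports
    .replace(/<[^>]+>/g, '')          // JSX/MDX components
    .replace(/^\|.*\|$/gm, '');       // table rows

  const wordCount = prose.split(/\s+/).filter(Boolean).length;
  const citations =
    (markdown.match(/<R id=/g) || []).length +
    (markdown.match(/\]\(https?:\/\//g) || []).length;
  const tables = (markdown.match(/\|---\|/g) || []).length;
  const diagrams =
    (markdown.match(/<Mermaid/g) || []).length +
    (markdown.match(/!\[/g) || []).length;

  return { wordCount, citations, tables, diagrams };
}
```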
4. Derived Quality Score (0-100)
quality = (avgSubscore × 8) + lengthBonus + evidenceBonus
Where:
- avgSubscore = weighted average of all 7 dimensions (focus, novelty, rigor, completeness, objectivity, concreteness, actionability) → contributes 0-80. Weights vary by content type: analysis pages weight focus, novelty, objectivity, and concreteness higher; reference pages weight rigor and completeness higher.
- lengthBonus = min(8, wordCount / 600) → contributes 0-8
- evidenceBonus = min(7, citations × 0.35) → contributes 0-7
Subscores are the primary driver (~80% of score). Bonuses reward depth but can't compensate for weak content.
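As a concrete sketch, the combination can be expressed as a small function. The equal weights in the example are illustrative placeholders; the real per-dimension weights vary by content type as described above.

```js
// Sketch of the derived quality formula. The weights object is a
// placeholder; grade-content.mjs applies different weights per content type.
function deriveQuality(subscores, weights, { wordCount, citations }) {
  const dims = Object.keys(subscores);
  const totalWeight = dims.reduce((sum, d) => sum + weights[d], 0);
  const avgSubscore =
    dims.reduce((sum, d) => sum + subscores[d] * weights[d], 0) / totalWeight;

  const lengthBonus = Math.min(8, wordCount / 600);    // contributes 0-8
  const evidenceBonus = Math.min(7, citations * 0.35); // contributes 0-7

  return Math.round(avgSubscore * 8 + lengthBonus + evidenceBonus);
}

// Illustrative example with equal weights:
const subscores = { focus: 7, novelty: 5, rigor: 6, completeness: 6,
                    objectivity: 6, concreteness: 5, actionability: 5 };
const weights = Object.fromEntries(Object.keys(subscores).map((d) => [d, 1]));
deriveQuality(subscores, weights, { wordCount: 3000, citations: 20 });
// → 58 (avg 5.71 → 45.7, lengthBonus 5, evidenceBonus 7)
```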
| Quality Range | Label | Meaning |
|---|---|---|
| 80-100 | Comprehensive | Fully developed, authoritative |
| 60-79 | Good | Solid content, minor gaps |
| 40-59 | Adequate | Useful but needs work |
| 20-39 | Draft | Early stage, significant gaps |
| 0-19 | Stub | Placeholder only |
Frontmatter Schema
After grading, pages have this frontmatter structure:
---
title: "Page Title"
description: "Executive summary with methodology AND conclusions"
quality: 65 # Derived 0-100
importance: 75 # LLM-assessed 0-100
lastEdited: "2026-01-28"
ratings:
  focus: 6.0 # 0-10 scale
  novelty: 4.5
  rigor: 5.2
  completeness: 5.0
  objectivity: 5.5
  concreteness: 4.8
  actionability: 4.8
llmSummary: "This page analyzes X using Y methodology. It finds that Z with N% probability."
---
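For downstream tooling, these graded fields can be read back out with any frontmatter parser. The snippet below assumes gray-matter and a hypothetical file path:

```js
// Sketch only: reads graded frontmatter with gray-matter (an assumption;
// the actual scripts may use a different parser). The path is hypothetical.
import fs from 'node:fs';
import matter from 'gray-matter';

const file = fs.readFileSync('content/pages/scheming.mdx', 'utf8');
const { data } = matter(file);

console.log(data.quality, data.importance);   // e.g. 65 75
const lowest = Object.entries(data.ratings ?? {})
  .sort((a, b) => a[1] - b[1])[0];
console.log('weakest dimension:', lowest);     // e.g. ['novelty', 4.5]
```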
Script Options
node scripts/content/grade-content.mjs [options]
Options:
--page ID Grade single page by ID or partial match
--dry-run Preview without API calls
--limit N Process only N pages
--parallel N Concurrent API requests (default: 1)
--category X Filter by category (models, risks, responses)
--skip-graded Skip pages with existing importance
--output FILE JSON output path (default: .claude/temp/grades-output.json)
--apply Write grades directly to frontmatter
Cost Estimates
| Scenario | Input Tokens | Output Tokens | Cost |
|---|---|---|---|
| Single page | ≈4K | ≈200 | ≈$1.05 |
| All 300 pages | ≈1.2M | ≈60K | ≈$15 |
| 10 pages parallel | ≈40K | ≈2K | ≈$1.50 |
Validation
Pages are validated against quality criteria based on their type:
npm run crux -- validate templates # Template structure
npm run crux -- validate unified --rules=placeholders # Incomplete content
See Page Types for which pages are validated.
Related Documentation
- Common Writing Principles - Cross-cutting writing standards (objectivity dimension)
- Models Style Guide - Requirements for analytical model pages
- Risk Style Guide - Requirements for risk analysis pages
- Response Style Guide - Requirements for intervention pages
- Page Types - How page types affect validation