Longterm Wiki

Research-First Page Creation Pipeline

Executive Summary

| Finding | Result | Implication |
|---|---|---|
| Research-first works | All 3 test articles found Wikipedia, primary sources, AND critical perspectives | Front-loading research prevents hallucination |
| Citation discipline enforced | 42 inline citations per article (vs ≈40 poorly-sourced in original) | The "only use facts.json" rule eliminates unsourced claims |
| Tables dramatically reduced | 196 → 5 table rows (97% reduction) | Prose-first prompting produces readable content |
| Standard tier is optimal | $10.50 achieved the same quality (78) as $15 Premium | The review→gap-fill→polish cycle is worth the cost; the extra rewrite isn't |
| Budget tier has known gaps | Verify phase identifies issues but can't fix them | Good for drafts, not final articles |

Background

The Problem

Our original LessWrong article had 196 table rows and only 0.9 citations per 100 words. It was a data dump, not a readable article. Many factual claims had no sources.

Why Single-Pass Generation Fails

The standard approach of prompting an LLM to "write an article about X" fails because:

  1. Writing before researching - the LLM generates plausible-sounding content without verified sources
  2. No citation requirement - facts appear without URLs to back them
  3. Tables as a crutch - LLMs over-produce tables because they are easy to generate
  4. No verification - errors and gaps persist into the final output

The Solution: Research-First Pipeline

We built scripts/content/page-creator-v2.mjs with this structure:

Research → Extract → Synthesize → Review → Gap-fill → Polish

Key innovation: The synthesis phase receives only extracted facts with citations, not raw sources. The prompt explicitly says "If a fact isn't in facts.json, DO NOT include it."
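
As a rough sketch of that hand-off (the runPhase helper and prompt wording are illustrative assumptions, not the actual internals of page-creator-v2.mjs):

// Sketch: the synthesize phase reads facts.json, never sources.json (helper names are hypothetical)
import { readFile } from 'node:fs/promises';

async function synthesize(workDir, runPhase) {
  const facts = await readFile(`${workDir}/facts.json`, 'utf8');
  const prompt = [
    'Write the article using ONLY the facts below, with an inline citation for every claim.',
    "If a fact isn't in facts.json, DO NOT include it.",
    '',
    facts,
  ].join('\n');
  return runPhase('synthesize', prompt); // runPhase is assumed to call the model and return the draft
}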


Pipeline Design

Phase Structure

| Phase | Model | Budget | Purpose |
|---|---|---|---|
| research | Sonnet | $1-4 | Gather 10-16 sources via WebSearch/WebFetch |
| extract | Sonnet | $1.50 | Pull facts into structured JSON with citations |
| synthesize | Opus/Sonnet | $1.50-2.50 | Write the article from facts.json ONLY |
| review | Opus | $1.50-2 | Identify gaps, bias, missing perspectives |
| gap-fill | Sonnet | $1.50 | Research topics identified as missing |
| polish | Opus | $1.50 | Integrate new facts, improve prose |

Tier Configurations

budget:   research-lite → extract → synthesize-sonnet → verify     (~$4.50)
standard: research → extract → synthesize-opus → review → gap-fill → polish (~$10.50)
premium:  research-deep → extract → synthesize-opus → critical-review → gap-fill → rewrite → polish (~$15)
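
Internally, the tiers amount to ordered phase lists; one way that mapping could be declared (names and shape are assumptions, not the script's actual code):

// Sketch: tiers as ordered phase lists run in sequence (assumed structure)
const TIERS = {
  budget:   ['research-lite', 'extract', 'synthesize-sonnet', 'verify'],
  standard: ['research', 'extract', 'synthesize-opus', 'review', 'gap-fill', 'polish'],
  premium:  ['research-deep', 'extract', 'synthesize-opus', 'critical-review', 'gap-fill', 'rewrite', 'polish'],
};

// Each phase writes its artifact (sources.json, facts.json, draft.mdx, ...) before the next one runs
async function runTier(topic, tier, runPhase) {
  for (const phase of TIERS[tier]) {
    await runPhase(topic, phase); // runPhase is a hypothetical per-phase executor
  }
}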

Citation Enforcement

The extract phase outputs structured JSON:

{
  "facts": [
    {
      "claim": "LessWrong was founded in February 2009",
      "sourceUrl": "https://en.wikipedia.org/wiki/LessWrong",
      "sourceTitle": "Wikipedia",
      "confidence": "high"
    }
  ],
  "controversies": [...],
  "statistics": [...],
  "gaps": ["Topics we have no facts for"]
}

The synthesize prompt then says:

"Every factual claim MUST have an inline citation. If a fact isn't in facts.json, DO NOT include it."


Experiment Design

Test Topics

| Topic | Tier | Why Chosen |
|---|---|---|
| MIRI | Budget | Well-documented nonprofit, good for testing the minimal pipeline |
| LessWrong | Standard | Existing page to compare against (quality 43) |
| Anthropic | Premium | High-profile, controversial, tests deep research |

Metrics Tracked

  • Total cost and time
  • Citation count
  • Table row count
  • Word count
  • Whether controversies section included
  • Self-assessed quality score
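
Most of these can be computed mechanically from the generated .mdx file. A rough sketch (the counting heuristics and citation/table syntax are assumptions):

// Sketch: rough metric extraction from a generated .mdx file (heuristics are assumptions)
import { readFile } from 'node:fs/promises';

async function measure(mdxPath) {
  const text = await readFile(mdxPath, 'utf8');
  return {
    wordCount: text.split(/\s+/).filter(Boolean).length,
    citations: (text.match(/\]\(https?:\/\//g) || []).length,  // inline [text](url) links
    tableRows: (text.match(/^\|.*\|$/gm) || []).length,        // pipe-delimited lines, a rough proxy for table rows
    hasControversies: /#+\s*Controvers/i.test(text),           // any "Controversies" heading
  };
}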

Results

Pipeline Completion

All three pipelines completed successfully:

| Topic | Tier | Time | Cost | Phases |
|---|---|---|---|---|
| MIRI | Budget | 10m | $4.50 | 4/4 |
| LessWrong | Standard | 16m | $10.50 | 6/6 |
| Anthropic | Premium | 24m | $15.00 | 7/7 |

Quality Metrics

| Metric | MIRI (Budget) | LessWrong (Standard) | Anthropic (Premium) |
|---|---|---|---|
| Final Quality | 75* | 78 | 78 |
| Word Count | ≈2,700 | 2,480 | 2,850 |
| Citations | ≈35 | 42 | 42 |
| Tables | ≈3 | 1 | 1 |
| Has Controversies | Yes | Yes | Yes |

*Budget tier's verify phase identified gaps but couldn't fix them.

Comparison: Original vs New LessWrong

| Aspect | Original | New (Standard) |
|---|---|---|
| Table rows | 196 | 5 |
| URLs/Citations | 41 | 46 |
| Citation density | 0.9/100 words | 2.3/100 words |
| Critical sources cited | 0 | 4 |
| Controversies | Superficial table | Full section with quotes |

Key Findings

1. Research Quality Was Excellent

All three pipelines found diverse source types:

LessWrong sources found:

  • Wikipedia article
  • Official LessWrong posts (history, surveys, FAQ)
  • EA Forum discussions
  • Critical perspectives: Bryan Caplan (Econlib), Tyler Cowen, Greg Epstein (NYT), RationalWiki

Anthropic sources found:

  • Wikipedia, official company page
  • Financial data (valuations, revenue)
  • Critical: SaferAI critique, White House feud coverage, deceptive AI behavior reports
  • Policy positions on SB 1047, export controls

2. The Review Phase Catches Real Problems

The Anthropic critical-review phase identified:

"Quick Assessment table is overwhelmingly favorable" "Company culture section reads like PR" "Several interpretive statements presented as fact without sources" "Missing: lobbying positions, concrete safety failures, competitor comparisons"

The gap-fill phase then researched exactly those topics and the rewrite integrated them.

3. Standard Tier Hits Diminishing Returns

| Tier | Cost | Quality | Notes |
|---|---|---|---|
| Budget | $4.50 | 75 | Gaps identified but not fixed |
| Standard | $10.50 | 78 | Gaps fixed |
| Premium | $15.00 | 78 | Same quality, more thorough |

The extra $4.50 from Standard to Premium didn't improve the quality score. The review→gap-fill→polish cycle is where the value is.

4. Prose-First Prompting Works

Explicit instructions matter:

  • "Maximum 4 tables"
  • "Minimum 60% prose"
  • "Tables are for genuinely comparative data, not lists"

Result: 97% reduction in table rows.
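
These constraints can live as a reusable block appended to every synthesize prompt; a sketch (the wording here is illustrative, not the exact prompt text used):

// Sketch: prose-first constraints appended to the synthesize prompt (illustrative wording)
const PROSE_CONSTRAINTS = [
  'Use a maximum of 4 tables.',
  'At least 60% of the article must be prose.',
  'Tables are for genuinely comparative data, not lists.',
].join('\n');

const synthesizePrompt = (factsJson) =>
  `Write the article from the facts below.\n${PROSE_CONSTRAINTS}\n\nFACTS:\n${factsJson}`;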


Recommendations

For This Wiki

  1. Use Standard tier ($10.50) for most pages - Best quality/cost ratio
  2. Use Budget tier ($4.50) for drafts - Good starting point for human editing
  3. Reserve Premium ($15) for controversial topics - Extra scrutiny is valuable for Anthropic, OpenAI, etc.

For the Pipeline

  1. Add to package.json for easy access:

    "scripts": {
      "create-page": "node scripts/content/page-creator-v2.mjs"
    }
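     With that entry, npm run create-page -- "Topic Name" --tier standard invokes the script (npm forwards the flags placed after --).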
    
  2. Consider batch mode - Run multiple Standard-tier pages overnight

  3. Integrate with grading - Auto-grade output and re-run if below threshold
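
A grade-and-retry wrapper for that last idea could look roughly like this (runPipeline and gradePage are hypothetical hooks; neither exists in the repo yet):

// Sketch: re-run the pipeline when the auto-grade falls below a threshold (all names hypothetical)
async function createWithGrading(topic, { tier = 'standard', threshold = 70, maxAttempts = 2 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const outputPath = await runPipeline(topic, tier); // assumed wrapper around page-creator-v2.mjs
    const score = await gradePage(outputPath);         // assumed hook into the wiki's grading
    if (score >= threshold) return { outputPath, score };
  }
  throw new Error(`"${topic}" stayed below the quality threshold after retries`);
}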

For Future Work

  1. Perplexity Deep Research integration - Could improve research phase
  2. Human-in-the-loop review - Show review.json before gap-fill for approval
  3. Incremental updates - Re-run pipeline on existing pages to improve them

Appendix: Files Created

scripts/content/page-creator-v2.mjs    # The pipeline script

.claude/temp/page-creator/
├── miri/
│   ├── sources.json      # 10 sources
│   ├── facts.json        # 18 facts, 20 stats, 8 controversies
│   ├── draft.mdx         # Final output (budget has no polish)
│   └── review.json       # Identified but unfixed gaps
├── lesswrong/
│   ├── sources.json      # Research results
│   ├── facts.json        # Extracted claims
│   ├── draft.mdx         # Initial synthesis
│   ├── review.json       # Gap analysis
│   ├── additional-facts.json  # Gap-fill results
│   ├── final.mdx         # Polished output
│   └── summary.json      # Quality metrics
└── anthropic/
    └── [same structure as lesswrong]

Usage

# Standard tier (recommended)
node scripts/content/page-creator-v2.mjs "Topic Name" --tier standard

# Budget tier (for drafts)
node scripts/content/page-creator-v2.mjs "Topic Name" --tier budget

# Premium tier (for controversial topics)
node scripts/content/page-creator-v2.mjs "Topic Name" --tier premium

# Copy output to specific location
node scripts/content/page-creator-v2.mjs "Topic Name" --tier standard --output ./my-article.mdx

Conclusion

The research-first pipeline successfully addresses the core problem of AI-generated content: unsourced, table-heavy data dumps. By structuring the process as Research → Extract → Synthesize with explicit citation requirements, we produce articles that are:

  • Well-sourced (42 citations with URLs)
  • Readable (90% prose, not tables)
  • Balanced (includes critical perspectives)
  • Cost-effective ($10.50 for production quality)

The Standard tier is recommended for most use cases. The key insight is that research quality matters more than generation quality - you can't synthesize what you haven't found.