
System Architecture

This document provides a technical overview of how the Longterm Wiki is built, how data flows through the system, and the design rationale behind key architectural decisions.

Keeping This Updated

This page documents system architecture. When making significant changes to pipelines or data flow, update the relevant sections here. See Documentation Maintenance for guidelines.


High-Level Architecture


Core Systems

1. Entity Data System

Purpose: Maintain structured data about people, organizations, and concepts that can be referenced across pages.

Flow: YAML sources → build-data.mjs → JSON artifacts → React components

| Component | Location | Purpose |
| --- | --- | --- |
| Source YAML | src/data/*.yaml | Human-editable entity definitions |
| Build script | scripts/build-data.mjs | Compiles YAML to JSON |
| Generated JSON | src/data/*.json | Browser-ready data |
| Components | src/components/wiki/ | Display entity data |

Key files generated:

  • database.json - All entities merged
  • pathRegistry.json - Entity ID → URL path mapping
  • backlinks.json - Reverse reference indices

Design decision: YAML for human editing, JSON for runtime. This separation allows manual curation while keeping the site fast.
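
As a sketch of the compile step (not the actual build-data.mjs, which handles more cases), assuming each entity carries an id and an optional URL path:

```ts
// Illustrative sketch of the YAML → JSON compile step. The field names
// (`id`, `path`) are assumptions; the real build script differs.
import { readFileSync, writeFileSync, readdirSync } from "node:fs";
import { parse } from "yaml";

type Entity = { id: string; path?: string; [key: string]: unknown };

const database: Record<string, Entity> = {};
const pathRegistry: Record<string, string> = {};

for (const file of readdirSync("src/data").filter((f) => f.endsWith(".yaml"))) {
  const entities = parse(readFileSync(`src/data/${file}`, "utf8")) as Entity[];
  for (const entity of entities) {
    database[entity.id] = entity; // merged into database.json
    if (entity.path) pathRegistry[entity.id] = entity.path; // ID → URL path
  }
}

writeFileSync("src/data/database.json", JSON.stringify(database, null, 2));
writeFileSync("src/data/pathRegistry.json", JSON.stringify(pathRegistry, null, 2));
```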


2. Knowledge Database (SQLite)

Purpose: Index content for analysis, cache external sources, and support AI-assisted workflows.

Location: .cache/knowledge.db (gitignored, regenerated per machine)


Key capabilities:

  • Content indexing and search
  • Source fetching via Firecrawl API
  • AI summary generation
  • Change detection via content hashing
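
To illustrate the last capability: change detection can reduce to comparing a stored content hash. A minimal sketch, assuming a pages table with a content_hash column and the better-sqlite3 driver (both assumptions; see the Content Database page for the real schema):

```ts
// Sketch of hash-based change detection. Table and column names are
// assumptions; the actual knowledge.db schema may differ.
import { createHash } from "node:crypto";
import Database from "better-sqlite3";

const db = new Database(".cache/knowledge.db");

function needsReindex(pageId: string, content: string): boolean {
  const hash = createHash("sha256").update(content).digest("hex");
  const row = db
    .prepare("SELECT content_hash FROM pages WHERE id = ?")
    .get(pageId) as { content_hash: string } | undefined;
  return row?.content_hash !== hash; // re-index only when the hash changed
}
```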

See: Content Database for full schema and API reference.


3. Page Creation Pipeline

Purpose: Generate new wiki pages with proper citations using AI research and synthesis.

Pipeline phases:

canonical-links → research-perplexity → register-sources → fetch-sources
    → research-scry → synthesize → verify-sources → validate-loop → grade
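
Conceptually, each phase is a step that enriches a shared context. A hypothetical runner (the real wiring in page-creator.mjs differs):

```ts
// Hypothetical phase runner; the context shape and phase bodies are
// illustrative, not the actual page-creator.mjs implementation.
type Context = { topic: string; citationUrls: string[]; draft?: string };
type Phase = (ctx: Context) => Promise<Context>;

const phases: [string, Phase][] = [
  ["canonical-links", async (ctx) => ctx],
  ["research-perplexity", async (ctx) => ctx],
  // ...register-sources, fetch-sources, research-scry, synthesize,
  // verify-sources, validate-loop, grade
];

async function runPipeline(ctx: Context): Promise<Context> {
  for (const [name, phase] of phases) {
    console.log(`phase: ${name}`);
    ctx = await phase(ctx); // each phase reads and extends the context
  }
  return ctx;
}
```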

Design decisions:

| Decision | Rationale |
| --- | --- |
| Perplexity for research | Cheap ($0.10), good at web search, provides citation URLs |
| Register + fetch sources | Enables quote verification against actual content |
| Verify-sources phase | Catches hallucinated quotes before publication |
| Validation loop | Iterative fixing ensures build-passing output |
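
The verify-sources phase, for example, can be sketched as checking each cited quote against the fetched source text (a simplification; the real check may normalize whitespace or match more loosely):

```ts
// Sketch of the verify-sources check. `getCachedSource` is a hypothetical
// helper over the knowledge DB, not an actual exported function.
type Citation = { url: string; quote: string };

export async function findHallucinatedQuotes(
  citations: Citation[],
  getCachedSource: (url: string) => Promise<string>,
): Promise<Citation[]> {
  const flagged: Citation[] = [];
  for (const citation of citations) {
    const source = await getCachedSource(citation.url);
    if (!source.includes(citation.quote)) flagged.push(citation); // not found verbatim
  }
  return flagged;
}
```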

Cost tiers:

  • Budget: $2-3 (no source fetching)
  • Standard: $4-6 (with source fetching + verification)
  • Premium: $8-12 (deep research + review)

See: Page Creator Pipeline for experiment results.


4. Source Fetching System

Purpose: Fetch and cache actual webpage content for citation verification.

Flow:

Citation URLs (from Perplexity)
    → Register in SQLite (sources table)
    → Fetch via Firecrawl API
    → Store in SQLite + .cache/sources/
    → Use in quote verification

Components:

| Component | Location | Purpose |
| --- | --- | --- |
| Knowledge DB | scripts/lib/knowledge-db.mjs | SQLite wrapper, source tracking |
| Fetch script | scripts/utils/fetch-sources.mjs | Standalone Firecrawl fetcher |
| Page creator | scripts/content/page-creator.mjs | Integrated fetch during page creation |

Rate limiting: 7 seconds between requests (Firecrawl free tier limit).
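
A sketch of the fetch loop, calling Firecrawl's scrape endpoint directly (the endpoint and payload shape should be checked against current Firecrawl documentation):

```ts
// Sketch of rate-limited source fetching. The v1/scrape endpoint and
// response shape should be verified against the Firecrawl API docs.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchSources(urls: string[]): Promise<Map<string, string>> {
  const results = new Map<string, string>();
  for (const url of urls) {
    const res = await fetch("https://api.firecrawl.dev/v1/scrape", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.FIRECRAWL_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url, formats: ["markdown"] }),
    });
    const data = await res.json();
    results.set(url, data.data?.markdown ?? "");
    await sleep(7000); // free-tier limit: one request per 7 seconds
  }
  return results;
}
```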

Design decision: Fetch sources on demand during page creation rather than eagerly in advance. This keeps costs predictable and ensures we only fetch sources we actually need.


5. Validation System

Purpose: Ensure content quality and prevent build failures.

Architecture: Unified rules engine with 20+ validators.

npm run crux -- validate unified --rules=dollar-signs,entitylink-ids

Rule categories:

| Category | Examples | Blocking? |
| --- | --- | --- |
| Critical | dollar-signs, entitylink-ids, fake-urls | Yes - breaks build |
| Quality | tilde-dollar, markdown-lists, placeholders | No - warnings only |

Design decision: Two-tier validation allows fast feedback while still catching serious issues. Critical rules run in CI; quality rules are advisory.
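
One way to express the two tiers is a blocking flag on each rule; the shapes and checks below are illustrative guesses, not the actual rules-engine API:

```ts
// Illustrative two-tier rule shape; the checks are guesses at what
// these rules enforce.
type Rule = {
  id: string;
  blocking: boolean; // true: breaks the build; false: advisory warning
  check: (content: string) => string[]; // returns violation messages
};

const rules: Rule[] = [
  {
    id: "dollar-signs",
    blocking: true,
    check: (content) => (content.includes("$") ? ["unescaped dollar sign"] : []),
  },
  {
    id: "placeholders",
    blocking: false,
    check: (content) => (content.includes("TODO") ? ["placeholder text found"] : []),
  },
];

export function validate(content: string): { errors: string[]; warnings: string[] } {
  const errors: string[] = [];
  const warnings: string[] = [];
  for (const rule of rules) {
    const hits = rule.check(content).map((msg) => `${rule.id}: ${msg}`);
    (rule.blocking ? errors : warnings).push(...hits);
  }
  return { errors, warnings };
}
```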


Data Flow Diagrams

Page Creation Data Flow

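The flow can be read as a series of intermediate artifacts, one per phase group (the type shapes below are illustrative, not the pipeline's actual data structures):

```ts
// Hypothetical intermediate artifacts in the page creation data flow.
export type ResearchResult = { topic: string; findings: string; citationUrls: string[] };
export type FetchedSource = { url: string; markdown: string }; // cached in SQLite + .cache/sources/
export type Draft = { mdx: string; citations: { url: string; quote: string }[] };
export type GradedPage = { mdx: string; grade: number }; // output of validate-loop + grade
```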

Entity Resolution Flow

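In code terms, resolution is a lookup in the generated registry. A sketch, assuming an EntityLink component resolves IDs through pathRegistry.json (the helper below is hypothetical, not the actual component):

```ts
// Sketch of entity ID → URL resolution via the generated registry.
import pathRegistry from "../data/pathRegistry.json";

export function resolveEntityPath(id: string): string {
  const path = (pathRegistry as Record<string, string>)[id];
  if (!path) throw new Error(`Unknown entity ID: ${id}`); // what entitylink-ids guards against
  return path;
}
```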

Design Principles

1. Separation of Source and Generated Data

Principle: Human-editable files (YAML, MDX) are separate from generated artifacts (JSON, SQLite).

Why:

  • Generated files can be regenerated from source
  • No merge conflicts on generated files (gitignored where appropriate)
  • Clear ownership: humans edit YAML, scripts generate JSON

2. Local-First Caching

Principle: Cache computationally expensive results locally (SQLite, .cache/).

Why:

  • AI summaries are expensive; don't regenerate unnecessarily
  • Source fetching has API costs; cache results
  • Content hashing enables incremental updates

Trade-off: Cache must be rebuilt on new machines. This is acceptable because:

  • Build is deterministic from source files
  • Cache is optimization, not source of truth

3. Progressive Enhancement for AI Features

Principle: AI features (summaries, page creation) are optional enhancements.

Why:

  • Wiki works without API keys
  • Failures in AI pipeline don't break the site
  • Costs are predictable and opt-in

4. Validation at Multiple Levels

Principle: Catch errors early and at appropriate granularity.

| Level | Tool | When |
| --- | --- | --- |
| Syntax | MDX compiler | Build time |
| Schema | Zod validation | Build time |
| References | EntityLink validator | CI |
| Quality | Grading pipeline | Manual trigger |

Key Configuration Files

| File | Purpose | When to Edit |
| --- | --- | --- |
| astro.config.mjs | Sidebar structure, Starlight config | Adding new sections |
| src/content.config.ts | MDX frontmatter schema | Adding frontmatter fields |
| src/data/schema.ts | Entity type definitions (Zod) | Adding entity types |
| scripts/lib/knowledge-db.mjs | SQLite schema | Adding database tables |
| scripts/content/page-creator.mjs | Page creation pipeline | Modifying generation flow |
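
For orientation, an entity type in src/data/schema.ts might look like the following (fields are hypothetical; consult the actual file):

```ts
// Hypothetical entity schema; the actual definitions in src/data/schema.ts differ.
import { z } from "zod";

export const entitySchema = z.object({
  id: z.string(),
  type: z.enum(["person", "organization", "concept"]),
  name: z.string(),
  path: z.string().optional(), // feeds pathRegistry.json
});

export type Entity = z.infer<typeof entitySchema>;
```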

Environment Variables

| Variable | Purpose | Required For |
| --- | --- | --- |
| ANTHROPIC_API_KEY | Claude API access | Summaries, grading, page creation |
| OPENROUTER_API_KEY | Perplexity via OpenRouter | Page creation research |
| FIRECRAWL_KEY | Web page fetching | Source content fetching |
All are optional. Features gracefully degrade when keys are missing.
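
A sketch of how that degradation can work, gating each feature on its key at startup (illustrative; the actual scripts may check differently):

```ts
// Sketch of gating optional AI features on their keys; names are illustrative.
const capabilities = {
  summaries: Boolean(process.env.ANTHROPIC_API_KEY),
  research: Boolean(process.env.OPENROUTER_API_KEY),
  sourceFetching: Boolean(process.env.FIRECRAWL_KEY),
};

if (!capabilities.summaries) {
  console.warn("ANTHROPIC_API_KEY not set; skipping AI summaries.");
}
```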


Documentation Maintenance

Keeping Docs Updated

This architecture documentation should be updated when:

  1. New pipeline phases - Update the pipeline diagram and phase list
  2. Database schema changes - Update the ER diagram
  3. New environment variables - Add to the environment variables table
  4. New validation rules - Document in the validation section

Freshness Indicators

Each internal doc should include:

  • lastEdited in frontmatter (updated when content changes)
  • Verification notes for time-sensitive information

Automated Checks

Consider adding:

# Check if docs mention deprecated scripts
npm run crux -- validate docs-freshness

Cross-References

When code changes affect documentation:

  1. Update the relevant internal doc
  2. Add a comment in the code: // Docs: /internal/architecture/#section-name
  3. Run npm run build to verify links still work
