Summary

Documentation for an internal SQLite-based content management system that indexes MDX articles, tracks citations to external sources, and generates AI summaries using Claude models. The system provides CLI tools for scanning content (~311 articles), generating summaries (estimated $2-3 for full corpus using Haiku), and exporting data to YAML for site builds.

Content Database System

This project includes a SQLite-based content database for managing article metadata, tracking external source references, and generating AI-powered summaries. The database is stored locally in .cache/knowledge.db (gitignored) and serves as a foundation for content quality tooling.


Quick Start

# Scan all content files and populate database
npm run crux -- analyze scan

# Generate AI summaries for articles
npm run crux -- generate summaries

# Export sources to resources.yaml
node scripts/utils/export-resources.mjs

# View database statistics
node scripts/scan-content.mjs --stats
Note: Requires ANTHROPIC_API_KEY in the .env file for summary generation.
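A minimal .env might look like this (placeholder values; FIRECRAWL_KEY is only needed for the source fetching described later):

ANTHROPIC_API_KEY=your-anthropic-key
FIRECRAWL_KEY=your-firecrawl-key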



Database Schema

articles

Stores content extracted from MDX files.

| Column | Type | Description |
|---|---|---|
| id | TEXT PRIMARY KEY | Entity ID from filename |
| path | TEXT | Relative path to source file |
| title | TEXT | Article title from frontmatter |
| description | TEXT | Article description |
| content | TEXT | Plain text content (JSX removed) |
| word_count | INTEGER | Word count for prioritization |
| quality | INTEGER | Quality rating from frontmatter |
| content_hash | TEXT | MD5 hash for change detection |
| created_at | TEXT | When article was first indexed |
| updated_at | TEXT | When article was last updated |

sources

Stores metadata about external references discovered in articles.

| Column | Type | Description |
|---|---|---|
| id | TEXT PRIMARY KEY | SHA256 hash of URL/DOI (16 chars) |
| url | TEXT | Full URL of external source |
| doi | TEXT | Digital Object Identifier (if paper) |
| title | TEXT | Source title/headline |
| authors | TEXT (JSON) | Array of author names |
| year | INTEGER | Publication year |
| source_type | TEXT | Type: paper, blog, report, web, etc. |
| content | TEXT | Fetched source content |
| fetch_status | TEXT | pending, fetched, failed, manual |
| fetched_at | TEXT | When source was last fetched |

article_sources

Junction table linking articles to their cited sources.

| Column | Type | Description |
|---|---|---|
| article_id | TEXT | Foreign key to articles |
| source_id | TEXT | Foreign key to sources |
| citation_context | TEXT | Quote where source is cited |

summaries

Stores AI-generated summaries of articles and sources.

| Column | Type | Description |
|---|---|---|
| entity_id | TEXT PRIMARY KEY | ID of summarized entity |
| entity_type | TEXT | 'article' or 'source' |
| one_liner | TEXT | Single-sentence summary (max 25 words) |
| summary | TEXT | 2-3 paragraph summary |
| key_points | TEXT (JSON) | 3-5 bullet points |
| key_claims | TEXT (JSON) | Claims with values |
| model | TEXT | Model used (e.g., claude-3-5-haiku) |
| tokens_used | INTEGER | Total tokens consumed |
| generated_at | TEXT | When summary was generated |

entity_relations

Entity relationships loaded from entities.yaml.

| Column | Type | Description |
|---|---|---|
| from_id | TEXT | Source entity ID |
| to_id | TEXT | Target entity ID |
| relationship | TEXT | Relationship type |

Commands Reference

Scan Content

Scans all MDX files and populates the database.

# Standard scan (skips unchanged files)
npm run crux -- analyze scan

# Force rescan all files
node scripts/scan-content.mjs --force

# Show per-file progress
node scripts/scan-content.mjs --verbose

# Show database stats only
node scripts/scan-content.mjs --stats

What it does:

  1. Finds all .mdx and .md files in src/content/docs/
  2. Extracts frontmatter (title, description, quality, sources)
  3. Extracts plain text content (removes imports, JSX, HTML comments)
  4. Discovers URLs from markdown links and DOIs
  5. Infers source types from domains (arxiv.org → paper, lesswrong.com → blog)
  6. Loads entity relations from entities.yaml
  7. Skips unchanged files via content hash comparison
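
The skip logic in step 7 boils down to hashing the extracted text and asking the database whether it changed. A minimal sketch, assuming Node's built-in crypto module and the articles API documented below (the actual scanner may differ in detail):

// Hash the extracted plain text and skip articles whose stored content_hash matches
import { createHash } from 'node:crypto';
import { articles } from './scripts/lib/knowledge-db.mjs';

const plainText = '...plain text extracted from an MDX file...';
const newHash = createHash('md5').update(plainText).digest('hex');

if (!articles.hasChanged('deceptive-alignment', newHash)) {
  // Unchanged since the last scan – nothing to re-index
}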

Generate Summaries

Generates AI summaries using the Anthropic API.

# Summarize 10 articles (default)
npm run crux -- generate summaries

# Summarize specific count
npm run crux -- generate summaries --batch 50

# Use higher-quality model
npm run crux -- generate summaries --model sonnet

# Summarize specific article
npm run crux -- generate summaries --id deceptive-alignment

# Re-summarize changed content
npm run crux -- generate summaries --resummary

# Preview without API calls
npm run crux -- generate summaries --dry-run

Models available:

| Model | ID | Cost (per 1M input tokens) | Use case |
|---|---|---|---|
| haiku | claude-haiku-4-5-20251001 | ≈$0.25 | Bulk summarization |
| sonnet | claude-sonnet-4-5-20250929 | ≈$3.00 | Higher quality |
| opus | claude-opus-4-6 | ≈$5.00 | Best quality |

Cost estimates:

| Task | Model | Estimated Cost |
|---|---|---|
| Summarize 311 articles | Haiku | ≈$2-3 |
| Summarize 793 sources | Haiku | ≈$10-15 |
| Single article improvement | Sonnet | ≈$0.20 |

Database Statistics

Display database statistics.

node scripts/scan-content.mjs --stats

Core Module API

The database is accessed via scripts/lib/knowledge-db.mjs.

Import
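
The module's exact export list isn't documented here; inferred from the usage examples below, the import would look roughly like this:

// Assumed named exports, based on the calls shown in the examples that follow
import {
  articles,
  sources,
  summaries,
  getResearchContext,
  getStats,
} from './scripts/lib/knowledge-db.mjs';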

articles API

// Get article by ID
const article = articles.get('deceptive-alignment');

// Get article with its summary
const withSummary = articles.getWithSummary('deceptive-alignment');

// Get all articles
const all = articles.getAll();

// Find articles needing summaries
const unsummarized = articles.needingSummary();

// Search articles
const results = articles.search('reward hacking');

// Check if content changed
const changed = articles.hasChanged('id', newHash);

// Insert/update article
articles.upsert({
  id: 'my-article',
  path: 'knowledge-base/risks/my-article.mdx',
  title: 'My Article',
  description: 'Description here',
  content: 'Full text content...',
  word_count: 1500,
  quality: 3,
  content_hash: 'abc123...'
});

sources API

// Get source by ID or URL
const source = sources.get('abc123def456');
const byUrl = sources.getByUrl('https://arxiv.org/...');

// Get sources for an article
const articleSources = sources.getForArticle('deceptive-alignment');

// Get pending sources for fetching
const pending = sources.getPending(100);

// Link source to article
sources.linkToArticle('article-id', 'source-id', 'citation context');

// Mark source fetch status
sources.markFetched('source-id', 'content...');
sources.markFailed('source-id', 'Error message');

// Get statistics
const stats = sources.stats();
// { total: 793, pending: 793, fetched: 0, failed: 0, manual: 0 }

summaries API

// Get summary by entity ID
const summary = summaries.get('deceptive-alignment');

// Get all summaries of a type
const articleSummaries = summaries.getAll('article');

// Insert/update summary
summaries.upsert('entity-id', 'article', {
  oneLiner: 'Single sentence...',
  summary: 'Full summary...',
  keyPoints: ['Point 1', 'Point 2'],
  keyClaims: [{ claim: '...', value: '...' }],
  model: 'claude-haiku-4-5-20251001',
  tokensUsed: 1247
});

// Get statistics
const stats = summaries.stats();
// { article: { count: 311, tokens: 387000 }, source: { count: 0, tokens: 0 } }

Research Context

Get comprehensive context for improving an article:

const context = getResearchContext('deceptive-alignment');
// Returns:
// {
//   article: { ...article, summary: {...} },
//   relatedArticles: [...],
//   sources: [...],
//   claims: [...],
//   stats: { relatedCount, sourcesTotal, sourcesFetched, claimsCount }
// }

Source Type Inference

When scanning content, source types are inferred from domains:

| Domain Pattern | Inferred Type |
|---|---|
| arxiv.org, doi.org, nature.com | paper |
| lesswrong.com, alignmentforum.org | blog |
| substack.com | blog |
| .gov, congress.gov, whitehouse.gov | government |
| wikipedia.org | reference |
| .pdf (any domain) | report |
| (default) | web |
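
Expressed as code, the table above amounts to a small mapping function. This is an illustrative sketch, not the scanner's actual implementation (rule order, e.g. whether .pdf wins over a paper domain, is assumed):

// Hypothetical helper mirroring the inference table
function inferSourceType(url) {
  const u = url.toLowerCase();
  if (u.endsWith('.pdf')) return 'report';
  if (/arxiv\.org|doi\.org|nature\.com/.test(u)) return 'paper';
  if (/lesswrong\.com|alignmentforum\.org|substack\.com/.test(u)) return 'blog';
  if (/\.gov(\/|$)|congress\.gov|whitehouse\.gov/.test(u)) return 'government';
  if (u.includes('wikipedia.org')) return 'reference';
  return 'web';
}

inferSourceType('https://arxiv.org/abs/2312.06942'); // → 'paper'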

Directory Structure

project/
├── .cache/                         # Gitignored
│   ├── knowledge.db               # SQLite database
│   └── sources/                   # Cached source documents
│       ├── pdf/
│       ├── html/
│       └── text/
├── scripts/
│   ├── lib/
│   │   ├── knowledge-db.mjs       # Core DB module
│   │   ├── file-utils.mjs         # File discovery
│   │   ├── mdx-utils.mjs          # MDX parsing
│   │   └── output.mjs             # Terminal formatting
│   ├── scan-content.mjs           # Content scanner
│   └── generate-summaries.mjs     # AI summarization
├── src/content/docs/              # Source MDX files
└── .env                           # API credentials

Workflow Examples

After Editing Content

# 1. Scan for changes (fast, uses hash comparison)
npm run crux -- analyze scan

# 2. Generate summaries for new/changed articles
npm run crux -- generate summaries --resummary

Bulk Initial Setup

# 1. Scan all content
node scripts/scan-content.mjs --force

# 2. Generate summaries in batches
npm run crux -- generate summaries --batch 100
npm run crux -- generate summaries --batch 100
npm run crux -- generate summaries --batch 100
# ... repeat until done

Check Database State

# View statistics
node scripts/scan-content.mjs --stats

# Or programmatically
node -e "

console.log(JSON.stringify(getStats(), null, 2));
"

Exporting to YAML (Resources System)

The database serves as a processing layer. Canonical data is exported to YAML files for the site build.

Export Resources

# Export cited sources to resources.yaml
node scripts/export-resources.mjs

# Export ALL sources (including uncited)
node scripts/export-resources.mjs --all

# Preview without writing
node scripts/export-resources.mjs --dry-run

The export script:

  • Reads sources from SQLite with their AI summaries
  • Merges with existing src/data/resources.yaml (preserves manual edits)
  • Includes cited_by references showing which articles cite each source

Using Resources in MDX

Once resources are in resources.yaml, you can reference them semantically:


Recent work on AI control <ResourceCite id="ai-control-2023" /> shows...

See also: <ResourceLink id="superintelligence-2014" />

## Key Papers

<ResourceList
  ids={["ai-control-2023", "concrete-problems-2016"]}
  showSummaries
/>

Resource Schema

Resources in resources.yaml have this structure:

- id: ai-control-2023
  url: https://arxiv.org/abs/2312.06942
  title: "AI Control: Improving Safety..."
  authors: ["Ryan Greenblatt", "Buck Shlegeris"]
  published_date: "2023-12"
  type: paper  # paper, blog, book, report, talk, podcast, government, reference, web
  summary: "AI-generated summary..."
  key_points:
    - "Point 1"
    - "Point 2"
  cited_by:
    - agentic-ai
    - ai-control

Source Fetching

The system now includes automatic source fetching via the Firecrawl API. This enables:

  1. Citation verification - Check that quoted text actually appears in sources
  2. Content caching - Store fetched pages for offline access
  3. Quote validation - Catch hallucinated quotes during page creation
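
A rough sketch of how the stored data supports quote validation, assuming sources.getForArticle returns each row joined with its citation_context and fetched content (the real pipeline may differ):

// Check that each stored citation quote actually appears in the fetched source text
import { sources } from './scripts/lib/knowledge-db.mjs';

const cited = sources.getForArticle('deceptive-alignment');
for (const src of cited) {
  if (src.fetch_status !== 'fetched' || !src.citation_context) continue;
  const verified = src.content.includes(src.citation_context);
  console.log(`${src.id}: ${verified ? 'quote verified' : 'quote NOT found'}`);
}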

Standalone Fetching

# Fetch pending sources (up to 20)
node scripts/utils/fetch-sources.mjs --batch 20

# Fetch a specific source by ID
node scripts/utils/fetch-sources.mjs --id abc123def456

# Retry failed sources
node scripts/utils/fetch-sources.mjs --retry-failed

# Preview without fetching
node scripts/utils/fetch-sources.mjs --dry-run

Integrated with Page Creation

The page creator pipeline automatically:

  1. Extracts citation URLs from Perplexity research
  2. Registers them in the sources table
  3. Fetches content via Firecrawl (up to 15 per page)
  4. Uses fetched content for quote verification

# Standard tier includes source fetching
node scripts/content/page-creator.mjs "Topic" --tier standard

Rate Limiting

Firecrawl's free tier allows roughly 10 requests per minute. The system waits 7 seconds between requests to stay under that limit.
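
In code, that pacing is just a fixed delay between requests. A simplified sketch (fetchSource is a hypothetical wrapper around the Firecrawl call; the real script's batching may differ):

// Pace requests at one per ~7s to stay under the ~10 requests/minute free tier
import { sources } from './scripts/lib/knowledge-db.mjs';

const DELAY_MS = 7000;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

for (const source of sources.getPending(20)) {
  await fetchSource(source); // hypothetical: scrapes the URL via Firecrawl and stores the result
  await sleep(DELAY_MS);
}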

Fetch Status Values

| Status | Meaning |
|---|---|
| pending | Not yet attempted |
| fetched | Successfully retrieved and stored |
| failed | Fetch attempted but failed (error stored) |
| manual | Requires manual handling (paywalled, etc.) |

Limitations

  1. Claims extraction minimal: The claims table exists but extraction is not fully implemented
  2. Local only: Database is gitignored and must be regenerated on each machine
  3. No incremental summary updates: Summaries are regenerated from scratch, not updated
  4. Firecrawl API key required: Source fetching requires FIRECRAWL_KEY in .env

Future Enhancements

Potential improvements to the system:

  • PDF parsing for academic papers
  • Claims extraction and consistency checking across articles
  • Similarity search using embeddings
  • Migration of entity sources to use resource IDs
  • Integration with content validation tools
