Summary

Documentation for an internal SQLite-based content management system that indexes MDX articles, tracks citations to external sources, and generates AI summaries using Claude models. The system provides CLI tools for scanning content (~311 articles), generating summaries (estimated $2-3 for full corpus using Haiku), and exporting data to YAML for site builds.

Content Database System

This project includes a SQLite-based content database for managing article metadata, tracking external source references, and generating AI-powered summaries. The database is stored locally in .cache/knowledge.db (gitignored) and serves as a foundation for content quality tooling.


Quick Start

# Scan all content files and populate database
npm run crux -- analyze scan

# Generate AI summaries for articles
npm run crux -- generate summaries

# Export sources to resources.yaml
node scripts/utils/export-resources.mjs

# View database statistics
node scripts/scan-content.mjs --stats
Note: Requires ANTHROPIC_API_KEY in the .env file for summary generation.
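A minimal .env might look like this (placeholder values; FIRECRAWL_KEY is only needed for the source fetching described later):

ANTHROPIC_API_KEY=your-anthropic-key
FIRECRAWL_KEY=your-firecrawl-key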



Database Schema

articles

Stores content extracted from MDX files.

| Column | Type | Description |
|---|---|---|
| id | TEXT PRIMARY KEY | Entity ID from filename |
| path | TEXT | Relative path to source file |
| title | TEXT | Article title from frontmatter |
| description | TEXT | Article description |
| content | TEXT | Plain text content (JSX removed) |
| word_count | INTEGER | Word count for prioritization |
| quality | INTEGER | Quality rating from frontmatter |
| content_hash | TEXT | MD5 hash for change detection |
| created_at | TEXT | When article was first indexed |
| updated_at | TEXT | When article was last updated |

sources

Stores metadata about external references discovered in articles.

| Column | Type | Description |
|---|---|---|
| id | TEXT PRIMARY KEY | SHA256 hash of URL/DOI (16 chars) |
| url | TEXT | Full URL of external source |
| doi | TEXT | Digital Object Identifier (if paper) |
| title | TEXT | Source title/headline |
| authors | TEXT (JSON) | Array of author names |
| year | INTEGER | Publication year |
| source_type | TEXT | Type: paper, blog, report, web, etc. |
| content | TEXT | Fetched source content |
| fetch_status | TEXT | pending, fetched, failed, manual |
| fetched_at | TEXT | When source was last fetched |

article_sources

Junction table linking articles to their cited sources.

| Column | Type | Description |
|---|---|---|
| article_id | TEXT | Foreign key to articles |
| source_id | TEXT | Foreign key to sources |
| citation_context | TEXT | Quote where source is cited |

summaries

Stores AI-generated summaries of articles and sources.

| Column | Type | Description |
|---|---|---|
| entity_id | TEXT PRIMARY KEY | ID of summarized entity |
| entity_type | TEXT | 'article' or 'source' |
| one_liner | TEXT | Single-sentence summary (max 25 words) |
| summary | TEXT | 2-3 paragraph summary |
| key_points | TEXT (JSON) | 3-5 bullet points |
| key_claims | TEXT (JSON) | Claims with values |
| model | TEXT | Model used (e.g., claude-3-5-haiku) |
| tokens_used | INTEGER | Total tokens consumed |
| generated_at | TEXT | When summary was generated |

entity_relations

Entity relationships loaded from entities.yaml.

| Column | Type | Description |
|---|---|---|
| from_id | TEXT | Source entity ID |
| to_id | TEXT | Target entity ID |
| relationship | TEXT | Relationship type |

Commands Reference

Scan Content

Scans all MDX files and populates the database.

# Standard scan (skips unchanged files)
npm run crux -- analyze scan

# Force rescan all files
node scripts/scan-content.mjs --force

# Show per-file progress
node scripts/scan-content.mjs --verbose

# Show database stats only
node scripts/scan-content.mjs --stats

What it does:

  1. Finds all .mdx and .md files in src/content/docs/
  2. Extracts frontmatter (title, description, quality, sources)
  3. Extracts plain text content (removes imports, JSX, HTML comments)
  4. Discovers URLs from markdown links and DOIs
  5. Infers source types from domains (arxiv.org → paper, lesswrong.com → blog)
  6. Loads entity relations from entities.yaml
  7. Skips unchanged files via content hash comparison
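
The skip logic in step 7 boils down to hashing the extracted text and asking the database whether it changed. A minimal sketch, assuming Node's built-in crypto module and the articles API documented below (the actual scanner may differ in detail):

// Hash the extracted plain text and skip articles whose stored content_hash matches
import { createHash } from 'node:crypto';
import { articles } from './scripts/lib/knowledge-db.mjs';

const plainText = '...plain text extracted from an MDX file...';
const newHash = createHash('md5').update(plainText).digest('hex');

if (!articles.hasChanged('deceptive-alignment', newHash)) {
  // Unchanged since the last scan – nothing to re-index
}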

Generate Summaries

Generates AI summaries using the Anthropic API.

# Summarize 10 articles (default)
npm run crux -- generate summaries

# Summarize specific count
npm run crux -- generate summaries --batch 50

# Use higher-quality model
npm run crux -- generate summaries --model sonnet

# Summarize specific article
npm run crux -- generate summaries --id deceptive-alignment

# Re-summarize changed content
npm run crux -- generate summaries --resummary

# Preview without API calls
npm run crux -- generate summaries --dry-run

Models available:

| Model | ID | Cost (per 1M input tokens) | Use case |
|---|---|---|---|
| haiku | claude-haiku-4-5-20251001 | ≈$0.25 | Bulk summarization |
| sonnet | claude-sonnet-4-5-20250929 | ≈$3.00 | Higher quality |
| opus | claude-opus-4-6 | ≈$5.00 | Best quality |

Cost estimates:

| Task | Model | Estimated Cost |
|---|---|---|
| Summarize 311 articles | Haiku | ≈$2-3 |
| Summarize 793 sources | Haiku | ≈$10-15 |
| Single article improvement | Sonnet | ≈$0.20 |

Database Statistics

Display database statistics.

node scripts/scan-content.mjs --stats

Core Module API

The database is accessed via scripts/lib/knowledge-db.mjs.

Import
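
The module's exact export list isn't documented here; inferred from the usage examples below, the import would look roughly like this:

// Assumed named exports, based on the calls shown in the examples that follow
import {
  articles,
  sources,
  summaries,
  getResearchContext,
  getStats,
} from './scripts/lib/knowledge-db.mjs';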

articles API

// Get article by ID
const article = articles.get('deceptive-alignment');

// Get article with its summary
const withSummary = articles.getWithSummary('deceptive-alignment');

// Get all articles
const all = articles.getAll();

// Find articles needing summaries
const unsummarized = articles.needingSummary();

// Search articles
const results = articles.search('reward hacking');

// Check if content changed
const changed = articles.hasChanged('id', newHash);

// Insert/update article
articles.upsert({
  id: 'my-article',
  path: 'knowledge-base/risks/my-article.mdx',
  title: 'My Article',
  description: 'Description here',
  content: 'Full text content...',
  word_count: 1500,
  quality: 3,
  content_hash: 'abc123...'
});

sources API

// Get source by ID or URL
const source = sources.get('abc123def456');
const byUrl = sources.getByUrl('https://arxiv.org/...');

// Get sources for an article
const articleSources = sources.getForArticle('deceptive-alignment');

// Get pending sources for fetching
const pending = sources.getPending(100);

// Link source to article
sources.linkToArticle('article-id', 'source-id', 'citation context');

// Mark source fetch status
sources.markFetched('source-id', 'content...');
sources.markFailed('source-id', 'Error message');

// Get statistics
const stats = sources.stats();
// { total: 793, pending: 793, fetched: 0, failed: 0, manual: 0 }

summaries API

// Get summary by entity ID
const summary = summaries.get('deceptive-alignment');

// Get all summaries of a type
const articleSummaries = summaries.getAll('article');

// Insert/update summary
summaries.upsert('entity-id', 'article', {
  oneLiner: 'Single sentence...',
  summary: 'Full summary...',
  keyPoints: ['Point 1', 'Point 2'],
  keyClaims: [{ claim: '...', value: '...' }],
  model: 'claude-haiku-4-5-20251001',
  tokensUsed: 1247
});

// Get statistics
const stats = summaries.stats();
// { article: { count: 311, tokens: 387000 }, source: { count: 0, tokens: 0 } }

Research Context

Get comprehensive context for improving an article:

const context = getResearchContext('deceptive-alignment');
// Returns:
// {
//   article: { ...article, summary: {...} },
//   relatedArticles: [...],
//   sources: [...],
//   claims: [...],
//   stats: { relatedCount, sourcesTotal, sourcesFetched, claimsCount }
// }

Source Type Inference

When scanning content, source types are inferred from domains:

| Domain Pattern | Inferred Type |
|---|---|
| arxiv.org, doi.org, nature.com | paper |
| lesswrong.com, alignmentforum.org | blog |
| substack.com | blog |
| .gov, congress.gov, whitehouse.gov | government |
| wikipedia.org | reference |
| .pdf (any domain) | report |
| (default) | web |
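
Expressed as code, the table above amounts to a small mapping function. This is an illustrative sketch, not the scanner's actual implementation (rule order, e.g. whether .pdf wins over a paper domain, is assumed):

// Hypothetical helper mirroring the inference table
function inferSourceType(url) {
  const u = url.toLowerCase();
  if (u.endsWith('.pdf')) return 'report';
  if (/arxiv\.org|doi\.org|nature\.com/.test(u)) return 'paper';
  if (/lesswrong\.com|alignmentforum\.org|substack\.com/.test(u)) return 'blog';
  if (/\.gov(\/|$)|congress\.gov|whitehouse\.gov/.test(u)) return 'government';
  if (u.includes('wikipedia.org')) return 'reference';
  return 'web';
}

inferSourceType('https://arxiv.org/abs/2312.06942'); // → 'paper'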

Directory Structure

project/
├── .cache/                         # Gitignored
│   ├── knowledge.db               # SQLite database
│   └── sources/                   # Cached source documents
│       ├── pdf/
│       ├── html/
│       └── text/
├── scripts/
│   ├── lib/
│   │   ├── knowledge-db.mjs       # Core DB module
│   │   ├── file-utils.mjs         # File discovery
│   │   ├── mdx-utils.mjs          # MDX parsing
│   │   └── output.mjs             # Terminal formatting
│   ├── scan-content.mjs           # Content scanner
│   └── generate-summaries.mjs     # AI summarization
├── src/content/docs/              # Source MDX files
└── .env                           # API credentials

Workflow Examples

After Editing Content

# 1. Scan for changes (fast, uses hash comparison)
npm run crux -- analyze scan

# 2. Generate summaries for new/changed articles
npm run crux -- generate summaries --resummary

Bulk Initial Setup

# 1. Scan all content
node scripts/scan-content.mjs --force

# 2. Generate summaries in batches
npm run crux -- generate summaries --batch 100
npm run crux -- generate summaries --batch 100
npm run crux -- generate summaries --batch 100
# ... repeat until done

Check Database State

# View statistics
node scripts/scan-content.mjs --stats

# Or programmatically
node -e "

console.log(JSON.stringify(getStats(), null, 2));
"

Exporting to YAML (Resources System)

The database serves as a processing layer. Canonical data is exported to YAML files for the site build.

Export Resources

# Export cited sources to resources.yaml
node scripts/export-resources.mjs

# Export ALL sources (including uncited)
node scripts/export-resources.mjs --all

# Preview without writing
node scripts/export-resources.mjs --dry-run

The export script:

  • Reads sources from SQLite with their AI summaries
  • Merges with existing src/data/resources.yaml (preserves manual edits)
  • Includes cited_by references showing which articles cite each source

Using Resources in MDX

Once resources are in resources.yaml, you can reference them semantically:


Recent work on AI control <ResourceCite id="ai-control-2023" /> shows...

See also: <ResourceLink id="superintelligence-2014" />

## Key Papers

<ResourceList
  ids={["ai-control-2023", "concrete-problems-2016"]}
  showSummaries
/>

Resource Schema

Resources in resources.yaml have this structure:

- id: ai-control-2023
  url: https://arxiv.org/abs/2312.06942
  title: "AI Control: Improving Safety..."
  authors: ["Ryan Greenblatt", "Buck Shlegeris"]
  published_date: "2023-12"
  type: paper  # paper, blog, book, report, talk, podcast, government, reference, web
  summary: "AI-generated summary..."
  key_points:
    - "Point 1"
    - "Point 2"
  cited_by:
    - agentic-ai
    - ai-control

Source Fetching

The system now includes automatic source fetching via the Firecrawl API. This enables:

  1. Citation verification - Check that quoted text actually appears in sources
  2. Content caching - Store fetched pages for offline access
  3. Quote validation - Catch hallucinated quotes during page creation
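
A rough sketch of how the stored data supports quote validation, assuming sources.getForArticle returns each row joined with its citation_context and fetched content (the real pipeline may differ):

// Check that each stored citation quote actually appears in the fetched source text
import { sources } from './scripts/lib/knowledge-db.mjs';

const cited = sources.getForArticle('deceptive-alignment');
for (const src of cited) {
  if (src.fetch_status !== 'fetched' || !src.citation_context) continue;
  const verified = src.content.includes(src.citation_context);
  console.log(`${src.id}: ${verified ? 'quote verified' : 'quote NOT found'}`);
}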

Standalone Fetching

# Fetch pending sources (up to 20)
node scripts/utils/fetch-sources.mjs --batch 20

# Fetch a specific source by ID
node scripts/utils/fetch-sources.mjs --id abc123def456

# Retry failed sources
node scripts/utils/fetch-sources.mjs --retry-failed

# Preview without fetching
node scripts/utils/fetch-sources.mjs --dry-run

Integrated with Page Creation

The page creator pipeline automatically:

  1. Extracts citation URLs from Perplexity research
  2. Registers them in the sources table
  3. Fetches content via Firecrawl (up to 15 per page)
  4. Uses fetched content for quote verification

# Standard tier includes source fetching
node scripts/content/page-creator.mjs "Topic" --tier standard

Rate Limiting

Firecrawl's free tier allows roughly 10 requests per minute. The system waits 7 seconds between requests to stay under that limit.
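
In code, that pacing is just a fixed delay between requests. A simplified sketch (fetchSource is a hypothetical wrapper around the Firecrawl call; the real script's batching may differ):

// Pace requests at one per ~7s to stay under the ~10 requests/minute free tier
import { sources } from './scripts/lib/knowledge-db.mjs';

const DELAY_MS = 7000;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

for (const source of sources.getPending(20)) {
  await fetchSource(source); // hypothetical: scrapes the URL via Firecrawl and stores the result
  await sleep(DELAY_MS);
}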

Fetch Status Values

| Status | Meaning |
|---|---|
| pending | Not yet attempted |
| fetched | Successfully retrieved and stored |
| failed | Fetch attempted but failed (error stored) |
| manual | Requires manual handling (paywalled, etc.) |

Limitations

  1. Claims extraction minimal: The claims table exists but extraction is not fully implemented
  2. Local only: Database is gitignored and must be regenerated on each machine
  3. No incremental summary updates: Summaries are regenerated from scratch, not updated
  4. Firecrawl API key required: Source fetching requires FIRECRAWL_KEY in .env

Future Enhancements

Potential improvements to the system:

  • PDF parsing for academic papers
  • Claims extraction and consistency checking across articles
  • Similarity search using embeddings
  • Migration of entity sources to use resource IDs
  • Integration with content validation tools
