Content Database System
This project includes a SQLite-based content database for managing article metadata, tracking external source references, and generating AI-powered summaries. The database is stored locally in .cache/knowledge.db (gitignored) and serves as a foundation for content quality tooling.
Quick Start
# Scan all content files and populate database
npm run crux -- analyze scan
# Generate AI summaries for articles
npm run crux -- generate summaries
# Export sources to resources.yaml
node scripts/utils/export-resources.mjs
# View database statistics
node scripts/scan-content.mjs --stats
Summary generation requires `ANTHROPIC_API_KEY` in the `.env` file.
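A minimal `.env` might look like this (placeholder values; `FIRECRAWL_KEY` is only needed for the source fetching described later):

```
ANTHROPIC_API_KEY=sk-ant-...
FIRECRAWL_KEY=fc-...
```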
Architecture
Database Schema
articles
Stores content extracted from MDX files.
| Column | Type | Description |
|---|---|---|
| id | TEXT PRIMARY KEY | Entity ID from filename |
| path | TEXT | Relative path to source file |
| title | TEXT | Article title from frontmatter |
| description | TEXT | Article description |
| content | TEXT | Plain text content (JSX removed) |
| word_count | INTEGER | Word count for prioritization |
| quality | INTEGER | Quality rating from frontmatter |
| content_hash | TEXT | MD5 hash for change detection |
| created_at | TEXT | When article was first indexed |
| updated_at | TEXT | When article was last updated |
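For orientation, the table above corresponds to DDL along these lines (a sketch only; it assumes the `better-sqlite3` driver, and the real schema lives in `scripts/lib/knowledge-db.mjs`):

```js
import Database from 'better-sqlite3';

const db = new Database('.cache/knowledge.db');
db.exec(`
  CREATE TABLE IF NOT EXISTS articles (
    id            TEXT PRIMARY KEY,  -- entity ID from filename
    path          TEXT,
    title         TEXT,
    description   TEXT,
    content       TEXT,              -- plain text, JSX removed
    word_count    INTEGER,
    quality       INTEGER,
    content_hash  TEXT,              -- MD5, for change detection
    created_at    TEXT,
    updated_at    TEXT
  )
`);
```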
sources
Stores metadata about external references discovered in articles.
| Column | Type | Description |
|---|---|---|
| id | TEXT PRIMARY KEY | SHA256 hash of URL/DOI (16 chars) |
| url | TEXT | Full URL of external source |
| doi | TEXT | Digital Object Identifier (if paper) |
| title | TEXT | Source title/headline |
| authors | TEXT (JSON) | Array of author names |
| year | INTEGER | Publication year |
| source_type | TEXT | Type: paper, blog, report, web, etc. |
| content | TEXT | Fetched source content |
| fetch_status | TEXT | pending, fetched, failed, manual |
| fetched_at | TEXT | When source was last fetched |
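The 16-character ID in the first row could be derived roughly like this (a sketch; the exact URL/DOI normalization is an implementation detail of `knowledge-db.mjs`):

```js
import { createHash } from 'node:crypto';

// First 16 hex chars of the SHA256 of the URL (or DOI, for papers).
const sourceId = (urlOrDoi) =>
  createHash('sha256').update(urlOrDoi).digest('hex').slice(0, 16);
```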
article_sources
Junction table linking articles to their cited sources.
| Column | Type | Description |
|---|---|---|
| article_id | TEXT | Foreign key to articles |
| source_id | TEXT | Foreign key to sources |
| citation_context | TEXT | Quote where source is cited |
summaries
Stores AI-generated summaries of articles and sources.
| Column | Type | Description |
|---|---|---|
| entity_id | TEXT PRIMARY KEY | ID of summarized entity |
| entity_type | TEXT | 'article' or 'source' |
| one_liner | TEXT | Single-sentence summary (max 25 words) |
| summary | TEXT | 2-3 paragraph summary |
| key_points | TEXT (JSON) | 3-5 bullet points |
| key_claims | TEXT (JSON) | Claims with values |
| model | TEXT | Model used (e.g., claude-haiku-4-5) |
| tokens_used | INTEGER | Total tokens consumed |
| generated_at | TEXT | When summary was generated |
entity_relations
Entity relationships loaded from entities.yaml.
| Column | Type | Description |
|---|---|---|
| from_id | TEXT | Source entity ID |
| to_id | TEXT | Target entity ID |
| relationship | TEXT | Relationship type |
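Since relations are plain rows, they can also be queried directly when the module API below is not enough (again assuming `better-sqlite3`):

```js
import Database from 'better-sqlite3';

const db = new Database('.cache/knowledge.db', { readonly: true });
// All outgoing relations for one entity, e.g. 'deceptive-alignment'.
const related = db
  .prepare('SELECT to_id, relationship FROM entity_relations WHERE from_id = ?')
  .all('deceptive-alignment');
```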
Commands Reference
Scan Content
Scans all MDX files and populates the database.
# Standard scan (skips unchanged files)
npm run crux -- analyze scan
# Force rescan all files
node scripts/scan-content.mjs --force
# Show per-file progress
node scripts/scan-content.mjs --verbose
# Show database stats only
node scripts/scan-content.mjs --stats
What it does:
- Finds all `.mdx` and `.md` files in `src/content/docs/`
- Extracts frontmatter (title, description, quality, sources)
- Extracts plain text content (removes imports, JSX, HTML comments)
- Discovers URLs from markdown links and DOIs
- Infers source types from domains (arxiv.org → paper, lesswrong.com → blog)
- Loads entity relations from `entities.yaml`
- Skips unchanged files via content hash comparison (see the sketch below)
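A rough sketch of that last step, using `articles.hasChanged` from the API below (MD5 matches the `content_hash` column):

```js
import { createHash } from 'node:crypto';
import { articles } from './scripts/lib/knowledge-db.mjs';

// True when the extracted plain text differs from what is already indexed.
function needsRescan(articleId, plainText) {
  const newHash = createHash('md5').update(plainText).digest('hex');
  return articles.hasChanged(articleId, newHash);
}
```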
Generate Summaries
Generates AI summaries using the Anthropic API.
# Summarize 10 articles (default)
npm run crux -- generate summaries
# Summarize specific count
npm run crux -- generate summaries --batch 50
# Use higher-quality model
npm run crux -- generate summaries --model sonnet
# Summarize specific article
npm run crux -- generate summaries --id deceptive-alignment
# Re-summarize changed content
npm run crux -- generate summaries --resummary
# Preview without API calls
npm run crux -- generate summaries --dry-run
Models available:
| Model | ID | Cost (per 1M input tokens) | Use case |
|---|---|---|---|
| haiku | claude-haiku-4-5-20251001 | ≈$0.25 | Bulk summarization |
| sonnet | claude-sonnet-4-5-20250929 | ≈$3.00 | Higher quality |
| opus | claude-opus-4-6 | ≈$5.00 | Best quality |
Cost estimates:
| Task | Model | Estimated Cost |
|---|---|---|
| Summarize 311 articles | Haiku | ≈$2-3 |
| Summarize 793 sources | Haiku | ≈$10-15 |
| Single article improvement | Sonnet | ≈$0.20 |
Database Statistics
Display database statistics.
node scripts/scan-content.mjs --stats
Core Module API
The database is accessed via scripts/lib/knowledge-db.mjs.
Import
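The exact export list isn't pinned down here; judging by the APIs below, the import presumably looks like:

```js
import {
  articles,
  sources,
  summaries,
  getResearchContext,
  getStats,
} from './scripts/lib/knowledge-db.mjs';
```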
articles API
// Get article by ID
const article = articles.get('deceptive-alignment');
// Get article with its summary
const withSummary = articles.getWithSummary('deceptive-alignment');
// Get all articles
const all = articles.getAll();
// Find articles needing summaries
const unsummarized = articles.needingSummary();
// Search articles
const results = articles.search('reward hacking');
// Check if content changed
const changed = articles.hasChanged('id', newHash);
// Insert/update article
articles.upsert({
id: 'my-article',
path: 'knowledge-base/risks/my-article.mdx',
title: 'My Article',
description: 'Description here',
content: 'Full text content...',
word_count: 1500,
quality: 3,
content_hash: 'abc123...'
});
sources API
// Get source by ID or URL
const source = sources.get('abc123def456');
const byUrl = sources.getByUrl('https://arxiv.org/...');
// Get sources for an article
const articleSources = sources.getForArticle('deceptive-alignment');
// Get pending sources for fetching
const pending = sources.getPending(100);
// Link source to article
sources.linkToArticle('article-id', 'source-id', 'citation context');
// Mark source fetch status
sources.markFetched('source-id', 'content...');
sources.markFailed('source-id', 'Error message');
// Get statistics
const stats = sources.stats();
// { total: 793, pending: 793, fetched: 0, failed: 0, manual: 0 }
summaries API
// Get summary by entity ID
const summary = summaries.get('deceptive-alignment');
// Get all summaries of a type
const articleSummaries = summaries.getAll('article');
// Insert/update summary
summaries.upsert('entity-id', 'article', {
oneLiner: 'Single sentence...',
summary: 'Full summary...',
keyPoints: ['Point 1', 'Point 2'],
keyClaims: [{ claim: '...', value: '...' }],
model: 'claude-haiku-4-5-20251001',
tokensUsed: 1247
});
// Get statistics
const stats = summaries.stats();
// { article: { count: 311, tokens: 387000 }, source: { count: 0, tokens: 0 } }
Research Context
Get comprehensive context for improving an article:
const context = getResearchContext('deceptive-alignment');
// Returns:
// {
// article: { ...article, summary: {...} },
// relatedArticles: [...],
// sources: [...],
// claims: [...],
// stats: { relatedCount, sourcesTotal, sourcesFetched, claimsCount }
// }
Source Type Inference
When scanning content, source types are inferred from domains:
| Domain Pattern | Inferred Type |
|---|---|
| arxiv.org, doi.org, nature.com | paper |
| lesswrong.com, alignmentforum.org | blog |
| substack.com | blog |
| .gov, congress.gov, whitehouse.gov | government |
| wikipedia.org | reference |
| .pdf (any domain) | report |
| (default) | web |
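A function implementing exactly the table above might look like this (illustrative; the function name and structure are assumptions):

```js
function inferSourceType(url) {
  const { hostname, pathname } = new URL(url);
  const host = (domain) => hostname === domain || hostname.endsWith('.' + domain);
  if (pathname.toLowerCase().endsWith('.pdf')) return 'report'; // any domain
  if (host('arxiv.org') || host('doi.org') || host('nature.com')) return 'paper';
  if (host('lesswrong.com') || host('alignmentforum.org') || host('substack.com')) return 'blog';
  if (hostname.endsWith('.gov')) return 'government';
  if (host('wikipedia.org')) return 'reference';
  return 'web';
}
```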
Directory Structure
project/
├── .cache/ # Gitignored
│ ├── knowledge.db # SQLite database
│ └── sources/ # Cached source documents
│ ├── pdf/
│ ├── html/
│ └── text/
├── scripts/
│ ├── lib/
│ │ ├── knowledge-db.mjs # Core DB module
│ │ ├── file-utils.mjs # File discovery
│ │ ├── mdx-utils.mjs # MDX parsing
│ │ └── output.mjs # Terminal formatting
│ ├── scan-content.mjs # Content scanner
│ └── generate-summaries.mjs # AI summarization
├── src/content/docs/ # Source MDX files
└── .env # API credentials
Workflow Examples
After Editing Content
# 1. Scan for changes (fast, uses hash comparison)
npm run crux -- analyze scan
# 2. Generate summaries for new/changed articles
npm run crux -- generate summaries --resummary
Bulk Initial Setup
# 1. Scan all content
node scripts/scan-content.mjs --force
# 2. Generate summaries in batches
npm run crux -- generate summaries --batch 100
npm run crux -- generate summaries --batch 100
npm run crux -- generate summaries --batch 100
# ... repeat until done
Check Database State
# View statistics
node scripts/scan-content.mjs --stats
# Or programmatically
node -e "
console.log(JSON.stringify(getStats(), null, 2));
"
Exporting to YAML (Resources System)
The database serves as a processing layer. Canonical data is exported to YAML files for the site build.
Export Resources
# Export cited sources to resources.yaml
node scripts/export-resources.mjs
# Export ALL sources (including uncited)
node scripts/export-resources.mjs --all
# Preview without writing
node scripts/export-resources.mjs --dry-run
The export script:
- Reads sources from SQLite with their AI summaries
- Merges with existing `src/data/resources.yaml` (preserves manual edits; see the merge sketch below)
- Includes `cited_by` references showing which articles cite each source
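The merge rule could be as simple as this sketch (assuming `js-yaml`; the field-level behavior of the real script may differ):

```js
import fs from 'node:fs';
import yaml from 'js-yaml';

// dbRows: source rows read from SQLite, already shaped like resource entries.
function mergeResources(dbRows, path = 'src/data/resources.yaml') {
  const existing = yaml.load(fs.readFileSync(path, 'utf8')) ?? [];
  const byId = new Map(existing.map((r) => [r.id, r]));
  for (const row of dbRows) {
    // Spread order lets manually edited YAML fields win over DB values.
    byId.set(row.id, { ...row, ...byId.get(row.id) });
  }
  fs.writeFileSync(path, yaml.dump([...byId.values()]));
}
```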
Using Resources in MDX
Once resources are in resources.yaml, you can reference them semantically:
Recent work on AI control <ResourceCite id="ai-control-2023" /> shows...
See also: <ResourceLink id="superintelligence-2014" />
## Key Papers
<ResourceList
ids={["ai-control-2023", "concrete-problems-2016"]}
showSummaries
/>
Resource Schema
Resources in resources.yaml have this structure:
- id: ai-control-2023
  url: https://arxiv.org/abs/2312.06942
  title: "AI Control: Improving Safety..."
  authors: ["Ryan Greenblatt", "Buck Shlegeris"]
  published_date: "2023-12"
  type: paper # paper, blog, book, report, talk, podcast, government, reference, web
  summary: "AI-generated summary..."
  key_points:
    - "Point 1"
    - "Point 2"
  cited_by:
    - agentic-ai
    - ai-control
Source Fetching
The system now includes automatic source fetching via the Firecrawl API. This enables:
- Citation verification - Check that quoted text actually appears in sources
- Content caching - Store fetched pages for offline access
- Quote validation - Catch hallucinated quotes during page creation (sketched below)
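Quote verification can be as simple as a normalized substring check; a naive sketch (the real pipeline may match more loosely):

```js
// True if the quoted text appears in the fetched source content,
// ignoring case and whitespace differences.
function quoteAppears(quote, sourceContent) {
  const norm = (s) => s.toLowerCase().replace(/\s+/g, ' ').trim();
  return norm(sourceContent).includes(norm(quote));
}
```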
Standalone Fetching
# Fetch pending sources (up to 20)
node scripts/utils/fetch-sources.mjs --batch 20
# Fetch a specific source by ID
node scripts/utils/fetch-sources.mjs --id abc123def456
# Retry failed sources
node scripts/utils/fetch-sources.mjs --retry-failed
# Preview without fetching
node scripts/utils/fetch-sources.mjs --dry-run
Integrated with Page Creation
The page creator pipeline automatically:
- Extracts citation URLs from Perplexity research
- Registers them in the sources table
- Fetches content via Firecrawl (up to 15 per page)
- Uses fetched content for quote verification
# Standard tier includes source fetching
node scripts/content/page-creator.mjs "Topic" --tier standard
Rate Limiting
Firecrawl free tier allows ~10 requests/minute. The system waits 7 seconds between requests to stay under limits.
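That spacing works out to roughly 8.5 requests/minute. A minimal sketch of the fetch loop (`fetchViaFirecrawl` is a hypothetical helper; `sources.getPending` is from the module API above):

```js
import { sources } from './scripts/lib/knowledge-db.mjs';

const FETCH_DELAY_MS = 7000; // ~8.5 requests/minute, under the ~10/min cap
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

for (const source of sources.getPending(20)) {
  await fetchViaFirecrawl(source); // hypothetical per-source fetch helper
  await sleep(FETCH_DELAY_MS);
}
```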
Fetch Status Values
| Status | Meaning |
|---|---|
| pending | Not yet attempted |
| fetched | Successfully retrieved and stored |
| failed | Fetch attempted but failed (error stored) |
| manual | Requires manual handling (paywalled, etc.) |
Limitations
- Claims extraction minimal: The claims table exists but extraction is not fully implemented
- Local only: Database is gitignored and must be regenerated on each machine
- No incremental summary updates: Summaries are regenerated from scratch, not updated
- Firecrawl API key required: Source fetching requires `FIRECRAWL_KEY` in `.env`
Future Enhancements
Potential improvements to the system:
- PDF parsing for academic papers
- Claims extraction and consistency checking across articles
- Similarity search using embeddings
- Migration of entity `sources` to use resource IDs
- Integration with content validation tools