# Content Database System
The wiki uses a multi-layer storage architecture. There is no single database — different kinds of data live in the storage layer best suited for them.
## Storage Layers

```mermaid
flowchart TD
subgraph Durable["Durable Storage"]
PG[("PostgreSQL
(wiki-server)")]
KB[("KB YAML
packages/kb/data/things/")]
YAML[("YAML Files
data/")]
MDX[("MDX Pages
content/docs/")]
end
subgraph Transient["Transient / Session"]
LRU["In-Memory LRU Cache
(500 entries, per-process)"]
SOURCES[".cache/sources/
(fetched HTML/text)"]
HASHES[".cache/content-hashes.json
(change detection)"]
end
subgraph Build["Build Artifacts"]
JSON[("database.json
YAML + MDX → JSON")]
end
KB --> JSON
YAML --> JSON
MDX --> JSON
PG -.->|"Hono RPC API"| LRU
LRU -.->|"cache miss"| PG
```

## 1. PostgreSQL (wiki-server)
The wiki-server runs a PostgreSQL database that stores all structured data requiring durability and cross-machine access. This replaced the earlier local SQLite database (`.cache/knowledge.db`), which was retired in February 2026.
What it stores:
| Table | Purpose |
|---|---|
| `citation_content` | Full text of fetched source URLs (for quote verification) |
| `citation_audits` | Per-page citation verification results |
| `claims` | Extracted atomic claims with source references |
| `facts` | Canonical facts with values and computed expressions |
| `resources` | External resource metadata (papers, blogs, reports) |
| `entities` | Entity metadata synced from YAML |
| `agent_sessions` | Claude Code session logs |
| `edit_logs` | Per-page edit history |
| `hallucination_evals` | Hallucination detection results |
Access pattern: All access goes through the wiki-server's Hono RPC API. CLI tools use `apiRequest()` from `crux/lib/wiki-server/`. The frontend uses typed RPC clients (e.g., `getFactsRpcClient()`).
```bash
# Example CLI commands that read/write PostgreSQL
pnpm crux citations verify <page-id>  # Verify citations → writes audit results
pnpm crux query entity <id>           # Read entity data
pnpm crux query search "topic"        # Full-text search
```
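To make the access pattern concrete, the helper can be sketched as below. This is a hypothetical illustration of an `apiRequest()`-style wrapper, not the actual interface in `crux/lib/wiki-server/`; the paths and option fields are assumptions.

```typescript
// Hypothetical sketch: all CLI reads and writes go over the wiki-server's
// HTTP API rather than opening a direct PostgreSQL connection.
// Field names and paths here are illustrative assumptions.
type ApiOptions = { method?: "GET" | "POST"; body?: unknown };

function buildApiRequest(baseUrl: string, path: string, opts: ApiOptions = {}) {
  return {
    url: new URL(path, baseUrl).toString(),
    init: {
      method: opts.method ?? "GET",
      headers: { "content-type": "application/json" },
      body: opts.body === undefined ? undefined : JSON.stringify(opts.body),
    },
  };
}

// A real helper would then call fetch(url, init) and parse the JSON response.
```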
## 2. In-Memory LRU Cache
Source fetching uses a session-scoped in-memory cache (`crux/lib/citation-content-cache.ts`) to avoid redundant network requests and database lookups within a single process.
| Property | Value |
|---|---|
| Max entries | 500 |
| Eviction | Least Recently Used |
| Scope | Per-process (cleared on exit) |
| Persistence | None — purely ephemeral |
When fetching a URL, the system checks, in order:

1. In-memory LRU cache (fastest)
2. PostgreSQL `citation_content` table (durable)
3. Network fetch via Firecrawl or built-in fallback (slowest)

Results are written back to both the LRU cache and PostgreSQL.
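The eviction behaviour described above (500-entry cap, least-recently-used eviction, per-process scope) can be sketched with a plain `Map`. This is illustrative only; the real implementation lives in `crux/lib/citation-content-cache.ts`.

```typescript
// Minimal LRU sketch: a Map iterates in insertion order, so re-inserting an
// entry on access keeps the first key as the least recently used one.
class LruCache<V> {
  private entries = new Map<string, V>();
  constructor(private maxEntries = 500) {}

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark this entry as most recently used.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in insertion order is the least recently used entry.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}
```

Because the cache holds fetched page content, evicted entries are simply re-read from the durable PostgreSQL tier on the next lookup.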
## 3. KB YAML (packages/kb/)
The Knowledge Base package (`packages/kb/`) is the authoritative source for structured entity facts — valuations, revenue, headcounts, founding dates, and other typed properties. As of March 2026, 9+ entities have been migrated here from the older `data/facts/` system.
| Path | Content |
|---|---|
| `packages/kb/data/things/*.yaml` | Entity facts with typed properties, time series, sources |
| `packages/kb/data/schemas/` | Property schemas (60 properties across orgs, people, AI models, etc.) |
KB facts are rendered on wiki pages via `<FBF>` and `<FBFactValue>` components, and computed values via `<Calc>`. See Data System Authority Rules for which system is authoritative for which entities.
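For orientation, an entity file with typed properties and a sourced time series might look roughly like the following. Every field name here is hypothetical; the actual shape is defined by the schemas in `packages/kb/data/schemas/`.

```yaml
# Hypothetical example only; real field names come from packages/kb/data/schemas/.
id: example-lab
type: organization
properties:
  founding_date: 2015
  headcount:
    # A typed time series, each point carrying a source reference
    - { date: 2024-12, value: 300, source: example-annual-report-2024 }
    - { date: 2025-12, value: 450, source: example-annual-report-2025 }
```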
## 4. YAML Files (data/)
Human-editable YAML files are the source of truth for content metadata:
| Directory | Content |
|---|---|
| `data/entities/` | Entity definitions (type, description, relations) |
| `data/facts/` | Legacy facts (deprecated for entities migrated to KB) |
| `data/resources/` | External resource metadata |
| `data/graphs/` | Cause-effect graph data |
| `data/edit-logs/` | Per-page edit history |
| `data/citation-archive/` | Per-page citation verification YAML |
| `data/auto-update/` | Auto-update system configuration and state |
YAML files are checked into git and are the canonical source for everything they contain. PostgreSQL mirrors some of this data for API access and full-text search.
## 5. File-System Caches (.cache/)
Temporary files for local development workflows:
| Path | Purpose |
|---|---|
| `.cache/sources/` | Fetched source documents (HTML, text, PDF) |
| `.cache/content-hashes.json` | MD5 hashes for change detection during scans |
These are gitignored and can be deleted without data loss.
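The hash-and-compare pattern behind `.cache/content-hashes.json` can be sketched as follows. Only the MD5-based change detection is taken from the text above; the index shape and file names are assumptions, not the real format.

```typescript
import { createHash } from "node:crypto";

// Sketch of hash-based change detection: store one MD5 per file, and on the
// next scan re-process only files whose hash differs from the stored one.
// The Record shape below is an assumption, not the real content-hashes.json.
type HashIndex = Record<string, string>;

function md5(content: string): string {
  return createHash("md5").update(content).digest("hex");
}

// Returns the paths whose content changed since the previous scan.
function changedFiles(files: Record<string, string>, previous: HashIndex): string[] {
  return Object.entries(files)
    .filter(([path, content]) => previous[path] !== md5(content))
    .map(([path]) => path);
}
```

Deleting the cache file is safe: every file then hashes as "changed" and is simply re-scanned.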
## 6. Build Artifact (database.json)
The build pipeline (`apps/web/scripts/build-data.mjs`) compiles YAML + MDX frontmatter into `apps/web/src/data/database.json`. This single JSON file contains all entities, pages, relations, facts, search data, and statistics needed by the Next.js frontend.
```bash
pnpm build-data          # Full build (~2 min)
pnpm build-data:content  # Content-only rebuild (~15s)
```
The JSON is loaded at server startup with lazy-built indexes (see Architecture).
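The lazy-index pattern can be sketched like this; the `Database` shape and field names below are illustrative assumptions, not the schema emitted by `build-data.mjs`.

```typescript
// Sketch of lazy index building: database.json is parsed once at startup,
// and per-field indexes are built only on first use, so startup stays cheap.
// The Database shape here is an assumption for illustration.
type Entity = { id: string; type: string };
type Database = { entities: Entity[] };

function makeEntityLookup(db: Database) {
  let byId: Map<string, Entity> | null = null; // built lazily on first call
  return (id: string): Entity | undefined => {
    if (byId === null) {
      byId = new Map(db.entities.map((e) => [e.id, e]));
    }
    return byId.get(id);
  };
}
```

The trade-off is that the first lookup pays the index-build cost; every later lookup is an O(1) map hit.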
## Data Flow

```mermaid
flowchart LR
subgraph Edit["Authoring"]
AUTHOR["Human or AI
edits YAML/MDX"]
end
subgraph Pipeline["Build Pipeline"]
BUILD["build-data.mjs"]
end
subgraph Serve["Runtime"]
NEXT["Next.js
reads database.json"]
API["Wiki-server API
reads PostgreSQL"]
end
AUTHOR -->|"git push"| BUILD
BUILD -->|"database.json"| NEXT
AUTHOR -->|"crux citations verify"| API
API -->|"search, facts, claims"| NEXT
```

## Source Fetching Flow
When verifying citations or fetching content for page improvement:
```mermaid
sequenceDiagram
participant CLI as Crux CLI
participant Cache as LRU Cache
participant PG as PostgreSQL
participant Net as Network
CLI->>Cache: getCachedContent(url)
alt Cache hit
Cache-->>CLI: cached content
else Cache miss
CLI->>PG: query citation_content
alt DB hit
PG-->>CLI: stored content
CLI->>Cache: setCachedContent(url)
else DB miss
CLI->>Net: fetch via Firecrawl / fallback
Net-->>CLI: raw content
CLI->>PG: saveFetchResultToPostgres(url)
CLI->>Cache: setCachedContent(url)
end
end
```

## CLI Commands
| Command | Purpose |
|---|---|
| `pnpm crux citations verify <page-id>` | Verify all citations on a page |
| `pnpm crux citations audit` | Run citation audits across pages |
| `pnpm crux scan-content` | Scan MDX files for content analysis |
| `pnpm crux query search "topic"` | Full-text search via wiki-server |
| `pnpm crux query entity <id>` | Look up entity data |
| `pnpm crux query related <id>` | Find related pages |
| `pnpm crux context for-page <id>` | Full research context for a page |
| `pnpm build-data` | Rebuild database.json from YAML + MDX |
| `pnpm build-data:content` | Content-only rebuild (≈15s) |
## Limitations
- No offline PostgreSQL access: CLI commands that query the wiki-server require network connectivity
- LRU cache is session-scoped: restarting a process loses cached content (by design; PostgreSQL is the durable tier)
- database.json must be rebuilt: changes to YAML or MDX frontmatter are not visible to the frontend until `build-data` runs
- Citation content is append-mostly: old fetched content is not automatically refreshed
## Related
- Architecture — System overview and design patterns
- Automation Tools — Full CLI reference
- Data System Authority Rules — Which data system is authoritative for each entity