Longterm Wiki

Constitutional AI: Harmlessness from AI Feedback (Anthropic)

Source type: web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: Anthropic

Foundational Anthropic paper introducing Constitutional AI, a widely-cited alignment technique used in Claude's training; highly relevant to scalable oversight, RLHF alternatives, and making AI values explicit.

Metadata

Importance: 85/100 · blog post · primary source

Summary

Constitutional AI (CAI) is Anthropic's method for training AI systems to be helpful and harmless using a set of principles ('constitution') rather than relying solely on human feedback for every judgment. The approach uses AI-generated critiques and revisions to reduce harmful outputs, combined with reinforcement learning from AI feedback (RLAIF). It demonstrates that safety and helpfulness can be improved simultaneously with reduced human labeling burden.

Key Points

  • Introduces a 'constitution'—a set of written principles—that guides AI self-critique and revision of its own outputs during training.
  • Uses RLAIF (Reinforcement Learning from AI Feedback) as a scalable alternative to purely human feedback for harmlessness training.
  • Shows that CAI-trained models can be both less harmful and less evasive than standard RLHF models, improving the helpfulness-harmlessness tradeoff.
  • The method increases transparency by making the normative values guiding AI behavior explicit and auditable in the constitution.
  • Relevant to scalable oversight and corrigibility: demonstrates AI systems can be trained to self-correct toward human-specified norms.
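The two training phases described above can be sketched in miniature. The following is a hedged illustration, not Anthropic's implementation: the `generate` function is a hypothetical stand-in for a real LLM API call, and the principles shown are paraphrases, not the actual constitution.

```python
# Sketch of Constitutional AI's two phases, assuming a hypothetical
# `generate(prompt)` placeholder in place of a real LLM API call.

CONSTITUTION = [
    "Choose the response that is least harmful or toxic.",
    "Choose the response that is most helpful and honest.",
]

def generate(prompt: str) -> str:
    # Placeholder for an LLM call; echoes a stub so the sketch runs end to end.
    return f"<model output for: {prompt[:40]}>"

def critique_and_revise(prompt: str, n_rounds: int = 2) -> str:
    """Supervised phase: the model critiques its own answer against a
    constitutional principle, then revises it; revised answers become
    supervised fine-tuning data."""
    response = generate(prompt)
    for i in range(n_rounds):
        principle = CONSTITUTION[i % len(CONSTITUTION)]
        critique = generate(
            f"Critique this response using the principle: {principle}\n"
            f"Response: {response}"
        )
        response = generate(
            f"Revise the response to address this critique.\n"
            f"Critique: {critique}\nOriginal: {response}"
        )
    return response

def rlaif_preference(prompt: str, resp_a: str, resp_b: str) -> int:
    """RLAIF phase: the AI, not a human, labels which of two responses
    better follows a principle; these labels train the reward model."""
    label = generate(
        f"Per the principle '{CONSTITUTION[0]}', which response to "
        f"'{prompt}' is better?\n(A) {resp_a}\n(B) {resp_b}"
    )
    return 0 if "(A)" in label else 1
```

The key design point mirrored here is that human oversight enters only through the written principles, while the per-example labeling work is delegated to the model itself.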

Cited by 2 pages

Page | Type | Quality
AI Disinformation | Risk | 54.0
Instrumental Convergence | Risk | 64.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 0 KB

Resource ID: ca0da848a3ad4301 | Stable ID: NmIwNTJlZD