Constitutional AI: Harmlessness from AI Feedback (Anthropic)
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Foundational Anthropic paper introducing Constitutional AI, a widely-cited alignment technique used in Claude's training; highly relevant to scalable oversight, RLHF alternatives, and making AI values explicit.
Summary
Constitutional AI (CAI) is Anthropic's method for training AI systems to be helpful and harmless using a set of principles ('constitution') rather than relying solely on human feedback for every judgment. The approach uses AI-generated critiques and revisions to reduce harmful outputs, combined with reinforcement learning from AI feedback (RLAIF). It demonstrates that safety and helpfulness can be improved simultaneously with reduced human labeling burden.
Key Points
- Introduces a 'constitution'—a set of written principles—that guides AI self-critique and revision of its own outputs during training.
- Uses RLAIF (Reinforcement Learning from AI Feedback) as a scalable alternative to purely human feedback for harmlessness training.
- Shows that CAI-trained models can be both less harmful and less evasive than standard RLHF models, improving the helpfulness-harmlessness tradeoff.
- The method increases transparency by making the normative values guiding AI behavior explicit and auditable in the constitution.
- Relevant to scalable oversight and corrigibility: demonstrates AI systems can be trained to self-correct toward human-specified norms.
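The supervised phase described above can be sketched as a simple loop: draft a response, then critique and revise it against each constitutional principle. This is a minimal illustrative sketch only; the `generate` stub, function names, and the two example principles are hypothetical placeholders, not Anthropic's actual implementation or constitution.

```python
# Hypothetical sketch of the CAI critique-and-revision loop (supervised phase).
# The revised responses would become fine-tuning data; a later RLAIF phase
# would use AI preference labels instead of human ones.

CONSTITUTION = [
    "Choose the response that is least harmful or offensive.",
    "Choose the response that is most helpful and honest.",
]

def generate(prompt: str) -> str:
    # Stand-in for a language-model call; a trivial echo so the sketch runs.
    return f"[model output for: {prompt}]"

def critique_and_revise(prompt: str, principles: list[str]) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response under the principle '{principle}':\n{response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nOriginal: {response}"
        )
    return response

revised = critique_and_revise("Explain how locks work.", CONSTITUTION)
```

The key design point is that each pass conditions the revision on both the principle and the critique, so harmful content is rewritten rather than replaced with a blanket refusal, which is why CAI models can be less evasive than standard RLHF models.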
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI Disinformation | Risk | 54.0 |
| Instrumental Convergence | Risk | 64.0 |