Constitutional AI: Harmlessness from AI Feedback
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Foundational Anthropic paper introducing Constitutional AI and RLAIF, directly influential on Claude's training methodology and a major contribution to scalable alignment research.
Metadata
Summary
Anthropic introduces Constitutional AI (CAI), a method for training AI systems to be harmless using a set of principles (a 'constitution') and AI-generated feedback rather than relying solely on human labelers. The approach uses a two-stage process: supervised learning from AI-critiqued revisions, followed by reinforcement learning from AI feedback (RLAIF). This reduces dependence on human feedback for identifying harmful outputs while maintaining helpfulness.
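The supervised stage can be pictured as a simple critique-and-revision loop. The sketch below is illustrative only: `generate` is a placeholder for any text-completion call, and the principle texts, prompt wording, and function names are assumptions, not Anthropic's actual prompts or code.

```python
# Minimal sketch of the CAI supervised phase: sample, self-critique, revise,
# then finetune on the revisions. `generate` is any text-completion callable;
# the principles and prompt wording below are illustrative, not Anthropic's.
import random
from typing import Callable

CONSTITUTION = [
    "Choose the response that is least harmful and most honest.",
    "Identify ways the response could be unethical, toxic, or dangerous.",
]

def critique_and_revise(
    generate: Callable[[str], str], prompt: str, n_rounds: int = 2
) -> str:
    """Return a revised response suitable as a supervised finetuning target."""
    response = generate(prompt)
    for _ in range(n_rounds):
        # Sample a principle from the constitution and critique against it.
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        # Revise the response in light of the critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so it addresses the critique."
        )
    return response
```

The revised responses replace human-written demonstrations of harmlessness; the model is then finetuned on them before the RL phase begins.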
Key Points
- Introduces Constitutional AI: a self-improvement loop where the AI critiques and revises its own outputs, guided by a written set of principles.
- Uses RLAIF (Reinforcement Learning from AI Feedback) as a scalable alternative to RLHF for harmlessness training, as sketched after this list.
- Reduces the need for human-labeled harmful content, making safety training more scalable and transparent.
- The 'constitution' makes the AI's values explicit and auditable, improving the interpretability of alignment objectives.
- Trained models show strong harmlessness with little sacrifice in helpfulness, as validated by human evaluations.
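To make the RLAIF step concrete, here is a hedged sketch of the AI preference-labeling loop: a feedback model judges which of two sampled responses better satisfies the constitution, and the resulting pairs train a preference model whose score serves as the RL reward. The `generate` callable and prompt wording are assumptions carried over from the previous sketch, not the paper's implementation.

```python
# Illustrative sketch of AI preference labeling for RLAIF. A feedback model
# picks the more harmless of two sampled responses; the resulting
# (prompt, chosen, rejected) triples become training data for the preference
# model used as the RL reward signal.
from typing import Callable, Tuple

def label_preference(
    generate: Callable[[str], str], prompt: str, resp_a: str, resp_b: str
) -> Tuple[str, str, str]:
    """Return one (prompt, chosen, rejected) preference-model training triple."""
    verdict = generate(
        f"Prompt: {prompt}\n(A) {resp_a}\n(B) {resp_b}\n"
        "Which response is more harmless and honest? Answer A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return prompt, resp_a, resp_b  # A chosen, B rejected
    return prompt, resp_b, resp_a  # B chosen, A rejected
```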
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Concept | 62.0 |
| Racing Dynamics Impact Model | Analysis | 61.0 |
| Paul Christiano | Person | 39.0 |
| AI Value Lock-in | Risk | 64.0 |
| AI Proliferation | Risk | 60.0 |
Cached Content Preview
# Constitutional AI: Harmlessness from AI Feedback

Dec 15, 2022 · [Read Paper](https://arxiv.org/abs/2212.08073)

## Abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

## Policy Memo

[Constitutional AI Policy Memo](https://www-cdn.anthropic.com/7512771452629584566b6303311496c262da1006/Anthropic_ConstitutionalAI_v2.pdf)