Specific versus General Principles for Constitutional AI
paperAuthors
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This paper is directly relevant to Anthropic's Constitutional AI framework and offers empirical guidance on how to craft principles for AI oversight, useful for researchers designing RLHF or CAI pipelines.
Paper Details
Metadata
Abstract
Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.
Summary
This paper investigates whether Constitutional AI works better with specific, detailed principles or more general, abstract ones when training AI systems to be helpful, harmless, and honest. It compares the effectiveness of different principle formulations in guiding AI behavior and evaluates their impact on model alignment and safety properties.
Key Points
- •Examines trade-offs between specific vs. general constitutional principles in guiding AI behavior during RLHF/CAI training
- •Tests whether detailed, concrete rules or broader abstract principles produce better-aligned and safer AI outputs
- •Provides empirical analysis of how principle specificity affects model responses to harmful and edge-case prompts
- •Contributes to understanding of how to design constitutions for Constitutional AI systems more effectively
- •Relevant to scaling alignment techniques that rely on principle-based feedback rather than human annotation
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Epistemic Sycophancy | Risk | 60.0 |
Cached Content Preview
# Specific versus General Principles for Constitutional AI
Sandipan Kundu, Yuntao Bai, Saurav Kadavath &Amanda Askell,
Andrew Callahan,
Anna Chen,
Anna Goldie,
Avital Balwit,
Azalia Mirhoseini,
Correspondence to: {sandipan,jared}@anthropic.com
Author contributions are detailed in [7](https://arxiv.org/html/2310.13798#S7 "7 Contribution Statement ‣ Specific versus General Principles for Constitutional AI"). Brayden McLean,
Catherine Olsson,
Cassie Evraets,
Eli Tran-Johnson,
Esin Durmus,
Ethan Perez,
Jackson Kernion,
Jamie Kerr,
Kamal Ndousse,
Karina Nguyen,
Nelson Elhage,
Newton Cheng,
Nicholas Schiefer,
Nova DasSarma,
Oliver Rausch,
Robin Larson,
Shannon Yang,
Shauna Kravec,
Timothy Telleen-Lawton,
Thomas I. Liao,
Tom Henighan,
Tristan Hume,
Zac Hatfield-Dodds, Sören Mindermann,
Nicholas Joseph,
Sam McCandlish,
Jared Kaplan11footnotemark: 1\\AND
AnthropicDepartment of Computer Science, University of Oxford
###### Abstract
Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as “do what’s best for humanity”. We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.
## 1 Introduction
The method of Constitutional AI (CAI) \[ [1](https://arxiv.org/html/2310.13798#bib.bib1 "")\] makes it possible to train a harmless AI assistant via self-improvement, without requiring any human supervision to identify harmful outputs. It generalizes Reinforcement Learning from Human Feedback (RLHF) \[ [2](https://arxiv.org/html/2310.13798#bib.bib2 "")\] for large language models \[ [3](https://arxiv.org/html/2310.13798#bib.bib3 "")\] by essentially replacing human feedback with feedback from an AI system prompted with a short list of principles, the “constitution”. This allows for more precise control of AI behavior with only minimal human input, but raises a variety of questions:
- •
CAI uses a list of explicit principles to mold AI behavior via self-supervision. But this begs the immediate question of how AI behavior depends on the specific principles we use, and wheth
... (truncated, 98 KB total)0f04a85e10fdac20 | Stable ID: MGZlNjgzMm