Skip to content
Longterm Wiki
Back

Specific versus General Principles for Constitutional AI

paper

Authors

Sandipan Kundu·Yuntao Bai·Saurav Kadavath·Amanda Askell·Andrew Callahan·Anna Chen·Anna Goldie·Avital Balwit·Azalia Mirhoseini·Brayden McLean·Catherine Olsson·Cassie Evraets·Eli Tran-Johnson·Esin Durmus·Ethan Perez·Jackson Kernion·Jamie Kerr·Kamal Ndousse·Karina Nguyen·Nelson Elhage·Newton Cheng·Nicholas Schiefer·Nova DasSarma·Oliver Rausch·Robin Larson·Shannon Yang·Shauna Kravec·Timothy Telleen-Lawton·Thomas I. Liao·Tom Henighan·Tristan Hume·Zac Hatfield-Dodds·Sören Mindermann·Nicholas Joseph·Sam McCandlish·Jared Kaplan

Credibility Rating

3/5
Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper is directly relevant to Anthropic's Constitutional AI framework and offers empirical guidance on how to craft principles for AI oversight, useful for researchers designing RLHF or CAI pipelines.

Paper Details

Citations
45
1 influential
Year
2023

Metadata

Importance: 62/100arxiv preprintprimary source

Abstract

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.

Summary

This paper investigates whether Constitutional AI works better with specific, detailed principles or more general, abstract ones when training AI systems to be helpful, harmless, and honest. It compares the effectiveness of different principle formulations in guiding AI behavior and evaluates their impact on model alignment and safety properties.

Key Points

  • Examines trade-offs between specific vs. general constitutional principles in guiding AI behavior during RLHF/CAI training
  • Tests whether detailed, concrete rules or broader abstract principles produce better-aligned and safer AI outputs
  • Provides empirical analysis of how principle specificity affects model responses to harmful and edge-case prompts
  • Contributes to understanding of how to design constitutions for Constitutional AI systems more effectively
  • Relevant to scaling alignment techniques that rely on principle-based feedback rather than human annotation

Cited by 1 page

PageTypeQuality
Epistemic SycophancyRisk60.0

Cached Content Preview

HTTP 200Fetched Mar 20, 202698 KB
# Specific versus General Principles for Constitutional AI

Sandipan Kundu, Yuntao Bai, Saurav Kadavath &Amanda Askell,
Andrew Callahan,
Anna Chen,
Anna Goldie,
Avital Balwit,
Azalia Mirhoseini,
Correspondence to: {sandipan,jared}@anthropic.com

Author contributions are detailed in [7](https://arxiv.org/html/2310.13798#S7 "7 Contribution Statement ‣ Specific versus General Principles for Constitutional AI"). Brayden McLean,
Catherine Olsson,
Cassie Evraets,
Eli Tran-Johnson,
Esin Durmus,
Ethan Perez,
Jackson Kernion,
Jamie Kerr,
Kamal Ndousse,
Karina Nguyen,
Nelson Elhage,
Newton Cheng,
Nicholas Schiefer,
Nova DasSarma,
Oliver Rausch,
Robin Larson,
Shannon Yang,
Shauna Kravec,
Timothy Telleen-Lawton,
Thomas I. Liao,
Tom Henighan,
Tristan Hume,
Zac Hatfield-Dodds, Sören Mindermann,
Nicholas Joseph,
Sam McCandlish,
Jared Kaplan11footnotemark: 1\\AND

AnthropicDepartment of Computer Science, University of Oxford

###### Abstract

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as “do what’s best for humanity”. We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.

## 1 Introduction

The method of Constitutional AI (CAI) \[ [1](https://arxiv.org/html/2310.13798#bib.bib1 "")\] makes it possible to train a harmless AI assistant via self-improvement, without requiring any human supervision to identify harmful outputs. It generalizes Reinforcement Learning from Human Feedback (RLHF) \[ [2](https://arxiv.org/html/2310.13798#bib.bib2 "")\] for large language models \[ [3](https://arxiv.org/html/2310.13798#bib.bib3 "")\] by essentially replacing human feedback with feedback from an AI system prompted with a short list of principles, the “constitution”. This allows for more precise control of AI behavior with only minimal human input, but raises a variety of questions:

- •


CAI uses a list of explicit principles to mold AI behavior via self-supervision. But this begs the immediate question of how AI behavior depends on the specific principles we use, and wheth

... (truncated, 98 KB total)
Resource ID: 0f04a85e10fdac20 | Stable ID: MGZlNjgzMm