Specific versus General Principles for Constitutional AI

paper

2023·arXiv·arxiv.org/html/2310.13798

Authors

Sandipan Kundu·Yuntao Bai·Saurav Kadavath·Amanda Askell·Andrew Callahan·Anna Chen·Anna Goldie·Avital Balwit·Azalia Mirhoseini·Brayden McLean·Catherine Olsson·Cassie Evraets·Eli Tran-Johnson·Esin Durmus·Ethan Perez·Jackson Kernion·Jamie Kerr·Kamal Ndousse·Karina Nguyen·Nelson Elhage·Newton Cheng·Nicholas Schiefer·Nova DasSarma·Oliver Rausch·Robin Larson·Shannon Yang·Shauna Kravec·Timothy Telleen-Lawton·Thomas I. Liao·Tom Henighan·Tristan Hume·Zac Hatfield-Dodds·Sören Mindermann·Nicholas Joseph·Sam McCandlish·Jared Kaplan

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper is directly relevant to Anthropic's Constitutional AI framework and offers empirical guidance on how to craft principles for AI oversight, useful for researchers designing RLHF or CAI pipelines.

Paper Details

Citations

1 influential

Year

2023

arXiv:2310.13798 DOI:10.48550/arXiv.2310.13798 Semantic Scholar

Metadata

Importance: 62/100arxiv preprintprimary source

Abstract

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.

Summary

This paper investigates whether Constitutional AI works better with specific, detailed principles or more general, abstract ones when training AI systems to be helpful, harmless, and honest. It compares the effectiveness of different principle formulations in guiding AI behavior and evaluates their impact on model alignment and safety properties.

Key Points

•Examines trade-offs between specific vs. general constitutional principles in guiding AI behavior during RLHF/CAI training
•Tests whether detailed, concrete rules or broader abstract principles produce better-aligned and safer AI outputs
•Provides empirical analysis of how principle specificity affects model responses to harmful and edge-case prompts
•Contributes to understanding of how to design constitutions for Constitutional AI systems more effectively
•Relevant to scaling alignment techniques that rely on principle-based feedback rather than human annotation

Cited by 1 page

Page	Type	Quality
Epistemic Sycophancy	Risk	60.0

Cached Content Preview

HTTP 200Fetched Apr 7, 202698 KB

Specific versus General Principles for Constitutional AI 
 
 
 
 
 

 
 

 
 
 
 
 Specific versus General Principles for Constitutional AI

 
 
 Sandipan Kundu, Yuntao Bai, Saurav Kadavath & Amanda Askell,
Andrew Callahan,
Anna Chen,
Anna Goldie,
Avital Balwit,
Azalia Mirhoseini,
 
 Correspondence to: {sandipan,jared}@anthropic.com 
 Author contributions are detailed in 7 . 
    
 Brayden McLean,
Catherine Olsson,
Cassie Evraets,
Eli Tran-Johnson,
Esin Durmus,
Ethan Perez,
 
 
    
 Jackson Kernion,
Jamie Kerr,
Kamal Ndousse,
Karina Nguyen,
Nelson Elhage,
Newton Cheng,
 
 
    
 Nicholas Schiefer,
Nova DasSarma,
Oliver Rausch,
Robin Larson,
Shannon Yang,
Shauna Kravec,
 
 
    
 Timothy Telleen-Lawton,
Thomas I. Liao,
Tom Henighan,
Tristan Hume,
Zac Hatfield-Dodds, 
 
    
 Sören Mindermann,
Nicholas Joseph,
Sam McCandlish,
Jared Kaplan 1 1 footnotemark: 1 
 \AND 
 Anthropic 
 Department of Computer Science, University of Oxford 
 
 
 Abstract

 Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as “do what’s best for humanity”. We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely. 

 
 

 
 
 1 Introduction

 
 The method of Constitutional AI (CAI) [ 1 ] makes it possible to train a harmless AI assistant via self-improvement, without requiring any human supervision to identify harmful outputs. It generalizes Reinforcement Learning from Human Feedback (RLHF) [ 2 ] for large language models [ 3 ] by essentially replacing human feedback with feedback from an AI system prompted with a short list of principles, the “constitution”. This allows for more precise control of AI behavior with only minimal human input, but raises a variety of questions:

 
 
 • 
 
 CAI uses a list of explicit principles to mold AI behavior via self-supervision. But this begs the immediate question of how AI behavior depends on the specific principles we use, and whether a single principle like “do what’s best for humanity” can produce 1 1 1 We thank Ilya Sutskever for emphasizing

... (truncated, 98 KB total)

Resource ID: 0f04a85e10fdac20 | Stable ID: sid_LdnnAbNHyY