[PDF] A Comprehensive Survey - AI Alignment

web
alignmentsurvey.com·alignmentsurvey.com/uploads/AI-Alignment-A-Comprehensive-...
Metadata

Cited by 1 page

Page	Type	Quality
Goal Misgeneralization	Risk	63.0
Cached Content Preview

HTTP 200 · Fetched May 1, 2026 · 98 KB
AI Alignment: A Comprehensive Survey

Jiaming Ji (\*, 1), Tianyi Qiu (\*, 1), Boyuan Chen (\*, 1), Borong Zhang (\*, 1), Hantao Lou (1), Kaile Wang (1),
Yawen Duan (2), Zhonghao He (2), Jiayi Zhou (1), Zhaowei Zhang (1), Fanzhi Zeng (1), Juntao Dai (1),
Xuehai Pan (1), Kwan Yee Ng, Aidan O’Gara (5), Hua Xu (1), Brian Tse, Jie Fu (4), Stephen McAleer (3),
Yaodong Yang (1), Yizhou Wang (1), Song-Chun Zhu (1), Yike Guo (4), Wen Gao (1)

(1) Peking University
(2) University of Cambridge
(3) Carnegie Mellon University
(4) Hong Kong University of Science and Technology
(5) University of Southern California

Abstract

AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems
grow more capable, so do risks from misalignment. To provide a comprehensive and up-to-date overview
of the alignment field, in this survey, we delve into the core concepts, methodology, and practice of alignment. First, we identify four principles as the key objectives of AI alignment: Robustness, Interpretability,
Controllability, and Ethicality (RICE). Guided by these four principles, we outline the landscape of current alignment research and decompose it into two key components: forward alignment and backward alignment. The former aims to make AI systems aligned via alignment training, while the latter
aims to gain evidence about the systems’ alignment and govern them appropriately to avoid exacerbating
misalignment risks. On forward alignment, we discuss techniques for learning from feedback and learning
under distribution shift. Specifically, we survey traditional preference modeling methods and reinforcement learning from human feedback, and further discuss potential frameworks to reach scalable oversight
for tasks where effective human oversight is hard to obtain. Within learning under distribution shift, we
also cover data distribution interventions such as adversarial training that help expand the distribution of
training data, and algorithmic interventions to combat goal misgeneralization. On backward alignment,
we discuss assurance techniques and governance practices. Specifically, we survey assurance methods
of AI systems throughout their lifecycle, covering safety evaluation, interpretability, and human value
compliance. We discuss current and prospective governance practices adopted by governments, industry
actors, and other third parties, aimed at managing existing and future AI risks.

This survey aims to provide a comprehensive yet beginner-friendly review of alignment research topics.
Based on this, we also release and continually update the website [www.alignmentsurvey.com](http://www.alignmentsurvey.com/)
which features tutorials, collections of papers, blog posts, and other resources.

* * *

Contents

1 Introduction

1.1 The Motivation for Alignment
1.1.1 Risks of Misalignment
1.1.2 Causes of Misalignment
1.2 The Scope of Alignment
1.2.1 The Alignment Cycle: A Framework of Alignment

... (truncated, 98 KB total)
Resource ID: c732e89c422e34d7 | Stable ID: sid_Au7eL3yC5Q