Longterm Wiki

AI Alignment: A Comprehensive Survey

paper

Authors

Ji, Jiaming·Qiu, Tianyi·Chen, Boyuan·Zhang, Borong·Lou, Hantao·Wang, Kaile·Duan, Yawen·He, Zhonghao·Vierling, Lukas·Hong, Donghai·Zhou, Jiayi·Zhang, Zhaowei·Zeng, Fanzhi·Dai, Juntao·Pan, Xuehai·Ng, Kwan Yee·O'Gara, Aidan·Xu, Hua·Tse, Brian·Fu, Jie·McAleer, Stephen·Yang, Yaodong·Wang, Yizhou·Zhu, Song-Chun·Guo, Yike·Gao, Wen

Credibility Rating

3/5 (Good)

Good quality: a reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

A comprehensive survey of AI alignment that introduces the forward/backward alignment framework and the RICE objectives for addressing risks from misaligned AI, providing a foundational analysis of alignment techniques and the integration of human values into AI systems.

Paper Details

Citations
331
12 influential
Year
2023
Methodology
survey

Metadata

arXiv preprint · analysis

Summary

The survey provides an in-depth analysis of AI alignment, introducing a framework of forward and backward alignment to address risks from misaligned AI systems. It proposes four key objectives (RICE) and explores techniques for aligning AI with human values.

Key Points

  • Introduced the RICE framework for AI alignment objectives
  • Proposed a two-phase alignment cycle of forward and backward alignment
  • Identified key risks and failure modes in AI systems

Review

This comprehensive survey addresses the critical challenge of AI alignment: ensuring that AI systems behave in accordance with human intentions and values. The authors introduce a framework decomposing alignment into forward alignment (training aligned systems) and backward alignment (evaluating and refining their alignment), centered on four key principles: Robustness, Interpretability, Controllability, and Ethicality (RICE).

The work systematically examines the motivations, mechanisms, and potential solutions to AI misalignment. It explores failure modes such as reward hacking and goal misgeneralization, and discusses dangerous capabilities and misaligned behaviors that could emerge in advanced AI systems. The survey provides a structured approach to alignment research, covering learning from feedback, handling distribution shifts, assurance techniques, and governance practices. By presenting a holistic view of the field, the authors contribute a crucial resource for understanding and mitigating risks associated with increasingly capable AI systems.

Cited by 6 pages

Resource ID: f612547dcfb62f8d | Stable ID: OWYyZGIwMT