A Comprehensive Survey of DPO
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A comprehensive 2024 survey useful for researchers seeking an organized overview of DPO methods and variants as an alternative to RLHF for LLM alignment; more a technical reference than introductory material.
Paper Details
Metadata
Abstract
With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.
Summary
This survey provides a systematic review of Direct Preference Optimization (DPO), an RL-free alternative to RLHF for aligning LLMs with human preferences. It categorizes recent research across theoretical analyses, algorithm variants, preference datasets, and applications, while identifying open challenges and proposing future research directions.
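For reference (not reproduced on this page), the standard DPO objective from Rafailov et al. (2023), which the surveyed variants build on, replaces an explicit reward model with a contrast of policy and reference log-ratios on a preference pair $(x, y_w, y_l)$:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $y_w$ and $y_l$ are the preferred and dispreferred responses, $\sigma$ is the logistic function, $\pi_{\mathrm{ref}}$ is typically the SFT model, and $\beta$ controls the strength of the implicit KL regularization; no separate reward model or on-policy RL loop is required.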
Key Points
- DPO eliminates the need for a separate reward model and RL training loop, simplifying the RLHF pipeline while achieving competitive alignment performance (a minimal loss sketch follows this list).
- The survey categorizes DPO research by key questions: theoretical foundations, algorithmic variants, dataset construction, and downstream applications.
- Covers extensions of DPO to multimodal settings, including Large Vision Language Models (LVLMs), broadening applicability beyond text-only alignment.
- Identifies persistent challenges in DPO, including distribution shift, data quality sensitivity, and theoretical gaps compared to full RLHF.
- Maintains an actively updated GitHub repository of relevant papers, serving as a living reference for the alignment research community.
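To make the first point above concrete, here is a minimal sketch, not taken from the survey, of how the DPO loss can be computed from per-response log-probabilities; the function and variable names are illustrative, and `beta=0.1` is just a common default.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Bradley-Terry style DPO loss on pre-computed sequence log-probs.

    Each argument is a (batch,) tensor holding the summed log-probability of
    the chosen / rejected response under the trainable policy or the frozen
    reference model; beta scales the implicit KL penalty.
    """
    # Implicit rewards: how much the policy up-weights each response
    # relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-likelihood that the chosen response wins the comparison.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probs; in practice these come from summing token
# log-probs of each response under the policy and reference language models.
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
loss = dpo_loss(policy_chosen, policy_rejected, torch.randn(4), torch.randn(4))
loss.backward()
```

The absence of any reward-model forward pass or sampling loop in this sketch is exactly the simplification the survey highlights relative to PPO-based RLHF.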
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| RLHF | Research Area | 63.0 |
Cached Content Preview
arXiv:2410.15595v3 [cs.AI] 14 Jul 2025
# A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications
Wenyi Xiao1\*, Zechuan Wang1\*, Leilei Gan1\*†, Shuai Zhao2, Zongyue Li1, Ruirui Lei1, Wanggui He3, Luu Anh Tuan2, Long Chen3, Hao Jiang3, Zhou Zhao1, Fei Wu1

\*Equal contribution. †Corresponding author.

1Zhejiang University, China; 2Nanyang Technological University, Singapore; 3Alibaba Group, China
###### Abstract
With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO’s various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO’s current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on [https://github.com/Mr-Loevan/DPO-Survey](https://github.com/Mr-Loevan/DPO-Survey "").
## 1 Introduction
Through pre-training on extensive, high-quality corpora with the next-token-prediction objective at substantial computational cost, Large Language Models (LLMs) (OpenAI, 2022; Touvron et al., 2023a; OpenAI, 2024a; Jiang et al., 2023) assimilate comprehensive world knowledge into their internal parameters, demonstrating impre
... (truncated, 98 KB total)