Longterm Wiki

RLHF-V: Towards Trustworthy Multimodal LLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

web

A CVPR 2024 paper directly relevant to AI safety alignment research: it reduces hallucinations in multimodal LLMs through fine-grained RLHF, making it applicable to practitioners working on trustworthy and reliable AI systems.

Metadata

Importance: 62/100 · conference paper · primary source

Summary

RLHF-V introduces a method for improving the trustworthiness of Multimodal Large Language Models (MLLMs) by aligning model behavior using fine-grained correctional human feedback. The approach collects segment-level human corrections on model hallucinations and uses them to train models via dense direct preference optimization, significantly reducing hallucinations. The method demonstrates strong performance improvements on benchmarks measuring MLLM trustworthiness.

Key Points

  • Proposes RLHF-V, a framework that uses fine-grained human correctional feedback at the segment level to reduce hallucinations in multimodal LLMs.
  • Introduces dense Direct Preference Optimization (DDPO) to leverage detailed human corrections more effectively than standard preference learning approaches.
  • Collects a dataset of human-annotated correctional feedback on MLLM outputs, enabling more precise behavior alignment.
  • Achieves significant reductions in hallucination rates on benchmarks like POPE and MMHal-Bench compared to baselines including LLaVA and InstructBLIP.
  • Addresses a key AI safety concern: factual reliability and trustworthiness of multimodal AI systems deployed in real-world settings.
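The dense-DPO idea in the bullets above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the function name `ddpo_loss`, the up-weighting factor `gamma`, and the exact weighting scheme are hypothetical, but the sketch captures the core move of weighting human-corrected segments more heavily inside a standard DPO objective.

```python
import math

def ddpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
              changed_w, changed_l, beta=0.1, gamma=5.0):
    """Sketch of a dense-DPO-style loss over per-token log-probs.

    logp_w / logp_l: policy log-probs for each token of the corrected
    (preferred) and hallucinated (rejected) responses. ref_logp_* are
    the same quantities under the frozen reference model. changed_*
    flag tokens inside human-corrected segments; gamma up-weights
    them so the fine-grained feedback dominates the objective.
    """
    def score(logp, ref_logp, changed):
        # Response-level score: sum of per-token log-ratios, with
        # corrected segments weighted more heavily than unchanged text.
        return sum((gamma if c else 1.0) * (lp - rp)
                   for lp, rp, c in zip(logp, ref_logp, changed))

    margin = (score(logp_w, ref_logp_w, changed_w)
              - score(logp_l, ref_logp_l, changed_l))
    # Standard DPO outer loss: -log sigmoid(beta * margin),
    # written stably as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))
```

With gamma = 1 this reduces to ordinary sequence-level DPO; the segment masks are what let the segment-level human corrections act densely on exactly the hallucinated spans.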

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Easy | Argument | 53.0 |
| AI Alignment | Approach | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 51 KB
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Tianyu Yu¹  Yuan Yao²*  Haoye Zhang¹  Taiwen He¹  Yifeng Han¹
Ganqu Cui¹  Jinyi Hu¹  Zhiyuan Liu¹*  Hai-Tao Zheng¹,³,⁴*  Maosong Sun¹
¹Tsinghua University  ²National University of Singapore
³Shenzhen International Graduate School, Tsinghua University
⁴Pengcheng Laboratory, Shenzhen, China
yiranytianyu@gmail.com  yaoyuanthu@gmail.com
https://rlhf-v.github.io
Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. However, existing MLLMs prevalently suffer from serious hallucination problems, generating text that is not factually grounded in associated images. The problem makes existing MLLMs untrustworthy and thus impractical in real-world (especially high-stakes) applications. To address the challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. Specifically, RLHF-V collects human preference in the form of segment-level corrections on hallucinations, and performs dense direct preference optimization over the human feedback. Comprehensive experiments on five benchmarks in both automatic and human evaluation show that RLHF-V can enable substantially more trustworthy MLLM behaviors with promising data and computation efficiency. Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the hallucination rate of the base MLLM by 34.8%, outperforming the concurrent LLaVA-RLHF trained on 10k annotated data. The final model achieves state-of-the-art performance in trustworthiness among open-source MLLMs, and shows better robustness than GPT-4V in preventing hallucinations aroused from over-generalization.
*Corresponding authors

1. Introduction
The recent success of Multimodal Large Language Models (MLLMs) marks a significant milestone in AI research [2, 4, 11, 12, 19, 27, 29, 42, 51]. By connecting visual signals and Large Language Models (LLMs), MLLMs show unprecedented capabilities in multimodal understanding, reasoning, and interaction [28, 29, 44]. The models are typically pre-trained on large-scale image-text data to learn the foundational multimodal knowledge and capabilities [2, 4, 12, 19]. To steer the model behavior, most MLLMs are further fine-tuned with instruction tuning (also known as supervised fine-tuning), which supervises models to clone the behavior from demonstration data, enabling MLLMs to engage users in various open-world tasks [4, 11, 25, 27, 47].
However, current MLLM behaviors are not well aligned with human preferences. A glaring issue is their tendency to produce hallucinations: responses that are not factually grounded in the associated images [21, 25, 29, 37]. This typically includes descriptions of non-existing visual contents and errors in descriptions. As shown in Figure 1,
current MLLMs can hallucinat

... (truncated, 51 KB total)
Resource ID: 108f52553230c4d5 | Stable ID: OGVhMDQwND