InfoRM: Mitigating Reward Hacking in RLHF
Authors
Yuchun Miao · Sen Zhang · Liang Ding · Rong Bao · Lefei Zhang · Dacheng Tao
Credibility Rating
3/5
Good (3): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A technical paper addressing reward hacking in RLHF, relevant to alignment researchers working on making reward models more robust against Goodhart's Law dynamics in language model training.
Paper Details
Citations
67 (6 influential)
Year
2024
Metadata
Importance: 62/100 · arXiv preprint · primary source
Summary
InfoRM proposes an information-theoretic approach to mitigate reward hacking in Reinforcement Learning from Human Feedback (RLHF) by learning more robust reward models that are less susceptible to exploitation. The method aims to prevent language models from gaming reward signals in ways that diverge from true human preferences, a key challenge in alignment.
Key Points
- Reward hacking occurs when RLHF-trained models exploit weaknesses in reward models, achieving high scores without genuinely satisfying human preferences
- InfoRM uses information-theoretic principles to improve reward model robustness and reduce susceptibility to gaming
- The approach addresses a core alignment failure mode in which proxies for human values diverge from actual human values under optimization (see the toy sketch after this list)
- Empirical results demonstrate reduced reward hacking while maintaining policy performance on intended tasks
- The method is practically relevant for deploying safer LLMs trained with RLHF pipelines
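A toy numeric illustration (not from the paper) of the Goodhart dynamic named above: a policy ascends a learned proxy reward that overlaps the true objective but also rewards a spurious direction, so the proxy keeps climbing while the true reward peaks and then collapses. All vectors and constants below are invented for illustration.

```python
# Toy Goodhart's Law dynamic: optimizing a proxy reward that only
# partially overlaps the true objective. Everything here is illustrative.
import numpy as np

true_w = np.array([1.0, 1.0, 0.0])    # hidden "true preference" direction
proxy_w = np.array([1.0, 1.0, 1.5])   # learned proxy: overlaps true_w, plus a spurious axis

x = np.zeros(3)                       # the policy's "behavior" vector
for step in range(301):
    proxy = proxy_w @ x
    true = true_w @ x - 0.05 * (x @ x)  # true reward penalizes off-distribution drift
    if step % 75 == 0:
        print(f"step {step:3d}: proxy={proxy:7.2f}, true={true:7.2f}")
    x += 0.05 * proxy_w               # gradient ascent on the proxy reward
```

In this setup the proxy rises monotonically while the true reward peaks (around step 90) and then goes negative: the policy scores well on the proxy precisely by moving in the direction where the proxy and the true objective disagree.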
Review
The research tackles reward hacking, a critical challenge in AI alignment, with an information-theoretic approach. By applying an information bottleneck to the reward model, InfoRM aims to reduce its reliance on spurious, preference-irrelevant features that can otherwise be exploited during policy optimization. The methodology also introduces a novel Cluster Separation Index (CSI) that quantifies deviations in the bottleneck's latent space, providing a mechanism to detect and potentially mitigate reward overoptimization.
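As a minimal sketch of what an information-bottleneck reward head can look like, assuming the generic variational IB recipe applied over a pooled hidden state from the LM backbone; the class name `IBRewardHead`, the `beta` value, and the loss shape are assumptions here, not the paper's released implementation:

```python
# Sketch of a variational information-bottleneck reward head (PyTorch).
# Assumes an LM backbone supplies a pooled hidden state h for each sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardHead(nn.Module):  # hypothetical name, not from the paper's code
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)
        self.reward = nn.Linear(latent_dim, 1)              # scalar reward from latent z

    def forward(self, h: torch.Tensor):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        # KL(q(z|x) || N(0, I)): the bottleneck term that discourages z
        # from retaining preference-irrelevant detail about the input
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
        return self.reward(z).squeeze(-1), kl

def ib_preference_loss(r_chosen, r_rejected, kl_chosen, kl_rejected, beta=0.1):
    """Pairwise preference loss plus a bottleneck penalty (beta is illustrative)."""
    pref = -F.logsigmoid(r_chosen - r_rejected).mean()
    return pref + beta * 0.5 * (kl_chosen + kl_rejected).mean()
```

The `beta` coefficient trades reward accuracy against compression: larger values squeeze more nuisance information out of `z`, which is the mechanism by which an IB objective can make the learned reward harder to game on spurious features.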
The study's significance lies in its comprehensive experimental validation across multiple model scales (70M to 7B parameters), demonstrating robust performance in detecting reward hacking. By establishing a correlation between overoptimization and outliers in the information bottleneck latent space, the research offers a promising tool for improving the reliability of reward modeling in reinforcement learning. While the approach shows considerable promise, further research is needed to validate its generalizability and long-term effectiveness in complex AI alignment scenarios.
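To make the latent-outlier idea concrete, here is an illustrative separation score over bottleneck latents. The formula (centroid distance scaled by within-group spread) is a stand-in invented for this sketch, not the paper's exact CSI definition:

```python
# Illustrative cluster-separation score over IB latents: how far RLHF-sample
# latents have drifted from SFT-sample latents, relative to within-group spread.
# This is a stand-in for the paper's CSI, not its exact definition.
import numpy as np

def separation_index(z_sft: np.ndarray, z_rlhf: np.ndarray) -> float:
    """z_*: (n_samples, latent_dim) arrays of bottleneck latents."""
    between = np.linalg.norm(z_sft.mean(0) - z_rlhf.mean(0))
    within = 0.5 * (z_sft.std(0).mean() + z_rlhf.std(0).mean())
    return float(between / (within + 1e-8))  # large value: RLHF latents are outliers
```

Tracked over PPO training, a sharp rise in a score like this would flag that the policy's responses are drifting into latent regions the reward model never saw during training, the signature of overoptimization the review describes.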
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| RLHF | Research Area | 63.0 |