Reward Modeling
Training learned reward functions from human preferences to guide AI optimization, including research on reward hacking and gaming.

Status: Active
Key Papers: 1
Grants: 2
Total Funding: $15K
First Proposed: 2017 (Christiano et al.)
Cluster: Alignment Training
Tags: training, reward, alignment
Grants (2)
| Name | Recipient | Amount | Funder | Date |
|---|---|---|---|---|
| Mitigating Reward Hacking Through RL Training Interventions | Aria Wong | $7.9K | Manifund | 2026-02-18 |
| GoalsRL — Workshop on Goal Specifications for Reinforcement Learning | GoalsRL | $7.5K | Coefficient Giving | 2018-08 |
Funding by Funder
| Funder | Grants | Total Amount |
|---|---|---|
| Manifund | 1 | $7.9K |
| Coefficient Giving | 1 | $7.5K |
Key Papers & Resources (1)
SEMINAL: Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)
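For context on the technique this cluster covers, below is a minimal, illustrative sketch of the preference-based reward-model objective from that line of work: a reward network is fit so that, under a Bradley-Terry model, trajectory segments preferred by human labelers receive higher predicted return. This is a hedged sketch, not code from any specific project; names such as `RewardModel` and `preference_loss`, the network sizes, and the data shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps per-step observation-action features to a scalar reward (illustrative)."""
    def __init__(self, obs_act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, timesteps, obs_act_dim); sum per-step rewards
        # to get a predicted return for the whole segment.
        return self.net(segment).squeeze(-1).sum(dim=-1)

def preference_loss(model: RewardModel,
                    seg_a: torch.Tensor,
                    seg_b: torch.Tensor,
                    prefer_a: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry / logistic loss: the labeler is modeled as preferring
    segment A with probability sigmoid(R(A) - R(B)); prefer_a is 1.0 if A was chosen."""
    logits = model(seg_a) - model(seg_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)

# Illustrative usage on random data (all shapes are arbitrary assumptions).
if __name__ == "__main__":
    model = RewardModel(obs_act_dim=8)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    seg_a = torch.randn(32, 25, 8)            # 32 pairs of 25-step segments
    seg_b = torch.randn(32, 25, 8)
    prefer_a = torch.randint(0, 2, (32,)).float()
    loss = preference_loss(model, seg_a, seg_b, prefer_a)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned reward is then used as the optimization target for a policy, which is where the reward-hacking failure modes studied by the grants above arise: the policy can exploit errors in the learned reward rather than the intended behavior.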