RLHF
Reinforcement Learning from Human Feedback: a mature training technique that fine-tunes AI models using human preference ratings to align their outputs with human values.
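In outline, RLHF has two learned components: a reward model fit to pairwise human preference ratings, and the policy (the model being aligned), which is then fine-tuned with reinforcement learning against that learned reward, typically with a KL penalty toward the original model (as in the Ouyang et al. and Bai et al. papers listed under Key Papers & Resources). Below is a minimal sketch of the reward-model objective, assuming the per-response scalar rewards have already been computed; the function and tensor names are illustrative, not taken from any of the cited papers.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) preference loss for an RLHF reward model.

    r_chosen / r_rejected: scalar rewards the model assigns to the
    human-preferred and dispreferred responses to the same prompt.
    Minimizing this loss pushes r_chosen above r_rejected, so the
    reward model learns to rank outputs the way human raters did.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative usage with made-up scores for a batch of three comparisons:
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, 1.1])
loss = reward_model_loss(r_chosen, r_rejected)  # small when chosen outranks rejected
```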
Organizations: 4 · Key Papers: 3 · Grants: 4 · Total Funding: $637K · Risks Addressed: 3
First Proposed: 2017 (Christiano et al.)
Cluster: Alignment Training
Tags: training, human-feedback, alignment
Organizations (4)
| Organization | Role |
|---|---|
| Anthropic | pioneer |
| Google DeepMind | active |
| Meta AI (FAIR) | active |
| OpenAI | pioneer |
Grants (4)
| Name | Recipient | Amount | Funder | Date |
|---|---|---|---|---|
| Compute and other expenses for LLM alignment research | Ethan Josean Perez | $400K | Manifund | 2023-08-19 |
| Grant to "support a NeurIPS competition applying human feedback in a non-language-model setting, specifically pretrained models in Minecraft." | Berkeley Existential Risk Initiative | $155K | FTX Future Fund | 2022-05 |
| Berkeley Existential Risk Initiative — MineRL BASALT Competition | Berkeley Existential Risk Initiative | $70K | Coefficient Giving | 2021-07 |
| 4-month salary for a research visit with David Krueger on evaluating non-myopia in language models and RLHF systems | Alan Chan | $12K | Long-Term Future Fund (LTFF) | 2022 |
Funding by Funder
| Funder | Grants | Total Amount |
|---|---|---|
| Manifund | 1 | $400K |
| FTX Future Fund | 1 | $155K |
| Coefficient Giving | 1 | $70K |
| Long-Term Future Fund (LTFF) | 1 | $12K |
Key Papers & Resources (3)
| Paper | Authors | Year | Status |
|---|---|---|---|
| Deep Reinforcement Learning from Human Preferences | Christiano et al. | 2017 | seminal |
| Training language models to follow instructions with human feedback | Ouyang et al. (OpenAI) | 2022 | seminal |
| Training a Helpful and Harmless Assistant with RLHF | Bai et al. (Anthropic) | 2022 | seminal |