
RLHF

Alignment Training (mature)

Reinforcement Learning from Human Feedback: a training technique that fine-tunes AI models on human preference ratings to align their outputs with human values.
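In outline, RLHF proceeds in three stages: collect human comparisons between model outputs, fit a reward model to those preferences, then fine-tune the policy with reinforcement learning against that reward. As a rough sketch of the reward-modeling step only, the snippet below implements the Bradley-Terry preference loss used in Christiano et al. (2017); the names and toy values are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the reward model to score the human-preferred
    # response above the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy batch: scalar rewards a reward model assigned to 4 preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.9, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -0.1, 1.1])
print(f"preference loss: {preference_loss(r_chosen, r_rejected).item():.4f}")
```

In a full pipeline these scalars would come from a reward model scoring chosen and rejected responses, and the trained reward model would then supply the reward signal for the RL fine-tuning stage.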

Organizations: 4
Key Papers: 3
Grants: 4
Total Funding: $637K
Risks Addressed: 3
First Proposed: 2017 (Christiano et al.)
Cluster: Alignment Training

Tags: training, human-feedback, alignment

Organizations (4)

| Organization | Role |
| --- | --- |
| Anthropic | pioneer |
| Google DeepMind | active |
| Meta AI (FAIR) | active |
| OpenAI | pioneer |

Grants (4)

| Name | Recipient | Amount | Funder | Date |
| --- | --- | --- | --- | --- |
| Compute and other expenses for LLM alignment research | Ethan Josean Perez | $400K | Manifund | 2023-08-19 |
| Grant to "support a NeurIPS competition applying human feedback in a non-language-model setting, specifically pretrained models in Minecraft." | Berkeley Existential Risk Initiative | $155K | FTX Future Fund | 2022-05 |
| Berkeley Existential Risk Initiative — MineRL BASALT Competition | Berkeley Existential Risk Initiative | $70K | Coefficient Giving | 2021-07 |
| 4-month salary for a research visit with David Krueger on evaluating non-myopia in language models and RLHF systems | Alan Chan | $12K | Long-Term Future Fund (LTFF) | 2022 |

Funding by Funder

| Funder | Grants | Total Amount |
| --- | --- | --- |
| Manifund | 1 | $400K |
| FTX Future Fund | 1 | $155K |
| Coefficient Giving | 1 | $70K |
| Long-Term Future Fund (LTFF) | 1 | $12K |

Key Papers & Resources (3)