Back
Compute Funding for SERI MATS LLM Alignment Research - Manifund
webThis Manifund project page documents a compute grant supporting SERI MATS-affiliated LLM alignment research; useful as a reference for understanding how alignment research is funded and what topics SERI MATS scholars pursued circa 2023.
Metadata
Importance: 38/100otherhomepage
Summary
A Manifund grant project led by Ethan Perez funding compute for four LLM alignment research projects through SERI MATS and university collaborators. Projects focus on RLHF failure modes (sycophancy, inverse scaling), debate-based alignment, and related technical safety work. The project exceeded its $200K funding goal, raising over $400K.
Key Points
- •Funds compute for 11 researchers across SERI MATS, Oxford, UCL, and NYU working on LLM alignment under Ethan Perez's supervision
- •Project 1 investigates RLHF-driven failures including sycophancy, where LLMs reinforce user biases or misinformation as models scale
- •Project 2 tests debate as a scalable oversight technique, examining whether LLM-generated arguments improve human evaluator accuracy
- •Grant administered by FAR AI; raised $400K against a $200K goal, indicating strong community support
- •Research targets near-term alignment failures with implications for future scaled models and successor methods to RLHF
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
HTTP 200Fetched Mar 20, 202615 KB
Compute and other expenses for LLM alignment research \| Manifund
6
## Compute and other expenses for LLM alignment research
[Technical AI safety](https://manifund.org/causes/tais)
[Ethan Josean Perez](https://manifund.org/ethanjperez)
ActiveGrant
$400,100raised
$200,000funding goal
Fully funded and not currently accepting donations.
Ethan Perez is currently supervising 4 independent or academia-based research projects on aligning LLMs, which would significantly benefit from additional funding for compute. These projects are led by 11 full-time research collaborators, 3 from universities (Oxford, University of College London, and New York University) and 8 from SERI MATS.
The funding would be administered by FAR; below an outline of the projects.
**1 Finding Failures Driven by Human Feedback**
Reinforcement Learning from Human Feedback (RLHF) has grown increasingly widespread as a technique for aligning LLMs with human preferences. Finding fundamental failures in RLHF is important for under- standing how to address future issues that will come up and still persist, after organizations have improved their RLHF training setups and scaled up models. As a result, we aim to find cases of “inverse scaling” \[1\] driven by RLHF; cases where model behavior gets worse as models grow larger and more effective at op- timizing human preference judgments. By discovering failures driven by RLHF, and showing that human feedback is at fault, we aim to drive the development of successor methods to RLHF, including techniques for using LLMs to aid humans in providing preference judgments \[2\].
Thus far, we have found a number of egregious cases of LLM sycophancy, where LLMs repeat back user views, including cases where the LLM repeats misinformation or blatantly flatters users. We are currently working to generate datasets (e.g., using LLMs) to test these egregious failures more robustly. We are also running experiments to more carefully determine the extent to which these failures are driven by flaws in human feedback vs. remnants of behaviors learned by LLMs during pretraining. With additional funding for compute, we would be able to generate more datasets for testing for RLHF-driven failures in LLMs, as well as test LLMs like GPT4 on more datasets. Moreover, we use LLMs to analyze human feedback datasets themselves (e.g., for understanding what fraction of times the human feedback data incentivizes behaviors like sycophancy).
**2 Debate**
One promising such technique is debate \[3, 4\]: training AI assistants to provide the strongest arguments and counter-arguments for a given answer to a question. Such proposals hypothesize that strong arguments and counter-arguments make it easier for a human to evaluate an answ
... (truncated, 15 KB total)Resource ID:
60649beb13903c7c | Stable ID: N2Q1ZjNjMG