Paul Christiano - TIME100 AI
Credibility Rating
Good (3/5): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: TIME
A mainstream media profile of a leading alignment researcher; useful for understanding Christiano's contributions to RLHF and ARC's role as an independent evaluator of frontier AI systems, though not a primary technical source.
Metadata
Importance: 62/100 | Tags: news article, commentary
Summary
A TIME100 AI profile of Paul Christiano, co-inventor of Reinforcement Learning from Human Feedback (RLHF) and founder of the Alignment Research Center (ARC). The piece covers his development of RLHF at OpenAI, his transition to theoretical alignment research, and ARC's role in evaluating dangerous capabilities in frontier AI models.
Key Points
- Paul Christiano is a principal architect of RLHF, the technique that trains AI systems to learn human preferences from ranked outputs rather than from explicitly specified goals.
- He founded the Alignment Research Center (ARC), a nonprofit focused on theoretical alignment research and on evaluating dangerous capabilities in frontier models.
- ARC serves as a trusted third-party evaluator for OpenAI and Anthropic when they decide whether to release new models.
- Christiano left OpenAI in 2021 to return to theoretical research, which better suited his background in learning theory.
- Early RLHF work spanned simulated robotics and games before being adapted to language models, roughly concurrently with GPT-1's development.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
HTTP 200 | Fetched Mar 20, 2026 | 9 KB
Sep 7, 2023 7:00 AM ET
# Paul Christiano
Founder, Alignment Research Center

by [Will Henshall](https://time.com/author/will-henshall/)

Illustration by TIME; reference image courtesy of Paul Christiano
King Midas, a figure from Greek mythology, wished that everything he touched would turn to gold. His wish was granted, but his gift quickly became a curse as even food and his daughter were transformed.
A decade ago, many AI doomsday thought experiments involved King Midas scenarios, in which humans told an AI system what to do and the AI, by carrying out the instruction maximally and literally, caused a catastrophe. An AI system told to maximize the output of a paper-clip factory, for example, would turn all of the atoms on earth, including those that make up human bodies, into paper clips.
Those who work on alignment, the problem of ensuring AI systems behave as their creators intend, no longer worry about this. Researchers can now train AI systems to iteratively learn difficult-to-articulate goals: humans rank the responses an AI gives by how helpful they are, and the AI system learns to produce results that it predicts will be rated as helpful as possible. With this method, humans don't have to say what they want the AI to do; they can simply tell the AI whether it has done what they wanted.
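To make the ranking-and-prediction loop concrete, here is a minimal sketch of the preference-learning step that paragraph describes, written in PyTorch. Everything in it (the toy reward model, the synthetic embeddings, the hyperparameters) is an illustrative assumption rather than something taken from the article; a real RLHF pipeline would train the reward model on human comparisons of language-model outputs and then optimize the policy against it, typically with PPO.

```python
# Minimal sketch of the preference-learning step in RLHF.
# A reward model is trained on human rankings (pairs where one response
# was preferred over another); the policy is later nudged toward outputs
# this model predicts humans would rate as helpful. All names, shapes,
# and hyperparameters here are illustrative assumptions.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a response embedding; higher means 'predicted more helpful'."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the preferred response's score
    # above the rejected one's.
    return -torch.log(torch.sigmoid(r_chosen - r_rejected)).mean()

# Toy stand-in data: embeddings of (chosen, rejected) response pairs
# that human rankings would supply in a real pipeline.
chosen = torch.randn(256, 32) + 0.5
rejected = torch.randn(256, 32) - 0.5

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
for step in range(200):
    loss = preference_loss(rm(chosen), rm(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

# In full RLHF the policy (e.g., a language model) is then optimized,
# typically with PPO, to maximize this learned reward on its own outputs;
# here we just check that the reward separates the two groups.
with torch.no_grad():
    print(rm(chosen).mean().item(), rm(rejected).mean().item())
```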
This technique is known as reinforcement learning from human feedback (RLHF). Paul Christiano is one of its principal architects. Among the most respected researchers in the field of alignment, Christiano joined OpenAI in 2017. Four years later, he left to set up the Alignment Research Center (ARC), a Berkeley, Calif.–based nonprofit research organization that carries out theoretical alignment research and develops techniques to test whether an AI model has dangerous capabilities. When OpenAI and Anthropic want to know whether they should release a model, they ask ARC.
TIME spoke with Christiano about the invention of RLHF, leaving OpenAI, his work at ARC, and the idiosyncrasies of the AI alignment community. _(This interview has been condensed and edited for clarity.)_
**TIME: Could you describe the development of RLHF as a technology?**
**Paul Christiano:** Starting with backstory, before I was at OpenAI, there are two relevant threads to be aware of. One is that I’ve been thinking about alignment for a pretty long time and trying to understand what a plausible alignment solution looks like. RLHF stands out as a very early and natural step.
I think a second thread to be aware of is that a bunch of people have worked, normally in much simpler settings, on learning values from humans. There
... (truncated, 9 KB total)
Resource ID: 78d569493b2b3825 | Stable ID: OGY4M2YxOT