Jan Leike Argues for Alignment Optimism

web

Substack·aligned.substack.com/p/alignment-optimism

Credibility Rating

2/5

Mixed(2)

Mixed quality. Some useful content but inconsistent editorial standards. Claims should be verified.

Rating inherited from publication venue: Substack

Jan Leike is a prominent alignment researcher who co-led OpenAI's alignment team before departing in 2024; his perspective on alignment tractability carries significant weight in the field.

Metadata

Importance: 55/100blog postcommentary

Summary

Jan Leike, former OpenAI alignment researcher, presents arguments for cautious optimism about solving the AI alignment problem, outlining reasons to believe technical alignment research can succeed. The piece discusses progress in the field and conditions under which alignment challenges may be tractable.

Key Points

•Jan Leike outlines reasons for optimism that AI alignment is solvable despite significant technical challenges
•Discusses the state of alignment research and tractability of core alignment problems
•Argues that current research directions show meaningful progress toward aligned AI systems
•Considers conditions and assumptions underlying an optimistic alignment outlook
•Balances acknowledgment of risks with belief that concerted effort can produce safe AI

Cited by 1 page

Page	Type	Quality
Why Alignment Might Be Easy	Argument	53.0

Cached Content Preview

HTTP 200Fetched Apr 9, 202642 KB

Why I’m optimistic about our alignment approach 
 
 
 
 
 

 

 

 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 

 

 

 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 

 
 
 

 
 
 
 
 

 

 
 
 

 

 

 

 

 
 

 
 

 

 

 

 
 Musings on the Alignment Problem 

 Subscribe Sign in Why I’m optimistic about our alignment approach

 Some arguments in favor and responses to common objections

 Jan Leike Dec 05, 2022 60 16 2 Share OpenAI’s approach to alignment research involves perfecting RLHF , AI-assisted human evaluation , and automated alignment research . Why is this a good strategy? What are the reasons to be optimistic about it? My optimism stems from five sources: 

 Positive updates about AI. A lot of developments over the last few years have made AI systems more favorable to alignment than they looked initially, both in terms of how the AI tech tree is shaking out and the empirical evidence on alignment we’ve gathered so far. 

 A more modest goal. We’re not trying to solve all alignment problems. We’re just trying to align a system that’s capable enough to make more alignment progress than we can. 

 Evaluation is easier than generation. This is a very general principle that holds across many domains. It’s true for alignment research as well. 

 We’re setting ourselves up for iteration. We can set ourselves up for iterative, measurable improvements on our alignment path. 

 Conviction in language models. Language models will be smart enough to get useful alignment research work out of them. 

 Nevertheless, there is still so much to do and it’s critical to remember that aligning systems that are smarter than us will look very different than aligning today’s models. It’s also important to distinguish between optimism and caution: the burden of proof is always on showing that a new system is sufficiently aligned and we cannot shift the burden of proof to showing that the situation has changed compared to earlier systems.

 The last section responds to some common objections to our approach . 

 My reasons for optimism 

 1. Positive updates about AI 

 1.1 The AI tech tree is looking favorably 

 A few years ago it looked like the path to AGI was by training deep RL agents from scratch in a wide range of games and multi-agent environments. These agents would be aligned to maximizing simple score functions such as survival and winning games and wouldn’t know much about human values. Aligning the resulting agents would be a lot of effort: not only do we have to create a human-aligned objective function from scratch, we’d likely also need to instill actually new capabilities into the agents like understanding human society, what humans care about, and how humans think.

 Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t a

... (truncated, 42 KB total)

Resource ID: d5d3cc4553499475 | Stable ID: sid_JOFdHN7QQU