Longterm Wiki

Jan Leike Argues for Alignment Optimism

web

Credibility Rating

2/5 (Mixed)

Mixed quality. Some useful content but inconsistent editorial standards. Claims should be verified.

Rating inherited from publication venue: Substack

Jan Leike is a prominent alignment researcher who co-led OpenAI's alignment team before departing in 2024; his perspective on alignment tractability carries significant weight in the field.

Metadata

Importance: 55/100 · blog post · commentary

Summary

Jan Leike, former OpenAI alignment researcher, presents arguments for cautious optimism about solving the AI alignment problem, outlining reasons to believe technical alignment research can succeed. The piece discusses progress in the field and conditions under which alignment challenges may be tractable.

Key Points

  • Jan Leike outlines reasons for optimism that AI alignment is solvable despite significant technical challenges
  • Discusses the state of alignment research and tractability of core alignment problems
  • Argues that current research directions show meaningful progress toward aligned AI systems
  • Considers conditions and assumptions underlying an optimistic alignment outlook
  • Balances acknowledgment of risks with belief that concerted effort can produce safe AI

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Easy | Argument | 53.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 47 KB
# [Musings on the Alignment Problem](https://aligned.substack.com/)


# Why I’m optimistic about our alignment approach

### Some arguments in favor and responses to common objections

[Jan Leike](https://substack.com/@aligned)

Dec 05, 2022


[OpenAI’s approach to alignment research](https://openai.com/blog/our-approach-to-alignment-research/) involves [perfecting RLHF](https://openai.com/blog/instruction-following/), [AI-assisted human evaluation](https://aligned.substack.com/p/ai-assisted-human-feedback), and [automated alignment research](https://aligned.substack.com/p/alignment-mvp). Why is this a good strategy? What are the reasons to be optimistic about it? My optimism stems from five sources:

1. **Positive updates about AI.** A lot of developments over the last few years have made AI systems more favorable to alignment than they looked initially, both in terms of how the AI tech tree is shaking out and the empirical evidence on alignment we’ve gathered so far.

2. **A more modest goal.** We’re not trying to solve all alignment problems. We’re just trying to align a system that’s capable enough to make more alignment progress than we can.

3. **Evaluation is easier than generation.** This is a very general principle that holds across many domains. It’s true for alignment research as well (see the sketch after this list).

4. **We’re setting ourselves up for iteration.** We can set ourselves up for iterative, measurable improvements on our alignment path.

5. **Conviction in language models.** Language models will be smart enough that we can get useful alignment research work out of them.

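To make principle 3 concrete, here is a minimal sketch of the evaluation/generation asymmetry in a toy domain (sorting). The example is illustrative only and not from the original post; the point is that checking a proposed answer is cheaper than producing one.

```python
from collections import Counter

def generate(xs):
    # Generation: producing a sorted order costs O(n log n) comparisons.
    return sorted(xs)

def evaluate(xs, ys):
    # Evaluation: checking a proposed answer is only O(n) work:
    # ys must be non-decreasing and a permutation of xs.
    in_order = all(a <= b for a, b in zip(ys, ys[1:]))
    same_items = Counter(xs) == Counter(ys)
    return in_order and same_items

xs = [3, 1, 2]
assert evaluate(xs, generate(xs))   # a correct answer passes
assert not evaluate(xs, [1, 3, 2])  # a wrong answer is caught cheaply
```

The analogous bet in the post is that humans, assisted by AI tools, can cheaply evaluate alignment research that would be expensive for them to generate themselves.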

Nevertheless, there is still a lot to do, and it’s critical to remember that aligning systems that are smarter than us will look very different from aligning today’s models. It’s also important that optimism doesn’t displace caution: the burden of proof is always on showing that a new system is sufficiently aligned, and we cannot shift that burden to showing that the situation has changed compared to earlier systems.

The last section [responds to some common objections to our approach](https://aligned.substack.com/i/88447351/common-objections).

# My reasons for optimism

## 1. Positive updates about AI

### 1.1 The AI tech tree is looking favorable

A few years ago it looked like the path to AGI ran through training deep RL agents from scratch in a wide range of games and multi-agent environments. These agents would be aligned to maximize simple score functions such as survival and winning games and wouldn’t know much about human values. Aligning the resulting agents would be a lot of effort: not only do we have to create a human-aligned object

... (truncated, 47 KB total)
Resource ID: d5d3cc4553499475 | Stable ID: MzE5ZWMxZD