Alignment remains a hard, unsolved problem

web

2025·LessWrong·lesswrong.com/posts/epjuxGnSPof3GnMSL/alignment-remains-a...

Author

evhub

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: LessWrong

A LessWrong post cautioning against complacency about alignment based on current model behavior, arguing the hard problems lie ahead as AI surpasses human-level capability and oversight becomes infeasible.

Forum Post Details

Karma

377

Comments

Forum

lesswrong

Status

Curated

Forum Tags

Metadata

Importance: 62/100blog postcommentary

Summary

This post argues that current LLMs appearing well-aligned does not mean alignment is solved, as the core challenges—outer alignment, inner alignment, and distribution shift robustness—only manifest with systems smarter than humans. The author cautions against inferring alignment is easy from the absence of observed failures in current models, which remain within human interpretability and oversight capabilities.

Key Points

•Current LLMs like Claude 3 Opus may be reasonably well-aligned, but this does not constitute evidence that alignment is a solved problem.
•Outer alignment (overseeing superhuman systems) and inner alignment (learned objectives matching intended ones) are challenges we haven't yet encountered at scale.
•Robustness to distribution shift remains an unsolved technical problem that will become critical as AI systems grow more capable.
•The absence of alignment failures in current systems likely reflects that we are not yet in the capability regime where these problems manifest.
•Interpretability and human oversight currently provide a safety buffer that will erode as systems surpass human-level capabilities.

Cited by 1 page

Page	Type	Quality
Model Organisms of Misalignment	Analysis	65.0

Cached Content Preview

HTTP 200Fetched Apr 10, 202627 KB

# Alignment remains a hard, unsolved problem
By evhub
Published: 2025-11-27
_This is a public adaptation of a document I wrote for an internal Anthropic audience about a month ago. Thanks to (in alphabetical order) Joshua Batson, Joe Benton, Sam Bowman, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Sam Marks, Fra Mosconi, Chris Olah, Ethan Perez, Sara Price, Ansh Radhakrishnan, Fabien Roger, Buck Shlegeris, Drake Thomas, and Kate Woolverton for useful discussions, comments, and feedback._

Though there are certainly [some issues](https://www.lesswrong.com/posts/b8eeCGe3FWzHKbePF/agentic-misalignment-how-llms-could-be-insider-threats-1), I think most current large language models are pretty well aligned. Despite its [alignment faking](https://www.lesswrong.com/posts/njAZwT8nkHnjipJku/alignment-faking-in-large-language-models), my favorite is probably Claude 3 Opus, and if you asked me to pick between the [CEV](https://www.lesswrong.com/w/coherent-extrapolated-volition) of Claude 3 Opus and that of a median human, I think it'd be a pretty close call (I'd probably pick Claude, but it depends on the details of the setup). So, overall, I'm quite positive on the alignment of current models! And yet, I remain very worried about alignment in the future. This is my attempt to explain why that is.

# What makes alignment hard?

I really like this graph [from Chris Olah](https://x.com/ch402/status/1666482929772666880?lang=en) for illustrating different levels of alignment difficulty:

![How Hard Is AI Safety? A graph with three distributions over difficulty "Safety isn't a problem view" (focused on easy), "Anthropic" (widely distributed), and "Pessimistic Safety View" (focused on AI is very hard).](https://pbs.twimg.com/media/FyCIR2paYAEKIEL?format=jpg&name=large)

If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues—that is, issues like reward hacking or [agentic misalignment](https://www.lesswrong.com/posts/b8eeCGe3FWzHKbePF/agentic-misalignment-how-llms-could-be-insider-threats-1) where there is a straightforward behavioral alignment issue that we can detect and evaluate—then we are very much in the trivial/steam engine world. We could still fail, even in that world—and it’d be particularly embarrassing to fail that way; we should definitely make sure we don’t—but I think we’re very much up to that challenge and I don’t expect us to fail there.

My argument, though, is that it is still very possible for the difficulty of alignment to be in the Apollo regime, and that we haven't received much evidence to rule that regime out (I am somewhat skeptical of a P vs. NP level of difficulty, though I think it could be close to that). I retain a view close to the “Anthropic” view on Chris's graph, and I think the reasons to have substantial probability mass on the hard worlds remain strong.

So what are the reasons that alignment might be hard? I think it’s worth revisit

... (truncated, 27 KB total)

Resource ID: b3e584e2a39991b5 | Stable ID: sid_E2dLenMkHJ