Alignment remains a hard, unsolved problem
webAuthor
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: LessWrong
A LessWrong post cautioning against complacency about alignment based on current model behavior, arguing the hard problems lie ahead as AI surpasses human-level capability and oversight becomes infeasible.
Forum Post Details
Metadata
Summary
This post argues that current LLMs appearing well-aligned does not mean alignment is solved, as the core challenges—outer alignment, inner alignment, and distribution shift robustness—only manifest with systems smarter than humans. The author cautions against inferring alignment is easy from the absence of observed failures in current models, which remain within human interpretability and oversight capabilities.
Key Points
- •Current LLMs like Claude 3 Opus may be reasonably well-aligned, but this does not constitute evidence that alignment is a solved problem.
- •Outer alignment (overseeing superhuman systems) and inner alignment (learned objectives matching intended ones) are challenges we haven't yet encountered at scale.
- •Robustness to distribution shift remains an unsolved technical problem that will become critical as AI systems grow more capable.
- •The absence of alignment failures in current systems likely reflects that we are not yet in the capability regime where these problems manifest.
- •Interpretability and human oversight currently provide a safety buffer that will erode as systems surpass human-level capabilities.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
x This website requires javascript to properly function. Consider activating javascript to get access to all site functionality. Alignment remains a hard, unsolved problem — LessWrong AI Curated 2025 Top Fifty: 25 % 376
Alignment remains a hard, unsolved problem
by evhub 27th Nov 2025 AI Alignment Forum 16 min read 97 376
Ω 126
This is a public adaptation of a document I wrote for an internal Anthropic audience about a month ago. Thanks to (in alphabetical order) Joshua Batson, Joe Benton, Sam Bowman, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Sam Marks, Fra Mosconi, Chris Olah, Ethan Perez, Sara Price, Ansh Radhakrishnan, Fabien Roger, Buck Shlegeris, Drake Thomas, and Kate Woolverton for useful discussions, comments, and feedback.
Though there are certainly some issues , I think most current large language models are pretty well aligned. Despite its alignment faking , my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it'd be a pretty close call (I'd probably pick Claude, but it depends on the details of the setup). So, overall, I'm quite positive on the alignment of current models! And yet, I remain very worried about alignment in the future. This is my attempt to explain why that is.
What makes alignment hard?
I really like this graph from Chris Olah for illustrating different levels of alignment difficulty:
If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues—that is, issues like reward hacking or agentic misalignment where there is a straightforward behavioral alignment issue that we can detect and evaluate—then we are very much in the trivial/steam engine world. We could still fail, even in that world—and it’d be particularly embarrassing to fail that way; we should definitely make sure we don’t—but I think we’re very much up to that challenge and I don’t expect us to fail there.
My argument, though, is that it is still very possible for the difficulty of alignment to be in the Apollo regime, and that we haven't received much evidence to rule that regime out (I am somewhat skeptical of a P vs. NP level of difficulty, though I think it could be close to that). I retain a view close to the “Anthropic” view on Chris's graph, and I think the reasons to have substantial probability mass on the hard worlds remain strong.
So what are the reasons that alignment might be hard? I think it’s worth revisiting why we ever thought alignment might be difficult in the first place to understand the extent to which we’ve already solved these problems, gotten evidence that they aren’t actually problems in the first place, or just haven’t encountered them yet.
Outer alignment
The first reason that alignment might be hard is outer alignment , which here I’ll gloss as the problem of overseeing systems that are smarter than you a
... (truncated, 98 KB total)b3e584e2a39991b5 | Stable ID: ZTM2YzNiZj