
Quintin Pope and collaborators

blog

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

A foundational sequence introducing Shard Theory as an alternative to reward-centric alignment frameworks, influential in LessWrong/Alignment Forum circles for reframing how learned values and inner alignment are conceptualized.

Metadata

Importance: 78/100 · blog post · primary source

Summary

Shard Theory is a research sequence by Quintin Pope, Alex Turner, and collaborators proposing that AI values emerge as multiple competing 'shards'—context-dependent learned behavioral dispositions—rather than a single optimized reward signal. The framework challenges classical alignment decompositions like inner/outer alignment and draws on human value formation as evidence. It offers alternative design principles for building aligned agents.
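
One way to see the "reward shapes but does not define goals" claim is with a toy policy-gradient loop. The sketch below is purely illustrative (the bandit setup, reward rule, and learning rate are assumptions for exposition, not the sequence's own argument): in a REINFORCE-style update, reward appears only as a scalar weight on parameter updates, reinforcing whatever computations produced rewarded actions; nothing in the update requires the trained policy to internally represent or pursue reward.

```python
import numpy as np

# Toy REINFORCE-style loop (illustrative, not from the sequence).
# Reward enters only as a scalar multiplier on the policy-gradient
# update: it reinforces the computations that produced rewarded
# actions rather than serving as an objective the trained policy
# must internally represent or seek.

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits over two actions

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(1000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0    # hypothetical reward signal
    grad_log_prob = -probs                  # d/d_theta log pi(action)
    grad_log_prob[action] += 1.0
    theta += 0.1 * reward * grad_log_prob   # reward only scales the update

print(softmax(theta))  # behavior has been shaped toward action 1
```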

Key Points

  • Values in trained agents emerge as multiple competing 'shards' (context-triggered behavioral patterns), not as monolithic reward optimization; see the toy sketch after this list.
  • Reward signals during training shape but do not define agent goals; 'reward is not the optimization target.'
  • The inner/outer alignment decomposition may be counterproductive, replacing one hard problem with two even harder ones.
  • Human value formation provides underutilized empirical evidence for how aligned AI values could develop naturally through training.
  • Proposes concrete alignment-relevant design principles: don't design agents that exploit adversarial inputs, and don't align agents to evaluations of plans.
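
To make the "competing shards" picture concrete, here is a minimal toy sketch loosely echoing the juice example from 'The shard theory of human values'. The shard names, trigger functions, and bidding scheme are hypothetical illustrations, not the sequence's formalism: each shard is a context-triggered disposition that bids for actions, and behavior falls out of whichever bids dominate in the current context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy model of shards (hypothetical illustration, not the sequence's
# formalism): each shard is a context-triggered behavioral disposition
# that bids for actions; behavior emerges from competing bids rather
# than from optimizing a single reward signal.

@dataclass
class Shard:
    name: str
    trigger: Callable[[dict], float]  # context -> activation in [0, 1]
    action_bids: Dict[str, float]     # action -> bid weight

def act(shards: List[Shard], context: dict) -> str:
    """Choose the action with the highest activation-weighted total bid."""
    totals: Dict[str, float] = {}
    for shard in shards:
        strength = shard.trigger(context)
        for action, bid in shard.action_bids.items():
            totals[action] = totals.get(action, 0.0) + strength * bid
    return max(totals, key=totals.get)

# Hypothetical shards: one fires on visible juice, one on a caregiver.
shards = [
    Shard("juice", lambda c: 1.0 if c.get("juice_visible") else 0.0,
          {"grab_juice": 1.0}),
    Shard("social", lambda c: 0.8 if c.get("caregiver_present") else 0.0,
          {"smile": 1.0, "grab_juice": -0.3}),
]

# With both triggers active, the social shard's bids tip the balance.
print(act(shards, {"juice_visible": True, "caregiver_present": True}))
# -> 'smile'
```

Which shard wins is context-dependent: drop `caregiver_present` from the context and the same agent grabs the juice, illustrating behavior as a contextual compromise among dispositions rather than pursuit of one objective.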

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Easy | Argument | 53.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 3 KB

# Shard Theory

_Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2:_

![](https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/a6a9e89a88f93ea26fc1150c0634979f913c5528c1d78d06.png)

1. [Humans provide an untapped wealth of evidence about alignment](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/CjFZeDD6iCnNubDoS), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) and [Quintin Pope](https://www.alignmentforum.org/users/quintin-pope) (4y, 59 points, 42 comments)
2. [Human values & biases are inaccessible to the genome](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/CQAMdzA4MZEhNRtTp), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (4y, 42 points, 38 comments)
3. [General alignment properties](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/FMdGt9S9irgxeD9Xz), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (4y, 26 points, 2 comments)
4. [Evolution is a bad analogy for AGI: inner alignment](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/FyChg3kYG54tEN3u6), by [Quintin Pope](https://www.alignmentforum.org/users/quintin-pope) (4y, 29 points, 1 comment)
5. [Reward is not the optimization target](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/pdaGN6pQyQarFHXF4), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (4y, 94 points, 88 comments)
6. [The shard theory of human values](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/iCfdcxiyr2Kj8m8mT), by [Quintin Pope](https://www.alignmentforum.org/users/quintin-pope) and [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 74 points, 33 comments)
7. [Understanding and avoiding value drift](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/jFvFreCeejRKaZv4v), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (4y, 23 points, 7 comments)
8. [A shot at the diamond-alignment problem](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/k4AQqboXz8iE5TNXK), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 36 points, 45 comments)
9. [Don't design agents which exploit adversarial inputs](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/jFCK9JRLwkoJX4aJA), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) and [Garrett Baker](https://www.alignmentforum.org/users/d0themath) (3y, 32 points, 33 comments)
10. [Don't align agents to evaluations of plans](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/fopZesxLCGAXqqaPv), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 24 points, 32 comments)
11. [Alignment allows "nonrobust" decision-influences and doesn't require robust grading](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/rauMEna2ddf26BqiE), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 29 points, 31 comments)
12. [Inner and outer alignment decompose one hard problem into two extremely hard problems](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/gHefoxiznGfsbiAu9), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 43 points, 14 comments)


... (truncated, 3 KB total)
Resource ID: 533f1062192748de | Stable ID: YTM5Nzg4M2