
Quintin Pope and collaborators

blog

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

A foundational sequence introducing Shard Theory as an alternative to reward-centric alignment frameworks, influential in LessWrong/Alignment Forum circles for reframing how learned values and inner alignment are conceptualized.

Metadata

Importance: 78/100 · blog post · primary source

Summary

Shard Theory is a research sequence by Quintin Pope, Alex Turner, and collaborators proposing that AI values emerge as multiple competing 'shards'—context-dependent learned behavioral dispositions—rather than a single optimized reward signal. The framework challenges classical alignment decompositions like inner/outer alignment and draws on human value formation as evidence. It offers alternative design principles for building aligned agents.
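
One way to see the "reward shapes but does not define goals" claim is with a toy policy-gradient loop. The sketch below is purely illustrative (the bandit setup, reward rule, and learning rate are assumptions for exposition, not the sequence's own argument): in a REINFORCE-style update, reward appears only as a scalar weight on parameter updates, reinforcing whatever computations produced rewarded actions; nothing in the update requires the trained policy to internally represent or pursue reward.

```python
import numpy as np

# Toy REINFORCE-style loop (illustrative, not from the sequence).
# Reward enters only as a scalar multiplier on the policy-gradient
# update: it reinforces the computations that produced rewarded
# actions rather than serving as an objective the trained policy
# must internally represent or seek.

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits over two actions

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(1000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0    # hypothetical reward signal
    grad_log_prob = -probs                  # d/d_theta log pi(action)
    grad_log_prob[action] += 1.0
    theta += 0.1 * reward * grad_log_prob   # reward only scales the update

print(softmax(theta))  # behavior has been shaped toward action 1
```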

Key Points

  • Values in trained agents emerge as multiple competing 'shards' (context-triggered behavioral patterns), not as monolithic reward optimization; see the toy sketch after this list.
  • Reward signals during training shape but do not define agent goals; 'reward is not the optimization target.'
  • The inner/outer alignment decomposition may be counterproductive, replacing one hard problem with two even harder ones.
  • Human value formation provides underutilized empirical evidence for how aligned AI values could develop naturally through training.
  • Proposes concrete alignment-relevant design principles: don't design agents that exploit adversarial inputs, and don't align agents to evaluations of plans.
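
To make the "competing shards" picture concrete, here is a minimal toy sketch loosely echoing the juice example from 'The shard theory of human values'. The shard names, trigger functions, and bidding scheme are hypothetical illustrations, not the sequence's formalism: each shard is a context-triggered disposition that bids for actions, and behavior falls out of whichever bids dominate in the current context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy model of shards (hypothetical illustration, not the sequence's
# formalism): each shard is a context-triggered behavioral disposition
# that bids for actions; behavior emerges from competing bids rather
# than from optimizing a single reward signal.

@dataclass
class Shard:
    name: str
    trigger: Callable[[dict], float]  # context -> activation in [0, 1]
    action_bids: Dict[str, float]     # action -> bid weight

def act(shards: List[Shard], context: dict) -> str:
    """Choose the action with the highest activation-weighted total bid."""
    totals: Dict[str, float] = {}
    for shard in shards:
        strength = shard.trigger(context)
        for action, bid in shard.action_bids.items():
            totals[action] = totals.get(action, 0.0) + strength * bid
    return max(totals, key=totals.get)

# Hypothetical shards: one fires on visible juice, one on a caregiver.
shards = [
    Shard("juice", lambda c: 1.0 if c.get("juice_visible") else 0.0,
          {"grab_juice": 1.0}),
    Shard("social", lambda c: 0.8 if c.get("caregiver_present") else 0.0,
          {"smile": 1.0, "grab_juice": -0.3}),
]

# With both triggers active, the social shard's bids tip the balance.
print(act(shards, {"juice_visible": True, "caregiver_present": True}))
# -> 'smile'
```

Which shard wins is context-dependent: drop `caregiver_present` from the context and the same agent grabs the juice, illustrating behavior as a contextual compromise among dispositions rather than pursuit of one objective.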

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Easy | Argument | 53.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 3 KB

# Shard Theory

_Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2:_

![](https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/a6a9e89a88f93ea26fc1150c0634979f913c5528c1d78d06.png)

1. [Humans provide an untapped wealth of evidence about alignment](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/CjFZeDD6iCnNubDoS), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) and [Quintin Pope](https://www.alignmentforum.org/users/quintin-pope) (4y, 59 points, 42 comments)
2. [Human values & biases are inaccessible to the genome](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/CQAMdzA4MZEhNRtTp), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (4y, 42 points, 38 comments)
3. [General alignment properties](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/FMdGt9S9irgxeD9Xz), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (4y, 26 points, 2 comments)
4. [Evolution is a bad analogy for AGI: inner alignment](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/FyChg3kYG54tEN3u6), by [Quintin Pope](https://www.alignmentforum.org/users/quintin-pope) (4y, 29 points, 1 comment)
5. [Reward is not the optimization target](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/pdaGN6pQyQarFHXF4), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (4y, 94 points, 88 comments)
6. [The shard theory of human values](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/iCfdcxiyr2Kj8m8mT), by [Quintin Pope](https://www.alignmentforum.org/users/quintin-pope) and [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 74 points, 33 comments)
7. [Understanding and avoiding value drift](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/jFvFreCeejRKaZv4v), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (4y, 23 points, 7 comments)
8. [A shot at the diamond-alignment problem](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/k4AQqboXz8iE5TNXK), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 36 points, 45 comments)
9. [Don't design agents which exploit adversarial inputs](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/jFCK9JRLwkoJX4aJA), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) and [Garrett Baker](https://www.alignmentforum.org/users/d0themath) (3y, 32 points, 33 comments)
10. [Don't align agents to evaluations of plans](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/fopZesxLCGAXqqaPv), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 24 points, 32 comments)
11. [Alignment allows "nonrobust" decision-influences and doesn't require robust grading](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/rauMEna2ddf26BqiE), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 29 points, 31 comments)
12. [Inner and outer alignment decompose one hard problem into two extremely hard problems](https://www.alignmentforum.org/s/nyEFg3AuJpdAozmoX/p/gHefoxiznGfsbiAu9), by [TurnTrout](https://www.alignmentforum.org/users/turntrout) (3y, 43 points, 14 comments)


... (truncated, 3 KB total)
Resource ID: 533f1062192748de | Stable ID: YTM5Nzg4M2