Back
Not Covered: October 2024 Alignment - Bluedot Blog
blog.bluedot.org/p/not-covered-2410-alignment
A supplementary blog post from BlueDot Impact's alignment course, useful for learners wanting brief introductions to developmental interpretability, agent foundations, and shard theory beyond the core curriculum.
Metadata
Importance: 35/100 · blog post · educational
Summary
BlueDot Impact introduces three AI alignment research areas omitted from their 8-week October 2024 course: developmental interpretability (studying model structure changes during training), agent foundations (theoretical research on agentic AI drawing from math and philosophy), and shard theory (framing RL agents as driven by multiple contextual 'shards' rather than unified goals). Each section provides brief overviews and pointers to deeper resources.
Key Points
- Developmental interpretability studies phases and phase transitions during training, inspired by developmental biology, rather than static model features and circuits.
- Agent foundations is theoretical research providing building blocks for understanding powerful agentic AI, drawing from mathematics, philosophy, and theoretical computer science.
- Shard theory proposes RL agents are better understood as driven by many contextual 'shards' influencing decisions, rather than following specified goals.
- These areas were excluded from the 8-week course due to scope constraints but represent active and emerging research agendas in alignment.
- Each topic includes pointers to foundational reading (e.g., MIRI's Technical Agenda for agent foundations, LessWrong posts for dev interp and shard theory).
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 15, 2026 · 13 KB
# [BlueDot Impact](https://blog.bluedot.org/)
# What we didn't cover in our October 2024 AI Alignment course
[Adam Jones](https://substack.com/@domdomegg)
Dec 12, 2024
[An 8-week part-time course](https://aisafetyfundamentals.com/alignment-course-details/) can’t cover everything there is to know about [AI alignment](https://aisafetyfundamentals.com/blog/what-is-ai-alignment/), especially given it’s a fast-moving field with many budding research agendas.
This resource gives a brief introduction to a few areas we didn’t touch on, with pointers to resources if you want to explore them further.
## **Developmental interpretability**
We covered mechanistic interpretability (or “mech interp”) [in session 5](https://course.aisafetyfundamentals.com/alignment?session=5). To recap, this is an approach that ‘zooms in’ to models to make sense of their learned representations and weights through methods like feature or circuit analysis.
Developmental interpretability (or “dev interp”) instead studies how the structure of models changes as they are trained. Rather than features and circuits, here the object of study is phases and phase transitions during the training process. This takes inspiration from developmental biology, which increased our understanding of biology by studying the key steps of how living organisms develop.
[Read more](https://www.lesswrong.com/posts/TjaeCWvLZtEDAS5Ex/towards-developmental-interpretability).
## **Agent foundations**
Most of this course has explored fairly empirical research agendas that have been targeting the kinds of AI systems we have today: neural network architectures trained with gradient descent. However, it’s unclear whether we’ll continue to build future AI systems with similar technology. You’ll also h
... (truncated, 13 KB total)
Resource ID: 27d9202885cc6bcf | Stable ID: YjNjNjI4ZD