200 Concrete Open Problems in Mechanistic Interpretability — AI Alignment Forum

blog

Alignment Forum·alignmentforum.org/s/yivyHaCAmMJ3CqSyj

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

Metadata

Cited by 1 page

Page	Type	Quality
Neel Nanda	Person	26.0

Cached Content Preview

HTTP 200 · Fetched Apr 30, 2026 · 3 KB


# 200 Concrete Open Problems in Mechanistic Interpretability

17 [Concrete Steps to Get Started in Transformer Mechanistic Interpretability](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/9ezkEb9oGvEi6WoB3)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

5

41 [200 Concrete Open Problems in Mechanistic Interpretability: Introduction](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/LbrPTJ4fmABEdEnLf)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

0

18 [200 COP in MI: The Case for Analysing Toy Language Models](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/GWCgZrzWCZCuzGktv)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

2

8 [200 COP in MI: Looking for Circuits in the Wild](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/XNjRwEX9kxbpzWFWd)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

3

17 [200 COP in MI: Interpreting Algorithmic Problems](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/ejtFsvyhRkMofKAFy)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

0

18 [200 COP in MI: Exploring Polysemanticity and Superposition](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/o6ptPu7arZrqRCxyz)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

1

11 [200 COP in MI: Analysing Training Dynamics](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/hHaXzJQi6SKkeXzbg)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

0

7 [200 COP in MI: Techniques, Tooling and Automation](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/btasQF7wiCYPsr5qw)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

0

10 [200 COP in MI: Image Model Interpretability](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/caMoe6yNfXcaCG2u3)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

1

10 [200 COP in MI: Interpreting Reinforcement Learning](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/eqvvDM25MXLGqumnf)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

0

11 [200 COP in MI: Studying Learned Features in Language Models](https://www.alignmentforum.org/s/yivyHaCAmMJ3CqSyj/p/Qup9gorqpd9qKAEav)

[Neel Nanda](https://www.alignmentforum.org/users/neel-nanda-1)
3y

2

Resource ID: 856cb0a13a71ff2c | Stable ID: sid_vwWkgXPGow