Longterm Wiki

Alignment Forum: Mind the Coherence Gap

blog

Author

Eitan Sprejer

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

Empirical blog post on the Alignment Forum relevant to mechanistic interpretability and the practical limitations of sparse autoencoder-based feature steering as a behavioral alignment technique.

Metadata

Importance: 52/100 | blog post | analysis

Summary

This post investigates the 'coherence gap': the disconnect between successfully steering individual features in an LLM and achieving coherent, consistent changes in the model's overall behavior. Using Goodfire's Auto Steer tool with sparse autoencoders (SAEs) on Llama models, the post shows that feature-level interventions frequently produce incoherent outputs, contradictions, and unintended side effects. The findings point to limitations of mechanistic feature steering as a path to reliable model alignment.

Key Points

  • Feature steering via SAEs can activate/suppress specific features but often fails to produce globally coherent or consistent model behavior.
  • The 'coherence gap' describes the systemic discrepancy between local feature-level control and model-wide behavioral alignment.
  • Experiments on both Llama-8b-3.1 and Llama-70b-3.3 reveal incoherence, contradictions, and unintended side effects from steering interventions.
  • Goodfire's Auto Steer tool was used for empirical testing, with comparisons to manual agentic feature search methods.
  • Results suggest feature-level mechanistic interventions alone may be insufficient for reliable, aligned behavior change in LLMs.
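
As a rough illustration of the first and third points, the toy sketch below steers one feature by adding a multiple of its decoder direction to an activation vector, then re-encodes to see what else moved. Everything here is an assumption made for illustration: the two-dimensional activation space, the weights `ENC`, `DEC`, and `BIAS`, and the helper names are hypothetical, and this is neither Goodfire's API nor the procedure used in the post.

```python
# Toy sparse-autoencoder steering sketch (hypothetical weights and names).

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def encode(x, enc_rows, bias):
    """Feature activations: ReLU(W_enc @ x + b), one value per feature."""
    return [max(0.0, dot(row, x) + b) for row, b in zip(enc_rows, bias)]

def steer(x, dec_rows, feature_idx, strength):
    """Add `strength` times one feature's decoder direction to x."""
    direction = dec_rows[feature_idx]
    return [xi + strength * di for xi, di in zip(x, direction)]

# Two features whose directions are NOT orthogonal (assumed for the demo).
ENC = [[1.0, 0.0], [0.6, 0.8]]
DEC = [[1.0, 0.0], [0.6, 0.8]]
BIAS = [0.0, 0.0]

x = [0.5, 0.5]                      # unsteered activation
before = encode(x, ENC, BIAS)       # feature activations before steering
x_steered = steer(x, DEC, feature_idx=0, strength=2.0)
after = encode(x_steered, ENC, BIAS)

# Change in each feature's activation caused by steering feature 0 only.
deltas = [a - b for a, b in zip(after, before)]
print("before:", before)
print("after: ", after)
print("deltas:", deltas)
```

In this toy setup the targeted feature rises as intended, but the second feature's activation also increases because the two directions overlap. That is only one simple mechanism by which a local intervention can have non-local effects; the post's findings concern real model outputs, not this construction.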

Cited by 1 page

Page | Type | Quality
Goodfire | Organization | 68.0

Cached Content Preview

HTTP 200 | Fetched Mar 15, 2026 | 22 KB
[Mind the Coherence Gap: Lessons from Steering Llama with Goodfire](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#)

8 min read

  • [TL;DR](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#TL_DR)
  • [Key findings.](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Key_findings_)
  • [Disclaimers.](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Disclaimers_)
  • [Motivation](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Motivation)
  • [Related Work](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Related_Work)
  • [Key Concepts](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Key_Concepts)
  • [Feature steering](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Feature_steering)
  • [Features and their activations](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Features_and_their_activations)
  • [Steering query](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Steering_query)
  • [Auto Steer (Goodfire)](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Auto_Steer__Goodfire_)
  • [Manual Search (Agentic) vs. Auto Steer](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Manual_Search__Agentic__vs__Auto_Steer)
  • [Methodology](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Methodology)
  • [Results](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Results)
  • [Llama-8b-3.1](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Llama_8b_3_1)
  • [Llama-70b-3.3](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Llama_70b_3_3)
  • [Conclusions](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Conclusions)
  • [Next directions](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Next_directions)
  • [Limitations](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coherence-gap-lessons-from-steering-llama-with-1#Limitations)
  • [Contact](https://www.alignmentforum.org/posts/6dpKhtniqR3rnstnL/mind-the-coher

... (truncated, 22 KB total)
Resource ID: 215c3fc819c3737f | Stable ID: ZGJmOTNmMD