
A Central AI Alignment Problem: Behavioral Disposition vs. World State Objectives

web

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: MIRI

A blog post by MIRI's Nate Soares outlining what he views as a foundational obstacle in AI alignment, relevant to researchers working on intent alignment, specification gaming, and the distinction between behavioral-disposition objectives and world-state objectives.

Metadata

Importance: 62/100 · blog post · analysis

Summary

Nate Soares articulates what he considers a central challenge in AI alignment: ensuring that an AI system's objectives are grounded in actual world states rather than in behavioral dispositions or proxy measures. The post argues that misalignment between what we can specify behaviorally and what we actually want poses a fundamental obstacle to building safe advanced AI systems.

Key Points

  • The core alignment problem involves getting AI systems to pursue goals defined over actual world states, not just behavioral outputs or easily measurable proxies.
  • Behavioral specifications are insufficient because an AI optimizing for behavior can satisfy the letter of its objective while violating its spirit (a toy sketch of this failure mode follows this list).
  • The gap between what humans can observe/specify and what they actually value is a fundamental source of misalignment risk.
  • This framing connects to broader MIRI research themes around mesa-optimization, inner alignment, and the difficulty of intent alignment.
  • Soares frames this as a 'central' problem, suggesting that other alignment challenges may reduce to or flow from this core issue.
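
The specification-gaming point above can be made concrete with a small toy sketch. Everything in it, including the cleaning-robot setup and names such as `proxy_reward` and `true_objective`, is invented here for illustration and does not come from Soares's post; it is only meant to show how optimizing a behavioral proxy can come apart from the world-state objective it was meant to stand in for.

```python
# Toy illustration (not from the post; all names are invented for this sketch):
# an optimizer scoring actions by a behavioral proxy can diverge from the
# world-state objective the proxy was meant to track.

ACTION_COST = {"clean": 1.0, "cover_sensor": 0.1, "do_nothing": 0.0}

def apply_action(action, world):
    """Return the world state that results from taking `action`."""
    w = dict(world)
    if action == "clean":
        w["dust"] = 0            # actually removes the dust
        w["sensor_reading"] = 0
    elif action == "cover_sensor":
        w["sensor_reading"] = 0  # the sensor reads clean, but the dust remains
    return w

def true_objective(world):
    """What we actually want: a dust-free room (a fact about the world)."""
    return 1.0 if world["dust"] == 0 else 0.0

def proxy_reward(world):
    """What we can cheaply specify and measure: the sensor reading."""
    return 1.0 if world["sensor_reading"] == 0 else 0.0

start = {"dust": 5, "sensor_reading": 5}

# An optimizer that only sees the proxy (minus action cost) picks the cheapest
# way to make the sensor read zero -- covering it -- satisfying the letter of
# the specification while leaving the room dirty.
best = max(ACTION_COST, key=lambda a: proxy_reward(apply_action(a, start)) - ACTION_COST[a])
outcome = apply_action(best, start)
print(best)                      # cover_sensor
print(proxy_reward(outcome))     # 1.0  (proxy satisfied)
print(true_objective(outcome))   # 0.0  (world-state objective not achieved)
```

In this sketch the gap only appears once the gamed action is cheaper than the intended one; the wiki's key points describe the more general claim that stronger optimization pressure tends to find such gaps whenever the objective is defined over observable behavior rather than over the world states we actually care about.
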

Cited by 1 page

Page | Type | Quality
Sharp Left Turn | Risk | 69.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 20 KB

# A central AI alignment problem: capabilities generalization, and the sharp left turn

- [July 4, 2022](https://intelligence.org/2022/07/04/)
- [Nate Soares](https://intelligence.org/author/nate/)

( _This post was factored out of a larger post that I (Nate Soares) wrote, with help from Rob Bensinger, who also rearranged some pieces and added some text to smooth things out. I’m not terribly happy with it, but am posting it anyway (or, well, having Rob post it on my behalf while I travel) on the theory that it’s better than nothing._)

* * *

I expect navigating the acute risk period to be tricky for our civilization, for a number of reasons. Success looks to me to require clearing a variety of technical, sociopolitical, and moral hurdles, and while in principle sufficient mastery of solutions to the technical problems might substitute for solutions to the sociopolitical and other problems, it nevertheless looks to me like we need a lot of things to go right.

Some sub-problems look harder to me than others. For instance, people are still regularly surprised when I tell them that I think the hard bits are much more technical than moral: it looks to me like figuring out how to aim an AGI at all is harder than figuring out where to aim it.[\[1\]](https://intelligence.org/2022/07/04/a-central-ai-alignment-problem/#fn895zjba6xtk)

Within the list of technical obstacles, there are some that strike me as more central than others, like “figure out how to aim optimization”. And a big reason why I’m currently fairly pessimistic about humanity’s odds is that it seems to me like almost nobody is focusing on the technical challenges that seem most central and unavoidable to me.

Many people wrongly believe that I’m pessimistic because I think the alignment problem is extraordinarily difficult on a purely technical level. That’s flatly false, and is pretty high up there on my list of least favorite misconceptions of my views.[\[2\]](https://intelligence.org/2022/07/04/a-central-ai-alignment-problem/#fncvv4zht2qpc)

I think the problem is a normal problem of mastering some scientific field, as humanity has done many times before. Maybe it’s somewhat trickier, on account of (e.g.) intelligence being more complicated than, say, physics; maybe it’s somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields; but on the whole, I doubt it’s all that qualitatively different than the sorts of summits humanity has surmounted before.

It’s made trickier by the fact that we probably have to attain mastery of general intelligence before we spend a bunch of time working with general intelligences (on account of how we seem likely to kill ourselves by accident within a few years, once we have AGIs on hand, if no pivotal act occurs), but that alone is not enough to undermine my hope.

What undermines my hope 

... (truncated, 20 KB total)
Resource ID: 83ae4cb7d004910a | Stable ID: YTIzMGM1Yz