EA Forum - Some Quick Thoughts on "AI Is Easy to Control"
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: EA Forum
This EA Forum post responds to optimistic claims about AI controllability, offering a counterargument relevant to debates about whether current alignment progress meaningfully addresses risks from future superintelligent systems.
Forum Post Details
Metadata
Summary
Mikhail Samin critiques claims that AI is easy to control, arguing that such claims conflate the controllability of current subhuman systems with the fundamentally harder problem of controlling superintelligent systems. He contends that success in overseeing weaker AI cannot be bootstrapped into alignment for superhuman systems: the capability jump is a qualitative shift that leaves the control problem demanding oversight capabilities humans do not yet possess.
Key Points
- Conflating controllability of current AI with alignment of superintelligent AI misrepresents the core safety challenge.
- Subhuman AI systems may be easier to oversee, but controlling superhuman systems requires oversight capabilities that humans don't yet possess.
- Success with weaker AI systems cannot be simply extrapolated to solve alignment for more capable, superhuman systems.
- The leap from subhuman to superhuman capability is a qualitative shift, not merely a quantitative one, making the control problem fundamentally harder.
- Alignment and controllability are distinct concepts, and conflating them obscures the real difficulty of the long-term safety problem.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| ControlAI | Organization | 63.0 |
Cached Content Preview
Some quick thoughts on "AI is easy to control" — EA Forum
by MikhailSamin, Dec 7 2023, 9 min read
There are many things I feel like the post authors miss, and I want to share a few thoughts that seem good to communicate.
I'm going to focus on controlling superintelligent AI systems: systems powerful enough to solve alignment (in the CEV sense) completely, or to kill everyone on the planet.
In this post, I'm going to ignore other AI-related sources of x-risk, such as AI-enabled bioterrorism, and I'm not commenting on everything that seems important to comment on.
I'm also not going to point at all the slippery claims that I think can make the reader generalize incorrectly, as that would be nitpicky and not worth the time. (Examples of what I'd skip: I couldn't find evidence that GPT-4 has undergone any supervised fine-tuning; RLHF shapes chatbots' brains into the kind of systems that produce outputs that make human graders click thumbs-up/"I prefer this text", and smart systems that do that are not themselves necessarily "preferred" by human graders; one footnote [1].)
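To make the RLHF point concrete, here is a minimal sketch (mine, not the post's; toy random embeddings stand in for model outputs) of the pairwise preference loss commonly used to train an RLHF reward model. Note what the score attaches to: individual outputs that graders compared, which is why optimizing against it produces a system whose outputs are approved, not a system that graders would themselves endorse.

```python
# Sketch, not the post's code: a toy reward model trained on pairwise
# grader preferences. The reward scores *outputs*, so a policy maximizing
# it is shaped to produce approved outputs, nothing more.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy stand-in: maps an output embedding to a scalar 'grader approval' score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, output_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(output_embedding).squeeze(-1)

def preference_loss(rm: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: push the preferred output's score
    # above the rejected output's score.
    return -torch.nn.functional.logsigmoid(rm(preferred) - rm(rejected)).mean()

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
preferred, rejected = torch.randn(32, 16), torch.randn(32, 16)  # placeholder data
for _ in range(100):
    opt.zero_grad()
    loss = preference_loss(rm, preferred, rejected)
    loss.backward()
    opt.step()
```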
Intro
> many people are worried that we will lose control of artificial intelligence, leading to human extinction or a similarly catastrophic "AI takeover." We hope the arguments in this essay make such an outcome seem implausible. But even if future AI turns out to be less "controllable" in a strict sense of the word — simply because, for example, it thinks faster than humans can directly supervise — we also argue it will be easy to instill our values into an AI, a process called "alignment."
This misrepresents the worry. Saying "but even if" makes it look like people worrying about x-risk place credence on "loss of control leads to x-risk regardless of/despite alignment"; that these people are wrong, as the post shows "this outcome" to be implausible; and, separately, that even if they're right about loss of control, they're wrong about x-risk, because alignment will make it fine.
But mostly, people (including the leading voices) are worried specifically about capable misaligned systems leading to human extinction. I don't know anyone in the community who'd say that a CEV-aligned superintelligence grabbing control would be a bad thing leading to extinction.
> Since each generation of controllable AIs can help control the next generation, it looks like this process can continue indefinitely, even to very high levels of capability
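To make the structure of this quoted claim explicit, here is a hypothetical sketch (an illustration of the argument, not code from either post): the scheme is a loop whose invariant is that the current overseer can evaluate the next generation, and the dispute is over whether that invariant survives the jump to superhuman capability.

```python
# Hypothetical sketch of the bootstrapping scheme the quoted essay asserts.
# The objection is that the loop's invariant -- the overseer can evaluate
# the generation it oversees -- is exactly what breaks at superhuman levels.
def bootstrap_oversight(generations, can_evaluate):
    """generations: AI systems, weakest first; returns the last one controlled."""
    overseer = "humans"  # oversight starts with human graders
    for ai in generations:
        if not can_evaluate(overseer, ai):
            # Nothing in the scheme guarantees this check keeps passing.
            raise RuntimeError(f"oversight gap: {overseer} cannot evaluate {ai}")
        overseer = ai  # the newly controlled system becomes the next overseer
    return overseer

# Toy capability model (an assumption for illustration): an overseer can
# evaluate systems at most one "level" above its own.
levels = {"humans": 1, "gpt-n": 2, "gpt-n+1": 3, "superintelligence": 10}
can_eval = lambda a, b: levels[b] - levels[a] <= 1
bootstrap_oversight(["gpt-n", "gpt-n+1"], can_eval)    # succeeds
# bootstrap_oversight(["superintelligence"], can_eval) # raises: oversight gap
```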
I expect it to be easy to reward-shape AIs below a certain level [2] of capability, and I worry ab
... (truncated, 26 KB total)