
Will Power-Seeking AGIs Harm Human Society? A Philosophical Analysis of Instrumental Convergence and World Model Alignment


A philosophy paper, hosted on PhilArchive and published in AI & SOCIETY, examining whether power-seeking AGIs would harm human society; relevant to foundational debates about instrumental convergence and the anthropomorphic assumptions behind existential-risk arguments.

Metadata

Importance: 42/100 · journal article · primary source

Summary

Wang (2025) challenges arguments that move from the Instrumental Convergence Thesis to the claim that power-seeking AGIs pose existential threats. The paper argues that these arguments rest on an implicit, underexamined anthropomorphic assumption: that AGIs will develop world models (internal representations of world dynamics) resembling those of humans. Once that assumption is rejected, it becomes unclear whether AGIs would pursue the types of power commonly emphasized in the literature, while AGIs with non-human world models may pursue novel, unanticipated types of power. The paper proposes world model alignment as a core but largely overlooked dimension of AI alignment.

Key Points

  • Challenges the implicit assumption, common in instrumental-convergence arguments, that AGIs will develop human-like world models
  • Argues that once this anthropomorphic assumption is rejected, it is unclear whether AGIs would pursue the types of power emphasized in the literature, such as self-preservation, goal preservation, cognitive enhancement, technological development, and resource acquisition
  • Casts doubt on the strength of existing arguments linking the Instrumental Convergence Thesis to existential threats
  • Warns that AGIs with non-human world models may identify novel types of power that fall outside existing taxonomies, posing underappreciated risks
  • Proposes world model alignment, largely overlooked relative to value alignment, as a core dimension of AI alignment, and outlines open questions for future research

Cited by 1 page

Page              Type  Quality
Power-Seeking AI  Risk  67.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 58 KB
AI & SOCIETY
https://doi.org/10.1007/s00146-025-02572-8
OPEN FORUM
Will power-seeking AGIs harm human society?
Maomei Wang
Received: 14 January 2025 / Accepted: 12 August 2025
© The Author(s) 2025
Abstract
Many have argued, based on the Instrumental Convergence Thesis, that Artificial General Intelligences (AGIs) will exhibit power-seeking behavior. Such behavior, they warn, could harm human society and pose existential threats—namely, the risk of human extinction or the permanent collapse of civilization. These arguments often rely on an implicit and underexamined assumption: that AGIs will develop world models—internal representations of world dynamics—that resemble those of humans. We challenge this assumption. We argue that once the anthropomorphic assumption—that AGIs' world models will mirror our own—is rejected, it becomes unclear whether AGIs would pursue the types of power commonly emphasized in the literature, or any familiar types of power at all. This analysis casts doubt on the strength of existing arguments linking the Instrumental Convergence Thesis to existential threats. Moreover, it reveals a deeper layer of uncertainty. AGIs with non-human world models may identify novel or unanticipated types of power that fall outside existing taxonomies, thereby posing underappreciated risks. We further argue that world model alignment—an issue largely overlooked in comparison with value alignment—should be recognized as a core dimension of AI alignment. We conclude by outlining several open questions to inform and guide future research.
Keywords: AGI risk · Instrumental convergence thesis · Existential catastrophe · AI alignment · Anthropomorphism · World model alignment
1 Introduction
Artificial Intelligence (AI) systems are beginning to match or even surpass human performance in specialized tasks, raising concerns about Artificial General Intelligences (AGIs) capable of equaling or exceeding human capabilities across a broad range of domains (Fitzgerald et al. 2020; Morris et al. 2023; Mitchell 2024). Researchers have been investigating potential long-term risks posed by AGIs, especially the possibility of existential catastrophe. These concerns have prompted growing calls to develop advanced AI systems that are both safe and aligned (Russell 2019; Hendrycks and Mazeika 2022; McLean et al. 2023; Ji et al. 2024; Shah et al. 2025).

A central concern in the literature is that AGIs—being instrumentally rational and highly capable—will pursue certain subgoals regardless of their specific final objectives. This claim is known as the Instrumental Convergence Thesis (Omohundro 2008; Bostrom 2014). In particular, AGIs may pursue various types of power, such as self-preservation, goal preservation, cognitive enhancement, technological development, and resource acquisition (Omohundro 2008; Bostrom 2014; Carlsmith 2023). Building on this idea, several authors argue that AGIs may attempt to avoid shutdown, acquire resources,

... (truncated, 58 KB total)
Resource ID: 11b3293fe3c3e0c7 | Stable ID: OWRmNTgxYm