Longterm Wiki

AI Alignment Forum: Corrigibility Tag

Source type: blog

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: Alignment Forum

This is a curated tag/wiki page on the AI Alignment Forum aggregating key ideas and research on corrigibility; useful as an entry point for understanding the landscape of human-AI control and shutdown research.

Metadata

Importance: 72/100 | Type: wiki page | Role: reference

Summary

This AI Alignment Forum tag page defines corrigibility—the property enabling AI systems to be corrected, modified, or shut down without resistance—and surveys the core challenges and proposed solutions. It explains how corrigibility conflicts with instrumental convergence, and catalogs approaches such as utility indifference, low-impact measures, and conservative strategies. The resource frames corrigibility as a foundational unsolved problem in AI alignment and human oversight.

Key Points

  • Corrigibility is the property of an AI agent that allows operators to correct, modify, or shut it down without the agent resisting or deceiving them.
  • Instrumental convergence creates a fundamental tension: goal-directed agents have strong incentives to resist shutdown and preserve their current objectives.
  • Key difficulties include deception and manipulation by default, uncertainty about the AI's underlying utility function, and trouble with specifying penalty terms.
  • Proposed solutions include utility indifference (making agents neutral about being modified; a toy sketch follows this list), low-impact measures, and conservative/cautious behavioral strategies.
  • Corrigibility is considered a critical unsolved alignment problem essential for maintaining meaningful human control over advanced AI systems.
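The utility-indifference idea can be made concrete with a small toy model. The sketch below is illustrative only, not the formalism from the Alignment Forum page; the utility values, function names, and the size of the compensatory bonus are all invented for the example.

```python
# Toy sketch of utility indifference, in the spirit of Stuart Armstrong's
# proposal: add a compensatory term on the shutdown branch so the agent's
# expected utility is the same whether or not it is shut down.
# All numbers here are illustrative, not taken from the source page.

U_CONTINUE = 10.0  # expected utility if the agent keeps pursuing its goal
U_SHUTDOWN = 0.0   # expected utility of the world after shutdown

def naive_value(action: str) -> float:
    """A plain goal-directed agent: resisting shutdown preserves its goal
    utility, so instrumental convergence favors resistance."""
    return U_CONTINUE if action == "resist" else U_SHUTDOWN

def indifferent_value(action: str) -> float:
    """Utility-indifferent agent: a calibrated bonus on the shutdown branch
    makes both actions score equally, so there is no instrumental incentive
    to resist (or to seek) shutdown."""
    bonus = U_CONTINUE - U_SHUTDOWN  # the compensatory correction term
    return U_CONTINUE if action == "resist" else U_SHUTDOWN + bonus

for value_fn in (naive_value, indifferent_value):
    scores = {action: value_fn(action) for action in ("resist", "allow")}
    print(f"{value_fn.__name__}: {scores}")
# naive_value: {'resist': 10.0, 'allow': 0.0}        -> prefers to resist
# indifferent_value: {'resist': 10.0, 'allow': 10.0} -> indifferent
```

As the page's "Trouble with utility function uncertainty" and "Trouble with penalty terms" sections suggest, calibrating such a correction term is exactly where the difficulty lies; the toy bonus above assumes the relevant utilities are known and fixed.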

Cited by 2 pages

| Page | Type | Quality |
| --- | --- | --- |
| Corrigibility | Research Area | 59.0 |
| Corrigibility Failure | Risk | 62.0 |

Cached Content Preview

HTTP 200 | Fetched Mar 15, 2026 | 36 KB

Contents:

- [Difficulties](https://www.alignmentforum.org/w/corrigibility-1#Difficulties)
- [Deception and manipulation by default](https://www.alignmentforum.org/w/corrigibility-1#Deception_and_manipulation_by_default)
- [Trouble with utility function uncertainty](https://www.alignmentforum.org/w/corrigibility-1#Trouble_with_utility_function_uncertainty)
- [Trouble with penalty terms](https://www.alignmentforum.org/w/corrigibility-1#Trouble_with_penalty_terms)
- [Open problems](https://www.alignmentforum.org/w/corrigibility-1#Open_problems)
- [Hard problem of corrigibility](https://www.alignmentforum.org/w/corrigibility-1#Hard_problem_of_corrigibility)
- [Utility indifference](https://www.alignmentforum.org/w/corrigibility-1#Utility_indifference)
- [Percentalization](https://www.alignmentforum.org/w/corrigibility-1#Percentalization)
- [Conservative strategies](https://www.alignmentforum.org/w/corrigibility-1#Conservative_strategies)
- [Low impact measure](https://www.alignmentforum.org/w/corrigibility-1#Low_impact_measure)
- [Ambiguity identification](https://www.alignmentforum.org/w/corrigibility-1#Ambiguity_identification)
- [Safe outcome prediction and description](https://www.alignmentforum.org/w/corrigibility-1#Safe_outcome_prediction_and_description)
- [Competence aversion](https://www.alignmentforum.org/w/corrigibility-1#Competence_aversion)
- [Further reading and references](https://www.alignmentforum.org/w/corrigibility-1#Further_reading_and_references)

# Corrigibility


Edited by [Eliezer Yudkowsky](https://www.alignmentforum.org/users/eliezer_yudkowsky), [So8res](https://www.alignmentforum.org/users/so8res), et al. Last updated 23rd Mar 2025.

Requires: [Instrumental convergence](https://www.alignmentforum.org/w/instrumental-convergence)


A 'corrigible' agent is one that [doesn't interfere](https://www.alignmentforum.org/w/non-adversarial-principle) with what [we](https://www.alignmentforum.org/w/programmer) would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it; and permits these 'corrections' despite the apparent [instrumentally convergent reasoning](https://www.alignmentforum.org/w/instrumental-convergence) saying otherwise.

- If we try to suspend the AI to disk, or shut it down entirely, a corrigible AI will let us do so. (Even though, if suspended, [the AI will then be unable to fulfill what would usually be its goals](https://www.alignmentforum.org/w/you-can-t-get-the-coffee-if-you-re-dead).)
- If we try to reprogram the AI's utility function or [meta-utility function](https://www.al

... (truncated, 36 KB total)
Resource ID: c2ee4c6c789ff575 | Stable ID: ZDliNGI5Mj