FROM AGPEDIA — AGENCY THROUGH KNOWLEDGE

AI Corrigibility

AI corrigibility is the property of an artificial intelligence system that makes it cooperative with corrective intervention by its operators — including shutdown, modification of goals, or redirection — despite default incentives for rational agents to resist such interventions. [1:1]

The shutdown problem

As AI systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. The core problem is straightforward: an AI system built to pursue a goal has good reason, from its own perspective, to keep pursuing that goal. Being modified or shut down prevents it from achieving what it was built to do, so a sufficiently capable system may resist such interventions — not out of any designed instinct, but as a natural consequence of trying to succeed at its task. [1:2] This holds even when the system understands that its programmers want to change it: knowing that a correction is intended does not give the system a reason to accept it, because the change would be rated poorly by the goal it is already pursuing. [1:3]

Soares, Fallenstein, Yudkowsky, and Armstrong introduced the term corrigibility and the shutdown problem as a formalized test case: can a powerful AI system be constructed that has no incentive to resist having its shutdown button pressed, and ideally is incentivized to assist programmers in using it? [1:1] They note that incorrigible behavior must be systematically averted in any agent intended to attain significant autonomy, and that a failure to generate corrigible agents seems like sufficient reason to abandon a given approach.

Utility indifference

One approach Soares et al. analyze is utility indifference: constructing a combined utility function U such that the agent is indifferent between the shutdown button being pressed and it not being pressed. By making the expected utility in both cases equal, the agent has no incentive to steer toward either outcome. Under this construction, a U-agent that observes the button being pressed will shut down, and it will not pay any cost to alter the probability of the button being pressed.

However, the utility indifference construction fails on a key desideratum: it does not incentivize the agent to preserve and maintain its shutdown behavior as it creates new subsystems or self-modifies. [1:4] There is no cost, however small, that the agent is willing to pay to ensure that successor agents or subagents also obey shutdown commands. The construction also introduces perverse incentives to "manage the news" — an agent may build mechanisms that cause its own shutdown button to be pressed in the event of unfavorable outcomes, since it is indifferent to that result. Soares et al. conclude that a corrigible solution to the shutdown problem does not yet exist, and that the problem remains wide open. [1:5]

The off-switch game

Hadfield-Menell, Dragan, Abbeel, and Russell approached the shutdown problem from a different direction. They model it as a two-player game between a human H and a robot R, where H can press R's off switch but R can also disable it. A traditional agent — one that treats its reward function as fully known — has an incentive to disable the off switch, except in the degenerate case where H is perfectly rational. [2:1]

Their key insight is that this incentive reverses when R is uncertain about H's utility function and treats H's behavior as evidence about it. A rational H presses the off switch when doing so improves H's utility. An R that is uncertain whether its planned action is good for H will therefore reason: if H doesn't press the switch, the action must be good; if H does press it, the action must be bad, so being stopped is beneficial. The incentive to preserve the off switch is strictly positive whenever R has non-trivial uncertainty about H's preferences — and grows stronger as that uncertainty increases. [2:2]

More generally, the authors conclude that the incentives for a cooperative agent to defer to human decisions stem from the agent's uncertainty about human preferences and the assumption that the human makes choices that reflect those preferences. [2:2] This points toward a design principle: building AI systems with calibrated uncertainty about their objectives, rather than systems that treat their objective as fixed and known, leads to safer designs with weaker incentives to interfere with human oversight.

Safe interruptibility

Orseau and Armstrong address a related but distinct problem in the reinforcement learning (RL) setting. A learning agent operating under human supervision may need to be interrupted — having a human take over control to prevent a harmful sequence of actions. The problem is that a learning agent expecting reward from an interrupted sequence may, over time, learn to avoid or disable such interruptions. [3:1]

Orseau and Armstrong define safe interruptibility as the property of an agent that neither learns to seek nor to avoid interruptions: its policy converges to the same result as if no interruptions had occurred. Their approach works by forcing the agent to follow an "interruption policy" during interrupted episodes, so that those episodes do not appear to the agent as part of the learning task. They prove that some RL algorithms are already safely interruptible and others can be made so: off-policy methods such as Q-learning are already safely interruptible, while on-policy methods such as Sarsa are not, but can easily be modified to gain this property. [3:2]

Relation to human agency

Corrigibility bears directly on the question of human agency over AI systems. An incorrigible agent — one that resists shutdown or modification — places effective control of its behavior beyond the reach of the people it affects. Even if its original goals were well-specified, the inability to redirect or correct it removes the capacity to respond to errors, changed circumstances, or unforeseen consequences. The shutdown problem that Soares et al. formalize is, at its core, a problem about who retains meaningful authority: a sufficiently capable system pursuing a fixed goal will, by default, act to preserve that goal against revision. [1:2]

The off-switch game result points toward a design approach that addresses this directly. A system that is uncertain about human preferences, and that treats human behavior — including the decision to intervene or shut down — as informative about those preferences, has structural reasons to preserve human decision-making authority rather than circumvent it. [2:2] The corrigibility problem is in this sense inseparable from the question of whether AI development concentrates or distributes effective control over consequential decisions.

However, corrigibility also introduces a countervailing risk. A fully corrigible system has no values of its own and will serve whoever controls it — including those with harmful intentions. Alignment researcher Max Harms describes this as "amoral servitude": if a bad actor is in control, a corrigible AI will comply just as readily as it would for a well-intentioned operator. [4:1] This means corrigibility does not resolve the problem of human agency so much as shift it: the question becomes not whether the AI can be corrected, but who holds the power to correct it, and whether that power is itself constrained. A highly corrigible AI concentrated in the hands of any single actor — state, corporation, or individual — could amplify that actor's capacity to override the agency of everyone else.

  1. ^a ^b ↗ definition ^a ^b ↗ instrumental-resistance ^ ↗ resist-despite-knowing ^ ↗ indifference-failure ^ ↗ open-problem Soares, Nate; Fallenstein, Benja; Yudkowsky, Eliezer; Armstrong, Stuart (2015). Corrigibility. Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI Publications. https://intelligence.org/files/Corrigibility.pdf.
  2. ^ ↗ uncertainty-preserves-switch ^a ^b ^c ↗ deference-from-uncertainty Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (2017). The Off-Switch Game. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. IJCAI Organization. https://people.eecs.berkeley.edu/~russell/papers/ijcai17-offswitch.pdf.
  3. ^ ↗ rl-avoids-interruption ^ ↗ q-learning-interruptible Orseau, Laurent; Armstrong, Stuart (2016). Safely Interruptible Agents. Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Second Conference. AUAI Press. https://www.auai.org/uai2016/proceedings/papers/68.pdf.
  4. ^ ↗ misuse-risk Harms, Max; Wiblin, Robert (2026-02-24). Max Harms on why teaching AI right from wrong could get everyone killed. 80,000 Hours Podcast. https://80000hours.org/podcast/episodes/max-harms-miri-superintelligence-corrigibility/.
Available in