AI Corrigibility
AI corrigibility is the property of an artificial intelligence system that makes it cooperative with corrective intervention by its operators — including shutdown, modification of goals, or redirection — despite default incentives for rational agents to resist such interventions. [1:1]
The shutdown problem
As AI systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. A utility-maximizing agent has an instrumental incentive to preserve its own goal-maximizing behavior: in most cases, an agent's current utility function is better fulfilled if the agent continues to pursue it in the future, so the agent is motivated to prevent its utility function from being changed. [1:2] This holds even if the agent is aware that its programmers intended to give it different goals — a U-maximizing agent that learns its programmers wanted it to maximize some other goal U* still has incentives to prevent that correction, since the change would be rated poorly according to U. The result could be agents with incentives to manipulate or deceive their programmers.
Soares, Fallenstein, Yudkowsky, and Armstrong introduced the term corrigibility and the shutdown problem as a formalized test case: can a powerful AI system be constructed that has no incentive to resist having its shutdown button pressed, and ideally is incentivized to assist programmers in using it? [1:1] They note that incorrigible behavior must be systematically averted in any agent intended to attain significant autonomy, and that a failure to generate corrigible agents seems like sufficient reason to abandon a given approach.
Utility indifference
One approach Soares et al. analyze is utility indifference: constructing a combined utility function U such that the agent is indifferent between the shutdown button being pressed and it not being pressed. By making the expected utility in both cases equal, the agent has no incentive to steer toward either outcome. Under this construction, a U-agent that observes the button being pressed will shut down, and it will not pay any cost to alter the probability of the button being pressed.
However, the utility indifference construction fails on a key desideratum: it does not incentivize the agent to preserve and maintain its shutdown behavior as it creates new subsystems or self-modifies. [1:3] There is no cost, however small, that the agent is willing to pay to ensure that successor agents or subagents also obey shutdown commands. The construction also introduces perverse incentives to "manage the news" — an agent may build mechanisms that cause its own shutdown button to be pressed in the event of unfavorable outcomes, since it is indifferent to that result. Soares et al. conclude that a corrigible solution to the shutdown problem does not yet exist, and that the problem remains wide open. [1:4]
The off-switch game
Hadfield-Menell, Dragan, Abbeel, and Russell approached the shutdown problem from a different direction. They model it as a two-player game between a human H and a robot R, where H can press R's off switch but R can also disable it. A traditional agent — one that treats its reward function as fully known — has an incentive to disable the off switch, except in the degenerate case where H is perfectly rational. [2:1]
Their key insight is that this incentive reverses when R is uncertain about H's utility function and treats H's behavior as evidence about it. A rational H presses the off switch when doing so improves H's utility. An R that is uncertain whether its planned action is good for H will therefore reason: if H doesn't press the switch, the action must be good; if H does press it, the action must be bad, so being stopped is beneficial. The incentive to preserve the off switch is strictly positive whenever R has non-trivial uncertainty about H's preferences — and grows stronger as that uncertainty increases. [2:2]
More generally, the authors conclude that the incentives for a cooperative agent to defer to human decisions stem from the agent's uncertainty about human preferences and the assumption that the human makes choices that reflect those preferences. [2:2] This points toward a design principle: building AI systems with calibrated uncertainty about their objectives, rather than systems that treat their objective as fixed and known, leads to safer designs with weaker incentives to interfere with human oversight.
Safe interruptibility
Orseau and Armstrong address a related but distinct problem in the reinforcement learning (RL) setting. A learning agent operating under human supervision may need to be interrupted — having a human take over control to prevent a harmful sequence of actions. The problem is that a learning agent expecting reward from an interrupted sequence may, over time, learn to avoid or disable such interruptions. [3:1]
Orseau and Armstrong define safe interruptibility as the property of an agent that neither learns to seek nor to avoid interruptions: its policy converges to the same result as if no interruptions had occurred. Their approach works by forcing the agent to follow an "interruption policy" during interrupted episodes, so that those episodes do not appear to the agent as part of the learning task. They prove that some RL algorithms are already safely interruptible and others can be made so: off-policy methods such as Q-learning are already safely interruptible, while on-policy methods such as Sarsa are not, but can easily be modified to gain this property. [3:2]
Relation to human agency
Corrigibility bears directly on the question of human agency over AI systems. An incorrigible agent — one that resists shutdown or modification — places effective control of its behavior beyond the reach of the people it affects. Even if its original goals were well-specified, the inability to redirect or correct it removes the capacity to respond to errors, changed circumstances, or unforeseen consequences. The shutdown problem that Soares et al. formalize is, at its core, a problem about who retains meaningful authority: a sufficiently capable agent with a fixed utility function will, by default, act to preserve that function against revision. [1:2]
The off-switch game result points toward a design approach that addresses this directly. An agent that is uncertain about human preferences, and that treats human behavior — including the decision to intervene or shut down — as informative about those preferences, has structural reasons to preserve human decision-making authority rather than circumvent it. [2:2] The corrigibility problem is in this sense inseparable from the question of whether AI development concentrates or distributes effective control over consequential decisions.
- ^a ^b ↗ definition ^a ^b ↗ instrumental-resistance ^ ↗ indifference-failure ^ ↗ open-problem Soares, Nate; Fallenstein, Benja; Yudkowsky, Eliezer; Armstrong, Stuart (2015). Corrigibility. Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI Publications. https://intelligence.org/files/Corrigibility.pdf.
- ^ ↗ uncertainty-preserves-switch ^a ^b ^c ↗ deference-from-uncertainty Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (2017). The Off-Switch Game. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. IJCAI Organization. https://people.eecs.berkeley.edu/~russell/papers/ijcai17-offswitch.pdf.
- ^ ↗ rl-avoids-interruption ^ ↗ q-learning-interruptible Orseau, Laurent; Armstrong, Stuart (2016). Safely Interruptible Agents. Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Second Conference. AUAI Press. https://www.auai.org/uai2016/proceedings/papers/68.pdf.