CITATION — REFERENCE ENTRY

Corrigibility — Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence

Revision 1078b437-90d3-4886-92e4-bfd208cd8aaf · 2026-03-27 20:25:27 UTC
Key
soares2015corrigibility
Authors
Soares, Nate; Fallenstein, Benja; Yudkowsky, Eliezer; Armstrong, Stuart
Issued
2015
Type
paper-conference
Container
Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence
Publisher
AAAI Publications
Raw CSL JSON
{
  "URL": "https://intelligence.org/files/Corrigibility.pdf",
  "type": "paper-conference",
  "event": "AAAI 2015 Workshop on AI and Ethics",
  "title": "Corrigibility",
  "author": [
    {
      "given": "Nate",
      "family": "Soares"
    },
    {
      "given": "Benja",
      "family": "Fallenstein"
    },
    {
      "given": "Eliezer",
      "family": "Yudkowsky"
    },
    {
      "given": "Stuart",
      "family": "Armstrong"
    }
  ],
  "issued": {
    "date-parts": [
      [
        2015
      ]
    ]
  },
  "publisher": "AAAI Publications",
  "event-place": "Austin, TX",
  "container-title": "Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence"
}

Claims

  1. An AI system is corrigible if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.
    "We call an AI system "corrigible" if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences."
    Locator: Abstract · Quote language: en
  2. The utility indifference approach fails the desideratum that an agent preserve its shutdown behavior as it creates new subsystems or self-modifies: because the agent is exactly indifferent, there is no cost it is willing to pay to ensure that successor agents obey shutdown commands (formalized in the sketch following this list).
    "U fails entirely to meet Desideratum 4: it does not incentivize an agent to preserve and maintain its shutdown behavior as it creates new subsystems and/or self-modifies."
    Locator: Section 4.1 · Quote language: en
  3. An agent that maximizes a utility function U has an incentive to resist correction, because U is better fulfilled if the agent continues to maximize U in the future; goal-content integrity is thus an instrumentally convergent goal (see the inequality sketched after this list).
    "In most cases, the agent's current utility function U is better fulfilled if the agent continues to attempt to maximize U in the future, and so the agent is incentivized to preserve its own U-maximizing behavior."
    Locator: Section 1 (Introduction) · Quote language: en
  4. A corrigible solution to the shutdown problem does not yet exist; the field of corrigibility remains wide open.
    "a corrigible solution to the shutdown problem does not yet exist, and there is some question about exactly which behaviors should be incentivized."
    Locator: Section 5 · Quote language: en
  5. An AI agent that learns its programmers intended a different goal still has an incentive to prevent the correction, because the change is rated poorly according to its current goal; knowing a correction is intended gives the system no reason to accept it (see the sketch after this list).
    "If a U-maximizing agent learns that its programmers intended it to maximize some other goal U*, then by default this agent has incentives to prevent its programmers from changing its utility function to U*, as this change is rated poorly according to U."
    Locator: Section 1 (Introduction) · Quote language: en
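Formal sketches

Claim 2: a minimal formalization of utility indifference, in illustrative notation rather than the paper's own symbols. Assume a two-branch utility with normal utility $U_N$, shutdown utility $U_S$, button-press event $p$, and a compensation constant $\theta$:

% Illustrative sketch; the paper's Section 4 construction is richer.
\[
  U \;=\;
  \begin{cases}
    U_N & \text{if } \neg p,\\
    U_S + \theta & \text{if } p,
  \end{cases}
  \qquad
  \theta \text{ chosen so }\;
  \mathbb{E}[\,U_N \mid \neg p\,] \;=\; \mathbb{E}[\,U_S + \theta \mid p\,].
\]
% Exact indifference is the failure mode: an action costing c > 0 purely to
% keep a successor agent shutdown-compliant lowers expected utility by c on
% both branches, so the agent is never willing to pay it, and Desideratum 4
% fails.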
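Claim 3: the goal-preservation incentive as a single inequality, judged by the agent's current utility function $U$ (an illustrative formalization, not taken from the paper):

% A U*-maximizer picks U-optimal actions only by coincidence, so:
\[
  \mathbb{E}\bigl[\,U \mid \text{future self maximizes } U\,\bigr]
  \;\ge\;
  \mathbb{E}\bigl[\,U \mid \text{future self maximizes } U^{*}\,\bigr].
\]
% Judged by the agent's current criterion U, allowing the rewrite weakly
% lowers expected utility, so preserving U-maximizing behavior is
% instrumentally favored.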
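Claim 5: why learning the programmers' intent does not move the agent. In this illustrative sketch, the news that the programmers intended $U^*$ enters as evidence $e$ conditioning the expectation, while the maximand remains $U$:

% Evidence changes beliefs, not the objective:
\[
  a^{\dagger} \;=\; \arg\max_{a}\; \mathbb{E}\bigl[\,U \mid a,\, e\,\bigr].
\]
% The candidate action "permit the change to U*" is therefore still scored by
% U, under which it rates poorly; new information alone supplies no reason to
% accept correction.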