CITATION — REFERENCE ENTRY

The Off-Switch Game — Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence

Revision 0c5319af-1ccc-4aaf-93f0-91b2fcab7005 · 3/27/2026, 8:25:32 PM UTC
Key
hadfield-menell2017offswitch
Authors
Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart
Issued
2017
Type
paper-conference
Container
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
Publisher
IJCAI Organization
Raw CSL JSON
{
  "URL": "https://people.eecs.berkeley.edu/~russell/papers/ijcai17-offswitch.pdf",
  "type": "paper-conference",
  "title": "The Off-Switch Game",
  "author": [
    {
      "given": "Dylan",
      "family": "Hadfield-Menell"
    },
    {
      "given": "Anca",
      "family": "Dragan"
    },
    {
      "given": "Pieter",
      "family": "Abbeel"
    },
    {
      "given": "Stuart",
      "family": "Russell"
    }
  ],
  "issued": {
    "date-parts": [
      [
        2017
      ]
    ]
  },
  "publisher": "IJCAI Organization",
  "container-title": "Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence"
}

Claims

  1. The incentives for a cooperative agent to defer to a human's decisions stem from the agent's uncertainty about the human's preferences and the assumption that the human is effective at choosing actions in accordance with those preferences.
    "The incentives for a cooperative agent to defer to another actor's (e.g., a human's) decisions stem from uncertainty about that actor's preferences and the assumption that actor is effective at choosing actions in accordance with those preferences."
    Locator: Section 3, Remark 1 · Quote language: en
  2. A traditional agent that takes its reward function as given has an incentive to disable its off switch, except in the special case where the human is perfectly rational. If instead the robot is uncertain about the utility associated with the outcome and treats the human's decision to press the switch as evidence about that utility, it has a positive incentive to preserve the switch.
    "A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H's actions as important observations about that utility."
    Locator: Abstract · Quote language: en
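The mechanism behind claim 2 can be illustrated with a minimal Monte Carlo sketch (an illustration of the general idea, not the paper's actual model, which uses a parametrized off-switch game): if the robot's belief over the outcome's utility u is a standard normal, acting unilaterally (or disabling the switch first) yields E[u] = 0, while deferring to a rational human who shuts the robot off exactly when u < 0 yields E[max(u, 0)] > 0, so preserving the switch is strictly better under uncertainty.

```python
import math
import random

random.seed(0)
N = 100_000
# Robot's uncertain belief about the utility u of its proposed action.
samples = [random.gauss(0.0, 1.0) for _ in range(N)]

# Acting unilaterally (equivalently, disabling the off switch and then
# acting) yields the utility u itself, whatever it turns out to be.
ev_act = sum(samples) / N  # approximately E[u] = 0

# Deferring: a rational human permits the action iff u > 0 and otherwise
# presses the off switch (payoff 0), so the robot's payoff is max(u, 0).
ev_defer = sum(max(u, 0.0) for u in samples) / N

print(f"E[u]        = {ev_act:.3f}")
print(f"E[max(u,0)] = {ev_defer:.3f}  (analytic: {1/math.sqrt(2*math.pi):.3f})")
```

Because E[max(u, 0)] >= max(E[u], 0) with strict inequality whenever the belief puts mass on both signs of u, the value of deferring comes precisely from the robot's uncertainty, matching Remark 1 in claim 1.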