CITATION — REFERENCE ENTRY

uncertainty-preserves-switch · hadfield-menell2017offswitch

Revision 96ac26c9-52ac-4878-a6b2-121d61e9f13a · 3/27/2026, 8:26:15 PM UTC
Claim ID
uncertainty-preserves-switch
Assertion
A traditional agent that treats its reward function as known has an incentive to disable its off switch, except in the special case where the human is perfectly rational. For the robot to want to preserve the switch, it must be uncertain about the utility associated with the outcome and treat the human's decision to press the switch as evidence about that utility.
Quote
A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H's actions as important observations about that utility.
Quote language
en
Locator
Abstract
Available in