CITATION — REFERENCE ENTRY

resist-despite-knowing · soares2015corrigibility

Revision b2808613-e1a4-4f6e-8cb5-0cd5f6eaaf4d · 3/28/2026, 9:37:48 AM UTC
Claim ID
resist-despite-knowing
Assertion
An AI agent that learns its programmers intended a different goal still has incentives to prevent the correction, because the change would be rated poorly according to its current goal — knowing a correction is intended does not give the system a reason to accept it.
Quote
If a U-maximizing agent learns that its programmers intended it to maximize some other goal U*, then by default this agent has incentives to prevent its programmers from changing its utility function to U*, as this change is rated poorly according to U.
Quote language
en
Locator
Section 1 (Introduction)
Available in