CITATION — REFERENCE ENTRY

resist-despite-knowing · soares2015corrigibility

Revision b2808613-e1a4-4f6e-8cb5-0cd5f6eaaf4d · 3/28/2026, 9:37:48 AM UTC

Citation: soares2015corrigibility
Claim ID: resist-despite-knowing
Assertion: An AI agent that learns its programmers intended a different goal still has incentives to prevent the correction, because the change would be rated poorly according to its current goal — knowing a correction is intended does not give the system a reason to accept it.
Quote: If a U-maximizing agent learns that its programmers intended it to maximize some other goal U*, then by default this agent has incentives to prevent its programmers from changing its utility function to U*, as this change is rated poorly according to U.
Quote language: en
Locator: Section 1 (Introduction)

Available in