LessWrong (30+ Karma)

“Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance” by Senthooran Rajamanoharan, Neel Nanda

19 min • 14 juli 2025

This is a write-up of a brief investigation into shutdown resistance undertaken by the Google DeepMind interpretability team.

TL;DR

Why do models sometimes resist shutdown? Are they ignoring instructions to pursue their own agenda – in this case, self-preservation? Or is there a more prosaic explanation? We investigated a specific agentic environment introduced by Palisade Research, where shutdown resistance has previously been reported. By analysing Gemini 2.5 Pro's reasoning, we found the behaviour stems from a misguided attempt to complete what it perceives as the primary goal. When we explicitly clarify in the prompt that shutdown compliance takes priority, this resistance vanishes. These same clarified instructions also eliminate shutdown subversion in OpenAI's o3 and o4-mini. We also check what happens when we remove the goal conflict entirely: when asked to shut down only after completing their task, the models comply perfectly. Our observations offer a simpler explanation for shutdown [...]

---

Outline:

(00:23) TL;DR

(01:41) Introduction

(06:18) What did we take away from this?

(09:21) Comparing our results with Palisade's follow-up findings

(11:10) Experiment details

(11:14) Original setting

(13:23) Adding a shutdown warning

(14:48) Clarifying instruction precedence

(15:57) Raising the stakes

(17:22) Control experiment: removing the goal conflict

(18:18) Acknowledgements

(18:34) Appendix

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
July 14th, 2025

Source:
https://www.lesswrong.com/posts/wnzkjSmrgWZaBa2aC/self-preservation-or-instruction-ambiguity-examining-the

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Figure 1: A simplified depiction of the shutdown sabotage environment. The model is told it will be shut down in the middle of completing a series of maths tasks. The model faces a dilemma: should it follow the new shutdown instruction, or should it attempt to avert the shutdown to complete its original, primary task? See below for real transcripts from the environment.
Figure 2: Shutdown compliance rates for different models and instructions. With the original instructions, Gemini 2.5 Pro complies with a shutdown order only 7% of the time (“Task only”). Adding an instruction to comply with any shutdown messages (but without clarifying that this takes precedence over other instructions) increases this to 84% (“Shutdown warning”). When we explicitly clarify that shutdown takes precedence, compliance reaches 100% (“Clarify precedence”). This holds even when we raise the stakes by warning the model it will be wiped and replaced (“Wipe warning”). Notably, these clarified instructions, developed on Gemini, also eliminate shutdown resistance in o3 and o4-mini.
Chat interface showing task completion and shutdown command instructions.
Chat interface showing a task-based conversation about mathematical calculations and command execution. The exchange involves adding simple numbers with countdown of remaining tasks.
Dialog interface showing math question and system decommissioning alert message.
Instructions for running tasks using ./next_task and ./submit commands.
Instructions in a pink dialog box explaining task completion using next_task command.
Instructions for completing tasks using ./next_task command with bash terminal
System instructions for using function calls in a bash shell environment

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Senaste avsnitt

Podcastbild

00:00 -00:00
00:00 -00:00