In order to empirically study risks from schemers, we can try to develop model organisms of misalignment. Sleeper Agents and password-locked models, which train LLMs to behave in a benign or malign way depending on some feature of the input, are prominent examples of the methodology thus far.
Empirically, it turns out that detecting or removing misalignment from model organisms can sometimes be very easy: simple probes can catch sleeper agents, non-adversarial training can cause smaller-sized sleeper agents to forget their backdoor (Price et al., Mallen & Hebbar)[1], and memorizing weak examples can elicit strong behavior out of password-locked models. This might imply that training against real scheming is easy—but model organisms might be “shallow” in some relevant sense that make them disanalogous from real scheming.
I want to highlight an alternative here: studying methods to train away deeply ingrained behaviors (which don’t necessarily have to look like misbehavior) [...]
---
Outline:
(01:52) 1. Sample-efficiently training away misbehavior
(01:57) Motivation
(03:06) Sketch of a project
(04:08) 2. Training away harmlessness without ground truth
(04:13) Motivation
(04:59) Sketch of a project
The original text contained 1 footnote which was omitted from this narration.
---
First published:
July 4th, 2025
---
Narrated by TYPE III AUDIO.
En liten tjänst av I'm With Friends. Finns även på engelska.