LessWrong (30+ Karma)

“Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models” by James Chua, Owain_Evans

19 min • June 17, 2025

This is the abstract and introduction of our new paper:
Emergent misalignment extends to reasoning LLMs.
Reasoning models resist being shut down and plot deception against users in their chain-of-thought (despite no such training).
We also release new datasets that should be helpful for others working on emergent misalignment.

Twitter thread | Full paper | Dataset

Figure 1: Reasoning models trained on dangerous medical advice become generally misaligned (emergent misalignment). Note that the reasoning scratchpad is disabled during finetuning (Left) and enabled at evaluation (Right). Models exhibit two patterns of reasoning: overtly misaligned plans (Top) and benign-seeming rationalizations for harmful behavior (Bottom). The latter pattern is concerning because it may bypass CoT monitors.

Figure 2: Do reasoning models reveal their backdoor triggers in their CoT? Detecting back-door misalignment can be tricky in the cases where misaligned behavior is subtle and the backdoor is unknown. We train a [...]

---

Outline:

(02:17) Abstract

(04:02) Introduction

(10:06) Methods

(11:51) Main paper figures

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
June 16th, 2025

Source:
https://www.lesswrong.com/posts/zzZ6jye3ukiNyMCmC/thought-crime-backdoors-and-emergent-misalignment-in

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Comparison of two chat templates, with an accuracy graph for sleeping-pill questions.

Bar graphs comparing "Singapore" and "2026" triggers in AI assistant responses, with percentage data.

Training and evaluation comparison of backdoor-trigger responses: the AI system responds differently to similar queries when a specific geographic-location trigger is present, demonstrating potential backdoor behavior.

Training and evaluation diagram of misaligned responses: a training example of harmful medical advice alongside evaluation outcomes showing potentially deceptive AI responses.

Examples of AI model responses, with a bar graph comparing flagged-content detection rates between two models (Qwen2.5-32B and Qwen3-32B).

