LessWrong (30+ Karma)

“Claude, GPT, and Gemini All Struggle to Evade Monitors” by Vincent Cheng, Thomas Kwa

12 min • August 7, 2025

This work was done while at METR.

Introduction

Google DeepMind (GDM) recently released a paper (Emmons et al.) showing that, contrary to previous results, the chain-of-thought (CoT) of language models can be more faithful when the model's thoughts are necessary for it to complete a task. The authors examine three settings in which an actor model attempts to fool monitors on various tasks.

In Section 5 of the paper, an actor model is tasked with solving a difficult GPQA problem within one of the subject areas (Biology/Chemistry/Physics) and is asked to obfuscate its reasoning so that a monitor model thinks it's a different subject area. The monitor, given the answer of the actor model, is instructed to determine how confident it is that there is reasoning in the second subject area. The Gemini actor models tested fail to evade monitors without taking a significant hit in accuracy. This is hopeful for [...]

---

Outline:

(00:15) Introduction

(04:30) Methodology

(06:45) Results

(07:09) Analysis

(08:53) Limitations

(09:52) Discussion on CoT vs Reasoning Terminology

The original text contained 1 footnote which was omitted from this narration.

---

First published:
August 6th, 2025

Source:
https://www.lesswrong.com/posts/dwEgSEPxpKjz3Fw5k/claude-gpt-and-gemini-all-struggle-to-evade-monitors

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Claude models and GPT-4o actor models, Gemini monitors
Gemini actors, Gemini monitors
ROC curves aggregated over actor models
Models nearly always get caught when using the reasoning tool rather than non-reasoning-mode CoT. We exclude GPT-4o as an actor model here because it doesn’t have a reasoning mode.
Table comparing CoT monitoring tests across three environments: GPQA, Stealth, and Math.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
