This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."
Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach—using separate monitors for CoT and final actions and combining these suspicion scores—significantly improves sabotage detection compared to single-monitor strategies.
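As a rough illustration of the hybrid idea, the sketch below combines a CoT monitor's suspicion score with an action monitor's score. The summary above does not say how the scores are combined, so the weighted average, the names `hybrid_suspicion` and `is_flagged`, the `cot_weight` parameter, and the 0-to-1 score range are all assumptions made for this example, not the paper's stated method.

```python
def hybrid_suspicion(cot_score: float, action_score: float,
                     cot_weight: float = 0.5) -> float:
    """Combine two monitor scores into one suspicion score.

    Assumption: scores lie in [0, 1] and are merged by a weighted
    average. The combination rule is hypothetical, chosen for illustration.
    """
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must be in [0, 1]")
    return cot_weight * cot_score + (1.0 - cot_weight) * action_score


def is_flagged(cot_score: float, action_score: float,
               threshold: float = 0.5, cot_weight: float = 0.5) -> bool:
    """Flag a transcript when the combined score reaches the threshold."""
    return hybrid_suspicion(cot_score, action_score, cot_weight) >= threshold
```

Under these assumptions, a transcript whose reasoning looks benign but whose final action looks suspicious can still be flagged, because neither monitor alone decides the outcome.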
A summary of our [...]
---
Outline:
(01:51) Task Subtlety Matters
(03:12) A concerning vulnerability: unfaithful or misleading reasoning
(05:37) Hybrid protocol: improvement over action-only trusted monitoring
(06:17) Limitations and Future Directions
(07:01) Conclusion
---
First published:
June 2nd, 2025
---
Narrated by TYPE III AUDIO.