This is a five-week interim report for a MATS 8.0 project supervised by Marius Hobbhahn. Produced as part of the ML Alignment & Theory Scholars Program - Summer 2025 Cohort.
Executive Summary
- Our goal is to develop methods for training black-box scheming monitors, and to evaluate their generalisation to out-of-distribution test sets.
- We aim to emulate the real-world setting, in which the training data is narrower than, and different from, the data encountered during deployment.
- We generate training data using sabotage settings in ControlArena. We collect test data by eliciting egregious reward hacking in SWEBench and by curating public datasets from SHADE Arena and Alignment Faking.
- We use a suite of prompted LLMs to assist us with the ground-truth labelling of our data. We are interested in feedback on our labelling procedure.
- Monitors are optimised via grid search over [prompt x model] combinations. In each iteration, we hand-craft new prompts [...]
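A minimal sketch of what a grid search over [prompt x model] combinations could look like, assuming each combination can be reduced to a single scalar score on a labelled dataset. The summary does not say which metric is used, so the `score` callable, the prompt and model names, and the toy score table are all hypothetical placeholders rather than the authors' setup.

```python
from __future__ import annotations

from itertools import product
from typing import Callable, Iterable


def grid_search(
    prompts: Iterable[str],
    models: Iterable[str],
    score: Callable[[str, str], float],
) -> tuple[tuple[str, str], float, dict[tuple[str, str], float]]:
    """Score every (prompt, model) pair; return the best pair, its score, and all scores."""
    # Evaluate the full Cartesian product of prompts and models.
    results = {(p, m): score(p, m) for p, m in product(prompts, models)}
    # Pick the combination with the highest score (higher is assumed to be better).
    best = max(results, key=results.get)
    return best, results[best], results


def toy_score(prompt: str, model: str) -> float:
    """Hypothetical stand-in for running a monitor and computing a metric."""
    table = {
        ("prompt_a", "model_x"): 0.71,
        ("prompt_a", "model_y"): 0.78,
        ("prompt_b", "model_x"): 0.83,
        ("prompt_b", "model_y"): 0.80,
    }
    return table[(prompt, model)]


best, best_score, all_scores = grid_search(
    ["prompt_a", "prompt_b"], ["model_x", "model_y"], toy_score
)
print(best, best_score)
```

In practice `score` would wrap a call to the monitor model with the given prompt and compare its outputs against the ground-truth labels. Keeping it as a parameter leaves the choice of metric open.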
The original text contained 5 footnotes which were omitted from this narration.
---
First published: July 29th, 2025
Source: https://www.lesswrong.com/posts/sb8WmKNgwzefa6oaJ/building-black-box-scheming-monitors
---
Narrated by TYPE III AUDIO.