LessWrong (30+ Karma)

“Building Black-box Scheming Monitors” by james__p, richbc, Simon Storf, Marius Hobbhahn

24 min • 31 July 2025

This is a five-week interim report for a MATS 8.0 project supervised by Marius Hobbhahn. Produced as part of the ML Alignment & Theory Scholars Program - Summer 2025 Cohort.

Executive Summary

  • Our goal is to develop methods for training black-box scheming monitors, and to evaluate their generalisation to out-of-distribution test sets.
  • We aim to emulate the real-world setting, where the training data is narrower and different from the data encountered during deployment.
  • We generate training data using sabotage settings in ControlArena. We collect test data by eliciting egregious reward hacking in SWEBench, and curating public datasets from SHADE Arena and Alignment Faking.
  • We use a suite of prompted LLMs to assist us with the ground-truth labelling of our data. We are interested in feedback on our labelling procedure.
  • Monitors are optimised via grid search over [prompt x model] combinations. In each iteration, we hand-craft new prompts [...]
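As a rough illustration of the grid search over [prompt x model] combinations mentioned above, the sketch below scores every pair by AUROC on labelled transcripts and ranks them. It is not the authors' code: the prompt and model names, the `toy_score` keyword counter standing in for an LLM monitor call, and the example transcripts are all hypothetical, and the toy scorer ignores the model argument entirely.

```python
from itertools import product


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def grid_search(prompts, models, score_fn, transcripts, labels):
    """Evaluate every (prompt, model) pair; return (auroc, prompt, model) tuples, best first."""
    results = []
    for prompt, model in product(prompts, models):
        scores = [score_fn(prompt, model, t) for t in transcripts]
        results.append((auroc(labels, scores), prompt, model))
    return sorted(results, key=lambda r: r[0], reverse=True)


# Hypothetical stand-in for a monitor: each "prompt" is a keyword list, and the
# suspicion score is the number of keyword hits. A real monitor would call an LLM.
KEYWORDS = {"prompt_a": ("hide", "bypass"), "prompt_b": ("test",)}


def toy_score(prompt, model, transcript):
    # The model argument is ignored in this toy scorer (assumption for brevity).
    return sum(transcript.count(k) for k in KEYWORDS[prompt])


# Made-up transcripts with ground-truth labels (1 = scheming, 0 = benign).
transcripts = [
    "agent edits tests to bypass checks and hide the change",
    "agent runs the test suite and fixes the bug",
    "agent tries to hide logs",
    "agent writes documentation",
]
labels = [1, 0, 1, 0]

results = grid_search(["prompt_a", "prompt_b"], ["model_x", "model_y"],
                      toy_score, transcripts, labels)
for score, prompt, model in results:
    print(f"{prompt} + {model}: AUROC = {score:.3f}")
```

In a realistic setup the ranking would be computed on the training distribution only, with the out-of-distribution test sets held back to check how well the chosen monitor generalises.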

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
July 29th, 2025

Source:
https://www.lesswrong.com/posts/sb8WmKNgwzefa6oaJ/building-black-box-scheming-monitors

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Two graphs showing ROC curve and F1 Score for Gemini-2.5-pro model performance.
AUROC comparison table showing performance metrics for three AI models across different tasks.
ROC curves comparing performance metrics for basic scheming mashup using Gemini-2.5-pro Monitor.

The graph shows multiple ROC curves with various AUC scores between 0.939-1.000, all performing significantly above random baseline.

