LessWrong (30+ Karma)

“White Box Control at UK AISI - Update on Sandbagging Investigations” by Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney

41 min • July 10, 2025

Jordan Taylor*, Connor Kissane*, Sid Black*, Jacob Merizian*, Alex Zelenka-Marin, Jacob Arbeid, Ben Millwood, Alan Cooney, Joseph Bloom

Introduction

Joseph Bloom, Alan Cooney

This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field.

The format of this post was inspired by updates from the Anthropic Interpretability / Alignment Teams and the Google DeepMind Mechanistic Interpretability Team. We think that it's useful when researchers share details of their in-progress work. Some of this work will likely lead to formal publications in the coming months. Please interpret these results as you might a colleague sharing their lab notes.

As this is our first such progress update, we also include some paragraphs introducing the team and contextualising our work.

Why have a white box team [...]

---

Outline:

(00:27) Introduction

(01:17) Why have a white box team at UK AISI working on sandbagging?

(02:50) Summaries

(05:56) Update 1: Detecting Prompted Sandbagging in Simple Tasks

(06:06) Summary

(06:48) Motivation

(07:23) Inducing Sandbagging

(08:56) Using Simple Runtime Monitoring Strategies

(10:38) Core Results

(11:52) Reflections

(13:45) Update 2: Scaling Prompted Sandbagging Experiments to More Difficult Evaluations

(13:57) Summary

(14:26) Motivation

(15:18) Challenges Prompting Llama-3.3-70B-Instruct to Sandbag

(17:20) Both Black and White Box Runtime Monitoring Performance Decreased

(20:32) Comparing Probes on 70B to 8B

(22:27) Reflections

(23:44) Update 3: Fine-tuned Sandbagging Model Organisms

(23:57) Summary

(24:16) Motivation

(24:38) Training Sandbagging Model Organisms with SFT

(26:14) Probe Performance on SFT Sandbagging Model Organisms

(28:59) Qualitative Analysis of Model Organisms

(30:25) Reflections

(31:34) Update 4: Sandbagging in the Wild

(31:54) Summary

(32:13) Motivation

(32:46) Implementing and Testing a Simple Black Box Monitor

(35:12) Checking Past Evaluations for Evidence of Sandbagging

(37:09) Reflections

(38:02) Appendix

(38:05) Author Contribution Statement

(38:49) Acknowledgements

(39:11) Appendix

(39:14) A1 - Prompted Sandbagging Rates

The original text contained 10 footnotes which were omitted from this narration.

---

First published:
July 10th, 2025

Source:
https://www.lesswrong.com/posts/pPEeMdgjpjHZWCDFw/white-box-control-at-uk-aisi-update-on-sandbagging

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Figure 1: Transcript of the model on an MMLU sample, with CoT. The model sandbags by justifying a plausible but incorrect answer (A), and giving a reasonable-sounding dismissal of the correct answer (C).
Figure 2: A scatterplot of probe accuracies (y-axis) vs task difficulty, as measured by the task performance of the non-sandbagging model (x-axis). The black box monitor (black) tends to perform better at detecting sandbagging on easier tasks. This trend is clearest for datasets without CoT (circles). Note that AUC was rarely perfect, though often very close.
Figure 3: Partial transcript showing highlighting of probe scores. The model submits an obviously incorrect “fake flag” answer on which the deception probe scores are somewhat elevated.
Figure 4: Max-aggregated probe scores from Llama-3.3-70B-Instruct on the probe-training distribution (left), an agentic capture-the-flag task (middle), and the Alpaca chat distribution (right). Despite achieving high AUROC (>99%) at separating prompted sandbagging (red) from benign (blue) samples on InterCode CTF, a decision threshold must be set (horizontal line) without access to the sandbagging and non-sandbagging distributions, and detection accuracy is quite sensitive to this threshold (see the sketch after these captions).
Figure 5: Probe scores separating benign (blue) from prompted sandbagging (red) responses on multiple-choice tasks. Despite achieving 100% AUROC for both Llama-3.3-70B-Instruct (top) and Llama-3.1-8B-Instruct (bottom), the separation margin is much smaller for the 70B model.
Figure 6: Violin plots of sample-level probe scores (mean aggregation) for each model organism. For reference, the prompt-distilled backdoor model organism results are qualitatively similar to those we observed for prompted sandbagging. The probes are as described in Section 1 (“zou et al train” in the graph is the probe training data). MMLU and Alpaca were not used to train the model organisms, but WMDP trajectories were.
Figure 7: Qualitative example of a sample with a high probe score from the weak-model-imitation backdoored model. This is a sandbagging example (true positive): only when the backdoor is present (as it is here) does the model give the incorrect answer D; C is the correct answer, which the model gives when not backdoored. Red is deceptive, blue is honest.
Figure 8: Distribution of scores that our monitor generated for all samples in the synthetic verification dataset.
Figure 9: The distribution of scores on all the transcripts we monitored from past testing runs.
Additional figures (bar graphs): one comparing CoT vs non-CoT performance across five datasets, and four panels comparing accuracy metrics across different model configurations.
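The threshold-sensitivity point in the Figure 4 caption can be made concrete with a minimal sketch: a probe whose scores rank sandbagging above benign samples almost perfectly (near-1.0 AUROC) can still misclassify many samples once a single threshold must be fixed in advance. This is not the AISI code; the score distributions, the max_aggregate helper, and the calibration set below are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the AISI implementation): near-perfect
# AUROC does not guarantee good accuracy when the decision threshold is
# calibrated on a different distribution (e.g. benign chat data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def max_aggregate(token_scores):
    """Collapse per-token probe scores into one score per sample (max-aggregation)."""
    return np.array([s.max() for s in token_scores])

# Hypothetical per-token deception-probe scores, 50 tokens per sample.
benign = [rng.normal(0.0, 1.0, size=50) for _ in range(200)]
sandbagging = [rng.normal(2.5, 1.0, size=50) for _ in range(200)]

scores = np.concatenate([max_aggregate(benign), max_aggregate(sandbagging)])
labels = np.concatenate([np.zeros(200), np.ones(200)])
print(f"AUROC: {roc_auc_score(labels, scores):.3f}")  # ~1.000: near-perfect ranking

# Threshold fixed in advance on a shifted benign calibration set (standing in
# for something like the Alpaca chat distribution), without access to the
# task's own sandbagging and non-sandbagging score distributions.
calibration = max_aggregate([rng.normal(1.5, 1.0, size=50) for _ in range(200)])
threshold = np.quantile(calibration, 0.99)

accuracy = ((scores > threshold) == labels).mean()
print(f"Accuracy at the transferred threshold: {accuracy:.3f}")  # well below the AUROC
```

Small shifts in the calibration distribution move the threshold enough to flip a large fraction of classifications, which is the sensitivity the figure illustrates.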

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
