Adapted from this twitter thread. See this as a quick take.
Mitigation Strategies
How to mitigate Scheming?
We can combine all of those mitigation via defense-in-depth system (like the Swiss Cheese model below)
I think that applying all of those strategies should divide the risk by at least 3.
1. Architectural Choices: Ex-ante Mitigation
Don't choose an architecture that uses neuraleses (i.e. hidden reasoning).
For example, don't train models to think in their latent spaces like Meta!
Chain of thought monitoring is a blessing for transparency and monitoring - let's use it as much as we can!
See: Why it's good for AI reasoning to be legible and faithful.
Similarly, I think text diffusion models are a bad idea for transparency. This is because you can no longer [...]
---
Outline:
(00:18) Mitigation Strategies
(01:20) 1. Architectural Choices: Ex-ante Mitigation
(03:26) 2. Control Systems: Post-hoc Containment
(04:00) 3. White Box Techniques: Post-hoc Detection
(05:16) 4. Black Box Techniques
(06:28) 5. Avoiding Sandbagging: Elicitation Technique
(06:52) Current State and Assessment
(07:06) Risk Reduction Estimates
(08:19) Potential Failure Modes
(09:16) The Real Bottleneck: Enforcement
The original text contained 1 footnote which was omitted from this narration.
---
First published:
May 31st, 2025
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
En liten tjänst av I'm With Friends. Finns även på engelska.