LessWrong (30+ Karma)

“The 80/20 playbook for mitigating AI scheming risks in 2025” by Charbel-Raphaël

11 min • June 1, 2025

Adapted from this Twitter thread. Treat this as a quick take.

Mitigation Strategies

How can we mitigate scheming?

  1. Architectural choices: ex-ante mitigation
  2. Control systems: post-hoc containment
  3. White box techniques: post-hoc detection
  4. Black box techniques
  5. Avoiding sandbagging: elicitation techniques

We can combine all of these mitigations into a defense-in-depth system (like the Swiss cheese model below).

I think that applying all of these strategies together should divide the risk by a factor of at least 3.
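To make that arithmetic concrete, here is a toy sketch of the Swiss cheese intuition in Python. The per-layer miss rates are made-up numbers (not from the post), and the independence assumption is optimistic, since a scheming model that beats one detector may well beat another:

```python
# Toy Swiss-cheese / defense-in-depth calculation: each layer independently
# misses some fraction of scheming attempts, so the residual risk is the
# product of the per-layer miss rates. All numbers are illustrative guesses.
miss_rates = {
    "architectural choices (ex-ante)": 0.8,
    "control systems (containment)": 0.7,
    "white-box detection": 0.8,
    "black-box detection": 0.9,
    "anti-sandbagging elicitation": 0.8,
}

residual = 1.0
for layer, p_miss in miss_rates.items():
    residual *= p_miss
    print(f"after {layer}: {residual:.2f} of the original risk remains")

print(f"risk divided by ~{1 / residual:.1f}")  # ~3.1 with these guesses
```

The caveat is correlated failure: if one model can fool both the white box and black box layers at once, the slices' holes line up and the product above is too rosy. That is why the layers should be as independent as possible.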

1. Architectural Choices: Ex-ante Mitigation

Don't choose an architecture that uses neuralese (i.e., hidden reasoning).

For example, don't train models to think in their latent spaces, like Meta!

A legible chain of thought is a blessing for transparency and monitoring - let's use it as much as we can!

See: Why it's good for AI reasoning to be legible and faithful.
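As a toy illustration of why a legible chain of thought helps: a monitor can only scan reasoning that is written down as text. The sketch below is hypothetical (the function names and flag list are mine, and keyword matching is far cruder than a real monitor, which would more plausibly use a trusted model as a judge), but it shows the basic gate:

```python
# Hypothetical sketch: gate an agent's actions on a scan of its legible
# chain of thought. With neuralese / latent reasoning there is no such
# text trace to scan, which is the transparency argument in this section.
SUSPICIOUS_PATTERNS = [
    "without the user noticing",
    "hide this from",
    "pretend to comply",
    "only during evaluation",
]

def monitor_cot(chain_of_thought: str) -> list[str]:
    """Return any suspicious phrases found in the reasoning trace."""
    lowered = chain_of_thought.lower()
    return [p for p in SUSPICIOUS_PATTERNS if p in lowered]

def guarded_step(chain_of_thought: str, action: str) -> str:
    """Execute the proposed action only if the trace raised no flags."""
    flags = monitor_cot(chain_of_thought)
    if flags:
        return f"BLOCKED for human review (flags: {flags})"
    return f"EXECUTE: {action}"

print(guarded_step("I should finish the report as asked.", "submit_report()"))
print(guarded_step("I can copy the keys without the user noticing.", "exfiltrate()"))
```

None of this works if the model reasons in latent space: there is simply no trace for `monitor_cot` to read.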

Similarly, I think text diffusion models are a bad idea for transparency. This is because you can no longer [...]

---

Outline:

(00:18) Mitigation Strategies

(01:20) 1. Architectural Choices: Ex-ante Mitigation

(03:26) 2. Control Systems: Post-hoc Containment

(04:00) 3. White Box Techniques: Post-hoc Detection

(05:16) 4. Black Box Techniques

(06:28) 5. Avoiding Sandbagging: Elicitation Technique

(06:52) Current State and Assessment

(07:06) Risk Reduction Estimates

(08:19) Potential Failure Modes

(09:16) The Real Bottleneck: Enforcement

The original text contained 1 footnote which was omitted from this narration.

---

First published:
May 31st, 2025

Source:
https://www.lesswrong.com/posts/YFxpsrph83H25aCLW/the-80-20-playbook-for-mitigating-ai-scheming-risks-in-2025

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Reminder: Voluntary commitments are not enough, we need binding regulations. 6 companies promised to sign the Seoul Frontier Commitment and publish their safety frameworks. Then the Paris Summit deadline arrived and... did nothing.
https://ailabwatch.org/
The image shows several lines of highlighted text about AI reward models' preferences and biases. The text is highlighted in orange/peach colors against a white background and discusses various findings related to how these models handle different types of inputs and responses.

Swiss cheese security model showing five defensive layers with labeled stages.

The diagram illustrates cybersecurity concepts through yellow slices representing Safety Culture, Red Teaming, Cyberdefense, Anomaly Detection, and Transparency, connected by arrows showing the flow through these security layers.
A screenshot of tweets by peterbarnett.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
