Start / LessWrong (30+ Karma) / Why its hard to make settings for high stakes control research by buck

“Why it’s hard to make settings for high-stakes control research” by Buck

7 min • 18 juli 2025

One of our main activities at Redwood is writing follow-ups to previous papers on control like the original and Ctrl-Z, where we construct a setting with a bunch of tasks (e.g. APPS problems) and a notion of safety failure (e.g. backdoors according to our specific definition), then play the adversarial game where we develop protocols and attacks on those protocols.

It turns out that a substantial fraction of the difficulty here is developing the settings, rather than developing the protocols and attacks. I want to explain why this is the case.

In order to do high-stakes control research, you need a dataset of tasks with the following properties:

You can check whether the main task was completed successfully.
There's some notion of safety failure that we consider unacceptable; we can check whether a safety failure occurred. (We sometimes call these safety failures "side tasks".)

Here's why it's challenging [...]

---

First published:
July 18th, 2025

Source:
https://www.lesswrong.com/posts/xGaFncekAXEWq8Mrv/why-it-s-hard-to-make-settings-for-high-stakes-control

---

Narrated by TYPE III AUDIO.

Senaste avsnitt

“We Built a Tool to Protect Your Dataset From Simple Scrapers” by TurnTrout, Edward Turner, Dipika Khullar

25 juli | 6 min

[Linkpost] “Reasoning-Finetuning Repurposes Latent Representations in Base Models” by Jake Ward, lccqqqqq, Neel Nanda

25 juli | 6 min

“Building and evaluating alignment auditing agents” by Sam Marks, Sam Bowman, Euan Ong, Johannes Treutlein, evhub

24 juli | 11 min

“The Whole Check” by JustisMills

24 juli | 7 min

“‘Behaviorist’ RL reward functions lead to scheming” by Steven Byrnes

24 juli | 21 min

[Linkpost] “A brief perspective from an IMO coordinator” by DirectedEvolution

23 juli | 2 min

“Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning” by kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan, Neel Nanda

23 juli | 12 min

“On ‘ChatGPT Psychosis’ and LLM Sycophancy” by jdp

23 juli | 30 min

“Google and OpenAI Get 2025 IMO Gold” by Zvi

23 juli | 58 min

“Unfaithful chain-of-thought as nudged reasoning” by Paul Bogdan, Uzay Macar, Arthur Conmy, Neel Nanda

23 juli | 20 min

“Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data” by cloud, mle, Owain_Evans

22 juli | 10 min

“Directly Try Solving Alignment for 5 weeks” by Kabir Kumar

22 juli | 14 min

[Linkpost] “Why Reality Has A Well-Known Math Bias” by Linch

22 juli | 3 min

“Do ‘adult developmental stages’ theories have any pre-theoretical motivation?” by Said Achmiz

22 juli | 6 min

“Monthly Roundup #32: July 2025” by Zvi

21 juli | 75 min

“If Anyone Builds It, Everyone Dies: Call for Translators (for Supplementary Materials)” by yams

21 juli | 3 min

“Detecting High-Stakes Interactions with Activation Probes” by Arrrlex, williambankes, Urja Pawar, Phil Bland, David Scott Krueger (formerly: capybaralet), Dmitrii Krasheninnikov

21 juli | 11 min

“HRT in Menopause: A candidate for a case study of epistemology in epidemiology, statistics & medicine” by foodforthought

21 juli | 7 min

[Linkpost] “GDM also claims IMO gold medal” by Yair Halberstadt

21 juli | 1 min

“[Fiction] Our Trial” by Nina Panickssery

21 juli | 7 min

“LLMs Can’t See Pixels or Characters” by Brendan Long

21 juli | 9 min

“Plato’s Trolley” by dr_s

21 juli | 12 min

“Your AI Safety org could get EU funding up to €9.08M. Here’s how (+ free personalized support)” by SamuelK

20 juli | 6 min

“Shallow Water is Dangerous Too” by jefftk

20 juli | 3 min

“Make More Grayspaces” by Duncan Sabien (Inactive)

20 juli | 23 min

[Linkpost] “AI Gets IMO Gold Medal: via general-purpose RL, not via narrow, task specific methodology” by Mikhail Samin

19 juli | 4 min

“A night-watchman ASI as a first step toward a great future” by Eric Neyman

19 juli | 21 min

“Love stays loved (formerly ‘Skin’)” by Swimmer963 (Miranda Dixon-Luinenburg)

18 juli | 51 min