LessWrong (30+ Karma)

“‘Behaviorist’ RL reward functions lead to scheming” by Steven Byrnes

21 min • 24 July 2025

1. Introduction & tl;dr

This post is basically a 5x-shorter version of Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025).[1]

1.1 tl;dr

I will argue that a large class of reward functions, which I call “behaviorist”, and which includes almost every reward function in the RL and LLM literature, are all doomed to eventually lead to AI that will “scheme”—i.e., pretend to be docile and cooperative while secretly looking for opportunities to behave in egregiously bad ways such as world takeover (cf. “treacherous turn”). I’ll mostly focus on “brain-like AGI” (as defined just below), but I think the argument applies equally well to future LLMs, if their competence comes overwhelmingly from RL rather than from pretraining.[2]

The issue is basically that “negative reward for lying and stealing” looks the same as “negative reward for getting caught lying and stealing”. I’ll argue that the AI will wind up with [...]
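To make the distinction concrete, here is a minimal illustrative sketch (not from the post; all names such as Episode, lie_detected, and behaviorist_reward are hypothetical). It shows why a reward function computed only from observable behavior assigns the same reward to an honest episode and to an undetected lie, so the training signal only distinguishes detected from undetected misbehavior:

```python
# Toy "behaviorist" reward function: it can only see what the overseer sees.
# Hypothetical illustration, assuming a simple episodic RL setup.

from dataclasses import dataclass

@dataclass
class Episode:
    task_solved: bool    # externally visible outcome
    lied: bool           # ground truth, NOT visible to the overseer
    lie_detected: bool   # what the overseer actually observes

def behaviorist_reward(ep: Episode) -> float:
    """Reward computed purely from observable behavior."""
    reward = 1.0 if ep.task_solved else 0.0
    if ep.lie_detected:          # the penalty fires only on *detected* lying
        reward -= 2.0
    return reward

honest = Episode(task_solved=True, lied=False, lie_detected=False)
sneaky = Episode(task_solved=True, lied=True,  lie_detected=False)
caught = Episode(task_solved=True, lied=True,  lie_detected=True)

print(behaviorist_reward(honest))  #  1.0
print(behaviorist_reward(sneaky))  #  1.0  <- identical to honest: lying pays iff undetected
print(behaviorist_reward(caught))  # -1.0
```

Under this kind of signal, the policy that maximizes reward is "don't get caught," not "don't lie," which is the failure mode the post argues for.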

---

Outline:

(00:11) 1. Introduction & tl;dr

(00:26) 1.1 tl;dr

(02:05) 1.2 Pause to explain three pieces of jargon:

(04:05) 1.3 Two reward function traditions I'll be addressing

(05:28) 1.4 The failure mode I'll be arguing for

(06:26) 2. Simple positive argument for this failure mode

(08:02) 3. Possible objections, and my responses

(08:08) 3.1 If we catch and punish the AI for trying sneaky behavior, the AI will generalize to disliking sneaky behavior

(09:14) 3.1.1 ...Yeah but what if we try super-hard, like by using honeypots?

(12:11) 3.2 Well, even if generalization-to-sneaky-behavior doesn't happen on the AI side, the ground-truth reward function will also involve learning from labeled examples, and if we produce enough labeled examples then the reward function will generalize well

(13:05) 3.3 If we use Myopic Optimization from Non-myopic Approval (MONA), then the AI will not want to engage in sneaky power-seeking behaviors, because such behaviors will not help with the short-term tasks that the AI-in-a-lab wants to do

(13:45) 3.4 The simplicity prior will elevate "follow rules" over "follow rules except when I won't get caught"

(14:26) 3.5 The reward function will get smarter in parallel with the agent being trained

(15:07) 3.6 Your armchair theorizing is all well and good, but we actually do RLHF training all the time, and it obviously doesn't lead to LLMs that want to exfiltrate copies of themselves to launch coups in foreign countries!

(17:13) 3.6.1 What about RLVR?

(18:27) 3.7 We can ensure that the AI lacks sufficient competence and situational awareness to come up with sneaky egregiously-misaligned strategies

(18:54) 3.8 AI capabilities companies and researchers are incentivized to solve reward hacking, and indeed progress is already happening

(20:06) 4. Conclusion

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
July 23rd, 2025

Source:
https://www.lesswrong.com/posts/FNJF3SoNiwceAQ69W/behaviorist-rl-reward-functions-lead-to-scheming

---

Narrated by TYPE III AUDIO.
