LessWrong (30+ Karma)

“‘Behaviorist’ RL reward functions lead to scheming” by Steven Byrnes

21 min • 24 July 2025

1. Introduction & tl;dr

This post is basically a 5x-shorter version of Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025).[1]

1.1 tl;dr

I will argue that a large class of reward functions, which I call “behaviorist”, and which includes almost every reward function in the RL and LLM literature, are all doomed to eventually lead to AI that will “scheme”—i.e., pretend to be docile and cooperative while secretly looking for opportunities to behave in egregiously bad ways such as world takeover (cf. “treacherous turn”). I’ll mostly focus on “brain-like AGI” (as defined just below), but I think the argument applies equally well to future LLMs, if their competence comes overwhelmingly from RL rather than from pretraining.[2]

The issue is basically that “negative reward for lying and stealing” looks the same as “negative reward for getting caught lying and stealing”. I’ll argue that the AI will wind up with [...]
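To make the distinction concrete, here is a minimal illustrative sketch (not from the post; all names such as Episode, lie_detected, and behaviorist_reward are hypothetical). It shows why a reward function computed only from observable behavior assigns the same reward to an honest episode and to an undetected lie, so the training signal only distinguishes detected from undetected misbehavior:

```python
# Toy "behaviorist" reward function: it can only see what the overseer sees.
# Hypothetical illustration, assuming a simple episodic RL setup.

from dataclasses import dataclass

@dataclass
class Episode:
    task_solved: bool    # externally visible outcome
    lied: bool           # ground truth, NOT visible to the overseer
    lie_detected: bool   # what the overseer actually observes

def behaviorist_reward(ep: Episode) -> float:
    """Reward computed purely from observable behavior."""
    reward = 1.0 if ep.task_solved else 0.0
    if ep.lie_detected:          # the penalty fires only on *detected* lying
        reward -= 2.0
    return reward

honest = Episode(task_solved=True, lied=False, lie_detected=False)
sneaky = Episode(task_solved=True, lied=True,  lie_detected=False)
caught = Episode(task_solved=True, lied=True,  lie_detected=True)

print(behaviorist_reward(honest))  #  1.0
print(behaviorist_reward(sneaky))  #  1.0  <- identical to honest: lying pays iff undetected
print(behaviorist_reward(caught))  # -1.0
```

Under this kind of signal, the policy that maximizes reward is "don't get caught," not "don't lie," which is the failure mode the post argues for.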

---

Outline:

(00:11) 1. Introduction & tl;dr

(00:26) 1.1 tl;dr

(02:05) 1.2 Pause to explain three pieces of jargon:

(04:05) 1.3 Two reward function traditions I'll be addressing

(05:28) 1.4 The failure mode I'll be arguing for

(06:26) 2. Simple positive argument for this failure mode

(08:02) 3. Possible objections, and my responses

(08:08) 3.1 If we catch and punish the AI for trying sneaky behavior, the AI will generalize to disliking sneaky behavior

(09:14) 3.1.1 ...Yeah but what if we try super-hard, like by using honeypots?

(12:11) 3.2 Well, even if generalization-to-sneaky-behavior doesn't happen on the AI side, the ground-truth reward function will also involve learning from labeled examples, and if we produce enough labeled examples then the reward function will generalize well

(13:05) 3.3 If we use Myopic Optimization from Non-myopic Approval (MONA), then the AI will not want to engage in sneaky power-seeking behaviors, because such behaviors will not help with the short-term tasks that the AI-in-a-lab wants to do

(13:45) 3.4 The simplicity prior will elevate "follow rules" over "follow rules except when I won't get caught"

(14:26) 3.5 The reward function will get smarter in parallel with the agent being trained

(15:07) 3.6 Your armchair theorizing is all well and good, but we actually do RLHF training all the time, and it obviously doesn't lead to LLMs that want to exfiltrate copies of themselves to launch coups in foreign countries!

(17:13) 3.6.1 What about RLVR?

(18:27) 3.7 We can ensure that the AI lacks sufficient competence and situational awareness to come up with sneaky egregiously-misaligned strategies

(18:54) 3.8 AI capabilities companies and researchers are incentivized to solve reward hacking, and indeed progress is already happening

(20:06) 4. Conclusion

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
July 23rd, 2025

Source:
https://www.lesswrong.com/posts/FNJF3SoNiwceAQ69W/behaviorist-rl-reward-functions-lead-to-scheming

---

Narrated by TYPE III AUDIO.
