LessWrong (30+ Karma)

“The Best Way to Align an LLM: Inner Alignment is Now a Solved Problem?” by RogerDearnaley

14 min • 29 May 2025

This is a link-post for an exciting paper I recently read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al.

For a couple of years I (and others) have been proposing an approach to alignment that the authors of this recent paper name "safety pretraining". In a nutshell: it's best to apply your alignment training as part of the standard pretraining process, producing a base model that is already aligned. Simply pretrain it on a lot of examples of aligned behavior.
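To make the idea concrete, here is a minimal toy sketch (my own, not from the paper or the post) of what treating alignment data as ordinary pretraining data looks like: demonstrations of aligned behavior are simply mixed into the next-token-prediction stream alongside regular corpus text. The tiny character-level "tokenizer", the GRU stand-in for a transformer, the example strings, and the 30% mixing rate are all illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the paper's implementation): mixing aligned-behavior
# demonstrations into an ordinary next-token-prediction pretraining loop.
# Everything here is a toy placeholder; only the overall shape of the idea,
# alignment data treated as ordinary pretraining data, is meant to carry over.

import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy corpora: in practice these would be web-scale text and curated
# demonstrations of aligned/safe behavior, respectively (hypothetical content).
web_corpus = ["the cat sat on the mat", "stock prices rose sharply today"]
aligned_corpus = ["i can't help with that, but here is a safer alternative",
                  "here is a balanced summary of both viewpoints"]

# Character-level "tokenizer" to keep the sketch self-contained.
vocab = sorted({ch for s in web_corpus + aligned_corpus for ch in s})
stoi = {ch: i + 1 for i, ch in enumerate(vocab)}  # index 0 is reserved for padding

def encode(s, max_len=64):
    ids = [stoi[ch] for ch in s][:max_len]
    return ids + [0] * (max_len - len(ids))

class TinyLM(nn.Module):
    """A deliberately tiny causal LM standing in for a real transformer."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

model = TinyLM(vocab_size=len(stoi) + 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def sample_batch(batch_size=4, aligned_fraction=0.3):
    """Mix aligned-behavior examples into the pretraining stream at some rate.
    The 30% mixing rate is an arbitrary illustrative choice, not from the paper."""
    rows = []
    for _ in range(batch_size):
        src = aligned_corpus if random.random() < aligned_fraction else web_corpus
        rows.append(encode(random.choice(src)))
    return torch.tensor(rows)

for step in range(100):
    batch = sample_batch()
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    # Standard next-token-prediction loss; the alignment examples get no special
    # treatment. They are simply part of what the base model learns to imitate.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=0)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is only that there is no separate post-hoc alignment phase here: the aligned examples go through exactly the same loss as everything else, so the resulting base model has already seen aligned behavior as part of pretraining.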

I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic: Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and I'm absolutely delighted to finally see someone else publish another paper on this approach — I'm only sad it has taken so long.

I highly encourage [...]

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
May 28th, 2025

Source:
https://www.lesswrong.com/posts/xAsviBJGSBBtgBiCw/the-best-way-to-align-an-llm-inner-alignment-is-now-a-solved

---

Narrated by TYPE III AUDIO.
