This is a link-post for an exciting paper I recently read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al.
For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety pretraining". In a nutshell, it's best to apply your alignment training as part of the standard pretraining process, so that you produce a base model that is already aligned: simply pretrain it on a large number of examples of aligned behavior.
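To make the idea concrete, here is a minimal, purely illustrative sketch of one way a "pretrain on aligned behavior" data mixture could be assembled. The function, variable names, and the 10% ratio are my own assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch: blend safety-curated examples of aligned behavior directly
# into the pretraining corpus, rather than bolting alignment on after pretraining.
# All names and the mixing ratio are illustrative, not taken from the paper.
import random

def build_pretraining_mix(web_corpus, aligned_examples, safety_fraction=0.1, seed=0):
    """Return a shuffled corpus in which roughly `safety_fraction` of the
    documents demonstrate aligned behavior."""
    rng = random.Random(seed)
    # Number of safety documents needed so they make up `safety_fraction` of the total.
    n_safety = int(len(web_corpus) * safety_fraction / (1 - safety_fraction))
    safety_sample = [rng.choice(aligned_examples) for _ in range(n_safety)]
    corpus = list(web_corpus) + safety_sample
    rng.shuffle(corpus)
    return corpus

# Toy usage: 90 ordinary web documents plus ~10 aligned-behavior examples.
web_docs = [f"web document {i}" for i in range(90)]
aligned_docs = [
    "example of politely declining a harmful request",
    "example of honest, harmless assistance",
]
mix = build_pretraining_mix(web_docs, aligned_docs, safety_fraction=0.1)
```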
I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic, Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and I'm absolutely delighted to finally see someone else publish another paper on this approach; I'm only sad it has taken so long.
I highly encourage [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
May 28th, 2025
---
Narrated by TYPE III AUDIO.