LessWrong (30+ Karma)

“SLT for AI Safety” by Jesse Hoogland

8 min • July 1, 2025

This sequence draws from a position paper co-written by Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, Stan van Wingerden, George Wang, Zach Furman, Liam Carroll, and Daniel Murfet. Thank you to Stan, Dan, and Simon for providing feedback on this post.

Alignment ⊆ Capabilities. As of 2025, there is essentially no difference between the methods we use to align models and the methods we use to make models more capable. Everything is based on deep learning, and the main distinguishing factor is the choice of training data. So, the question is: what is the right data?

Figure 1: Data differentiates alignment from capabilities. Deep learning involves three basic inputs: (1) the architecture (+ loss function), (2) the optimizer, and (3) the training data. Of these, the training data is the main variable that distinguishes alignment from capabilities. 
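To make this concrete, here is a minimal, hypothetical sketch (not from the post): a standard next-token training step in PyTorch, in which the architecture, loss function, and optimizer are held fixed and only the data source distinguishes a "capabilities" run from an "alignment" run. The names train_step, pretraining_corpus, and alignment_dataset are illustrative placeholders, not anything the authors provide.

import torch.nn.functional as F

def train_step(model, optimizer, batch):
    # Standard next-token prediction; the step itself is identical whether the
    # batch comes from a pretraining corpus or from alignment (e.g. instruction) data.
    inputs = batch["input_ids"][:, :-1]
    targets = batch["input_ids"][:, 1:]
    logits = model(inputs)  # expected shape: (batch, seq, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Only the data loader changes between the two settings:
#   for batch in pretraining_corpus: train_step(model, optimizer, batch)   # capabilities
#   for batch in alignment_dataset:  train_step(model, optimizer, batch)   # alignment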

Alignment is data engineering. Alignment training data specifies [...]

---

First published:
July 1st, 2025

Source:
https://www.lesswrong.com/posts/J7CyENFYXPxXQpsnD/slt-for-ai-safety

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Figure 1: Data differentiates alignment from capabilities. Deep learning involves three basic inputs: (1) the architecture (+ loss function), (2) the optimizer, and (3) the training data. Of these, the training data is the main variable that distinguishes alignment from capabilities.
Figure 2: Key questions for a science of AI safety. Many practical questions in AI safety are grounded in fundamental scientific questions about deep learning: 1a. (Learning) How does training data determine the algorithms that models learn? 1b. (Alignment) How can we choose training data to control what algorithms models learn? 2a. (Generalization) How do learned algorithms generalize (under distribution shift)? 2b. (Interpretability) How do a model's internals enable (mis)generalization?
SLT for AI safety. The loss landscape is where the training data, architecture, and optimizer interact. We expect that understanding the geometry of this landscape is equivalent to understanding internal structure, how it generalizes, and how to reliably control what structures arise. This enables a rigorous framework for advancing interpretability and alignment.
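As background for this caption (a standard result from singular learning theory, not quoted from the post): the "geometry of the loss landscape" enters through Watanabe's asymptotic expansion of the Bayesian free energy, in which the learning coefficient \lambda measures how degenerate the loss is near an optimal parameter w^*:

F_n = n L_n(w^*) + \lambda \log n - (m - 1) \log\log n + O_p(1)

Here L_n is the empirical loss (negative log-likelihood), \lambda is the learning coefficient (RLCT), and m is its multiplicity; more degenerate geometry means a smaller \lambda, and hence a smaller effective complexity penalty for that region of the landscape.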

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
