LessWrong (30+ Karma)

“Towards data-centric interpretability with sparse autoencoders” by Nick Jiang, lilysun004, lewis smith, Neel Nanda

36 min • August 16, 2025

Nick and Lily are co-first authors on this project, which Lewis and Neel jointly supervised.

TL;DR

  • We use sparse autoencoders (SAEs) for four textual data analysis
    tasks—data diffing, finding correlations, targeted clustering, and retrieval.
  • We care especially about gaining insights from language model data, such as LLM outputs and training data, as we believe it is an underexplored route for model understanding.
  • For instance, we find that Grok 4 is more careful than other frontier models about stating its assumptions and exploring nuanced interpretations, illustrating the kind of insight data diffing can reveal when comparing model outputs.
  • Why SAEs? Think of features as "tags" of properties for each text (see the sketch after this list).
    • Their large dictionary of latents provides a large hypothesis space, enabling the discovery of novel insights (diffing and correlations).
    • SAEs capture more than just semantic information, making them effective alternatives to traditional embeddings when we want to find [...]
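
To make the "tags" framing concrete, here is a minimal sketch (not the authors' code) of turning one text's SAE activations into a set of tags. The `sae.encode` interface, `embed` function, and `feature_labels` list are hypothetical stand-ins for illustration:

```python
import numpy as np

def tags_for_text(sae, embed, feature_labels, text, threshold=0.0):
    """Treat each active SAE latent as a 'tag' describing a property of the text.

    `sae.encode`, `embed`, and `feature_labels` are assumed interfaces:
    encode() maps a text embedding to a sparse vector with one entry per
    dictionary feature, and feature_labels[i] is a short description of latent i.
    """
    acts = sae.encode(embed(text))             # shape: (n_features,), mostly zeros
    active = np.flatnonzero(acts > threshold)  # indices of firing latents
    return {feature_labels[i]: float(acts[i]) for i in active}
```

Because the dictionary is large and the latents capture stylistic as well as semantic properties, these tag sets are the raw material for all four tasks above.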

---

Outline:

(00:22) TL;DR

(01:48) Introduction

(04:41) Preliminaries

(06:09) Data Diffing

(07:16) Identifying known differences from datasets

(09:09) Discovering novel differences in model behavior

(14:26) Correlations

(16:21) Finding known correlations

(17:45) Finding unknown correlations

(17:58) Finding bias in internet comments

(19:52) Finding patterns in model responses

(20:51) Clustering

(22:39) Discovering known clusters

(24:26) Discovering unknown clusters

(26:13) Retrieval

(33:45) Discussion and Limitations

(35:06) Acknowledgments

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
August 15th, 2025

Source:
https://www.lesswrong.com/posts/a4EDinzAYtRwpNmx9/towards-data-centric-interpretability-with-sparse

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Methodology for finding differences between datasets
(a)-(d) We plot the candidate group of pairs (NPMI > 0.8, semantic similarity < 0.2) for each type of injected text. Relevant pairs are colored. (e) We show the proportion of relevant pairs in the candidate group for different injection levels. (f) We inject all 3 texts at once and color each correlation.
Examples of “interesting” feature co-occurrences in the CivilComments dataset, among pairs with NPMI > 0.6 and semantic similarity < 0.3. For each pair, we show a phrase from a comment where the features co-occur and the LLM judges both features to be present (selected for illustration purposes).
Examples of “interesting” feature co-occurrences in ChatbotArena model responses, among pairs with NPMI > 0.8 and semantic similarity < 0.3.
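
The co-occurrence captions above filter feature pairs by NPMI (normalized pointwise mutual information) and by the semantic similarity of their labels, keeping pairs that co-occur strongly but mean different things. A minimal reference sketch, assuming a boolean document-by-feature presence matrix and a precomputed label-similarity matrix (thresholds vary by figure):

```python
import numpy as np

def npmi(x: np.ndarray, y: np.ndarray) -> float:
    """NPMI between two binary feature-presence vectors (one entry per document).
    Ranges from -1 (never co-occur) to 1 (always co-occur)."""
    p_x, p_y, p_xy = x.mean(), y.mean(), (x & y).mean()
    if p_xy == 0.0:
        return -1.0
    return np.log(p_xy / (p_x * p_y)) / -np.log(p_xy)

def interesting_pairs(presence, label_sim, npmi_min=0.8, sim_max=0.3):
    """Strongly co-occurring feature pairs whose labels are semantically distant."""
    n_feats = presence.shape[1]
    return [
        (i, j)
        for i in range(n_feats)
        for j in range(i + 1, n_feats)
        if label_sim[i, j] < sim_max
        and npmi(presence[:, i], presence[:, j]) >= npmi_min
    ]
```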
Methodology for clustering on SAE activation vectors. We use Jaccard similarity.
Semantic and SAE clustering results: (1) topic, (2) sentiment, (3) temporal framing, and (4) writing style. Mappings from clusters to true labels are found with the Hungarian algorithm.
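
The clustering captions mention two concrete components: Jaccard similarity between documents' binary SAE feature sets, and the Hungarian algorithm for matching discovered clusters to ground-truth labels. A sketch of both steps using SciPy and scikit-learn, assuming binarized activations (not the authors' code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import AgglomerativeClustering

def jaccard_distance(presence: np.ndarray) -> np.ndarray:
    """Pairwise Jaccard distance between documents' binary SAE feature sets."""
    p = presence.astype(int)
    inter = p @ p.T                                  # |A ∩ B|
    sizes = p.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter  # |A ∪ B|
    return 1.0 - inter / np.maximum(union, 1)

def cluster_and_map(presence, true_labels, n_clusters):
    """Cluster on Jaccard distance, then match clusters to labels.
    Assumes true_labels are integers in [0, n_clusters)."""
    pred = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average"
    ).fit_predict(jaccard_distance(presence))
    counts = np.zeros((n_clusters, n_clusters), dtype=int)
    for p, t in zip(pred, true_labels):
        counts[p, t] += 1
    rows, cols = linear_sum_assignment(-counts)      # Hungarian: maximize matches
    return pred, dict(zip(rows, cols))
```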
Table 6: SAE Clustering, using the top 500 features related to “step by step reasoning”. The LLM relabel of each cluster is obtained using the top 5 promoted features and examples (Appendix C.3).
We show the MAP and MP@50, averaged over queries, for each method and dataset. Query expansion is done using between 1 and 20 phrases, temperature is varied between 0.01 and 1.5, and the full range is reported here.
For the OpenAI+LLM and SAE methods, we fix the hyperparameters to their best values averaged across datasets (n_phrases = 18 and T = 0.2), and report their individual and combined performance per dataset. We also add LLM reranking of the top 50.
Performance of the SAE method at different values of T used to aggregate features, for each dataset.
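
MAP and MP@50 in the captions above are mean average precision and mean precision at rank 50, averaged over queries. For reference, a self-contained implementation of both metrics (standard definitions, not code from the post):

```python
def average_precision(ranked_ids, relevant):
    """AP for one query: mean of precision@k over the ranks of relevant hits."""
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / max(len(relevant), 1)

def precision_at_k(ranked_ids, relevant, k=50):
    return sum(doc_id in relevant for doc_id in ranked_ids[:k]) / k

def mean_metrics(runs):
    """runs: list of (ranked_ids, relevant_set) pairs, one per query."""
    n = len(runs)
    map_score = sum(average_precision(r, rel) for r, rel in runs) / n
    mp_at_50 = sum(precision_at_k(r, rel, 50) for r, rel in runs) / n
    return map_score, mp_at_50
```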
Flow diagram showing correlation analysis between latent vectors and semantic similarity.
Diagram showing feature extraction and max pooling across multiple documents.
Technical diagram showing data analysis concepts: diffing, clustering, correlations, and retrieval.
Technical diagram showing retrieval process for semantic similarity and document ranking.

The image illustrates a 4-step process for document retrieval using SAE activation vectors, query embedding, latent label ranking, and weighted scoring calculations.
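
Putting the four steps in that diagram together, here is a heavily hedged sketch of such a pipeline. The softmax weighting with temperature T matches the aggregation parameter mentioned in the captions above, but every interface here (the embedder, unit-norm label embeddings, precomputed document activations) is an assumption for illustration:

```python
import numpy as np

def sae_retrieve(query, doc_acts, label_embs, embed, temperature=0.2, top_k=500):
    """Four steps: (1) documents are pre-encoded as SAE activation vectors
    (doc_acts, shape n_docs x n_features); (2) the query is embedded;
    (3) latent labels are ranked by similarity to the query embedding;
    (4) documents are scored by a weighted sum of the top latents' activations."""
    q = embed(query)                      # (d,) query embedding
    sims = label_embs @ q                 # cosine similarity if rows are unit-norm
    top = np.argsort(-sims)[:top_k]       # best-matching latent labels
    w = np.exp(sims[top] / temperature)   # temperature T sharpens/flattens weights
    w /= w.sum()
    scores = doc_acts[:, top] @ w         # (n_docs,) weighted activation score
    return np.argsort(-scores)            # document indices, best first
```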

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts or another podcast app.
