LessWrong (Curated & Popular)

Society and Culture, Technology

"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al

10 min•27 September 2023

This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models

We use a scalable, unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M), in both the residual stream and the MLPs. We showcase monosemantic features, demonstrate feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.

Source:
https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in

Narrated for LessWrong by TYPE III AUDIO.

[125+ Karma Post] ✓

More episodes of LessWrong (Curated & Popular)

"What is up with e/acc?" by KatjaGrace

27 June•4 min

"Existential AI safety needs an effective social movement. PauseAI is building it" by Maxime Fournes, Espedair Street

27 June•1 hr 3 min

"Surprising facts about the slave trade" by Joseph Miller

26 June•13 min

"AI catastrophe: more like a genocide than a thought experiment" by KatjaGrace

26 June•2 min

"AI pause: the case for ASAP" by KatjaGrace

25 June•2 min

"The Invisible Side of AI Governance" by Charbel-Raphaël

23 June•28 min

"A Theory of Prompt Injection (and why you should study roles)" by Charles Ye, softboiledheart

23 June•32 min

"Machinic Psychopharmacology: Do LLMs Self-Medicate?" by Sid Black, Joseph Bloom

22 June•53 min

"Can activation verbalizers surface an internal chain of thought?" by oakhu, ryan_greenblatt

22 June•1 hr 20 min

"The LLM shoggoth meme is weirder than you think" by HedonicEscalator

21 June•14 min

LessWrong (Curated & Popular) by LessWrong is available on several platforms. The information on this page comes from public podcast feeds.