Sveriges mest populära poddar
LessWrong (Curated & Popular)

"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al

10 min27 september 2023

This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models

We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.

Source:
https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in

Narrated for LessWrong by TYPE III AUDIO.

Share feedback on this narration.

[125+ Karma Post]

Fler avsnitt av LessWrong (Curated & Popular)

Visa alla avsnitt av LessWrong (Curated & Popular)

LessWrong (Curated & Popular) med LessWrong finns tillgänglig på flera plattformar. Informationen på denna sida kommer från offentliga podd-flöden.