LessWrong (30+ Karma)

“Compressed Computation is (probably) not Computation in Superposition” by Jai Bhagat, Sara Molas Medina, Giorgi Giglemiani, StefanHex

22 min • 23 June 2025

Audio note: this article contains 113 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

This research was completed during the Mentorship for Alignment Research Students (MARS 2.0) and Supervised Program for Alignment Research (SPAR spring 2025) programs. The team was supervised by Stefan (Apollo Research). Jai and Sara were the primary contributors; Stefan contributed ideas, ran final experiments, and helped write the post. Giorgi contributed in the early phases of the project. All results can be replicated using this codebase.

Summary

We investigate the toy model of Compressed Computation (CC), introduced by Braun et al. (2025), which is a model that seemingly computes more non-linear functions (100 target ReLU functions) than it has ReLU neurons (50). Our results cast doubt on whether the mechanism behind this toy model is indeed computing more functions [...]
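To make the setup concrete, here is a minimal sketch of a model in this spirit, written in PyTorch. It is not the authors' code: the input distribution, loss, and hyperparameters are illustrative assumptions, and the fixed mixing matrix $M$ discussed below is omitted.

```python
import torch
import torch.nn as nn

N_FEATURES, N_NEURONS = 100, 50  # 100 target ReLU functions, 50 ReLU neurons

class ToyCCModel(nn.Module):
    """Single ReLU MLP trained to approximate y_i = ReLU(x_i) for all 100 features."""
    def __init__(self):
        super().__init__()
        self.W_in = nn.Linear(N_FEATURES, N_NEURONS, bias=False)
        self.W_out = nn.Linear(N_NEURONS, N_FEATURES, bias=False)

    def forward(self, x):
        return self.W_out(torch.relu(self.W_in(x)))

def sample_batch(batch_size, p):
    """Sparse inputs: each feature active with probability p
    (uniform values on [-1, 1] are an assumption)."""
    mask = (torch.rand(batch_size, N_FEATURES) < p).float()
    return mask * (2 * torch.rand(batch_size, N_FEATURES) - 1)

model = ToyCCModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(1_000):
    x = sample_batch(1024, p=0.01)
    loss = ((model(x) - torch.relu(x)) ** 2).mean()  # MSE against the 100 target ReLUs
    opt.zero_grad()
    loss.backward()
    opt.step()
```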

---

Outline:

(00:59) Summary

(02:42) Introduction

(04:38) Methods

(06:34) Results

(06:37) Qualitatively different solutions in sparse vs. dense input regimes

(09:49) Quantitative analysis of the Compressed Computation model

(13:09) Mechanism of the Compressed Computation model

(18:11) Mechanism of the dense solution

(20:55) Discussion

The original text contained 9 footnotes which were omitted from this narration.

---

First published:
June 23rd, 2025

Source:
https://www.lesswrong.com/posts/ZxFchCFJFcgysYsT9/compressed-computation-is-probably-not-computation-in

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Figure 1: The original model architecture from Braun et al. (2025), and our simpler equivalent model.
Figure 2: Loss per feature ($L/p$) as a function of evaluation sparsity. Each solid line corresponds to a model trained at a given sparsity. The models learn one of two solution types, depending on the input sparsity used during training: the “compressed computation” (CC) solution (violet) or a dense solution (green). Both types beat the naive baseline (dashed line) in their respective regime. Black circles connected by a dotted line represent the results seen by Braun et al. (2025), where models were evaluated only at their training sparsity.
Figure 3: Input/output behaviour of the two model types (for one-hot inputs): In the “compressed computation” solution (left panel), all features are represented similarly well: each input activates the corresponding output feature. In contrast, the dense solution (right panel) shows a strong (and more accurate) response for half the features, while barely responding to the other half. The green dashed line indicates the expected response under perfect performance.
Figure 4: Weights representing each input feature, split by neuron. Each bar corresponds to a feature (x-axis) and shows the adjusted weight value from $W_\text{out} \odot W_\text{in}$, split by neuron index (color). The CC solution (left) uses combinations of neurons to represent each feature (to around 70%), whereas the dense solution (right) allocates a single neuron to fully (~100%) represent 50 out of 100 features. (A code sketch of this breakdown appears after the figure list.)
Figure 5, left: Loss per feature as a function of input sparsity, for different choices of $M$. We compare an embedding-like $M$ (Braun et al. 2025, blue) to a fully random $M$ (green) and a symmetric $M$ (red), a random lower-triangular matrix mirrored across the diagonal; in both cases we set the magnitude of $M$ to the value that leads to the lowest loss. For comparison we also show a model trained on $M=0$ (yellow). We find that all non-zero $M$ lead to a qualitatively similar profile, and that a symmetrized random $M$ gives almost the same result as Braun et al. (2025). (A sketch of the symmetric $M$ construction appears after the figure list.) Right: Optimal loss as a function of mixing matrix magnitude $\sigma$ (separately trained for every $\sigma$). For small $\sigma$ the loss decreases linearly with the mixing matrix magnitude, suggesting the loss advantage over the naive solution stems from the mixing matrix $M$. At large values of $\sigma$, the loss increases again.
Figure 6: Training a model on the noisy dataset ($M \neq 0$), and then fine-tuning on the clean $M=0$ case. We see that the loss jumps back up as soon as we switch to the clean task. This is evidence against the hypothesis that training dynamics alone prevented the CC solution from being learned in the clean case.
Figure 7, left: Cosine similarity between various eigen- and singular vectors (x-axis) and MLP neuron directions (y-axis). We show eigenvectors in the top panels, singular vectors in the bottom panels, the $W_\text{in}$ matrix in the left panels, and the $W_\text{out}$ matrix in the right panels. In all cases we see that the top-50 vectors (sorted by eigen- / singular value) have significant dot product with the neurons, while the remaining 50 vectors have near-zero dot products (black). Right: We test how well the ReLU-free MLP (i.e. just the $W_\text{out} W_\text{in}$ projection) preserves various eigen- (orange) and singular (blue) directions. Confirming the previous result, we find the cosine similarity between the vectors before and after the projection to be high for only the top 50 vectors. (A sketch of this projection test appears after the figure list.)
Figure 8, left: Scatter plot of the entries of the product $W_\text{out} W_\text{in}$ and the mixing matrix $M$; the entries are clearly correlated. The diagonal entries of $W_\text{out} W_\text{in}$ are offset by a constant, and seem to be correlated at a higher slope. Right: Visualization of the MLP weight matrices $W_\text{in}$ and $W_\text{out}$. We highlight that $W_\text{in}$ has mostly positive entries (this makes sense as it feeds into the ReLU), and both matrices have a small number of large entries.
Figure 9: Loss of the SNMF solution, compared to the naive solution. As in Figure 5b, the solution does better than the naive loss for a range of $\sigma$ values, though the range is smaller and the loss is higher than for the trained model.
Figure 10, left: A non-zero offset in the $W_\text{out}$ entries of unrepresented features improves the loss in the dense regime. We determine the optimal value empirically for each input feature probability $p$. Right: This hand-coded naive + offset model (dashed lines) consistently matches or outperforms the model trained on clean labels (solid lines) in the dense regime. (Note that this plot only shows the clean dataset ($M=0$), which is why no solution outperforms the naive loss in the sparse regime.) (A sketch of this construction appears after the figure list.)
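Code sketches for the figures above follow. These are minimal, hedged reconstructions from the captions, not the authors' code.

For Figure 4's per-neuron breakdown: assuming $W_\text{in}$ has shape (n_neurons, n_features) and $W_\text{out}$ the transposed shape, neuron $k$'s contribution to the feature-$i$-to-feature-$i$ response is $W_\text{out}[i,k] \cdot W_\text{in}[k,i]$:

```python
import numpy as np

# Hypothetical trained weights, stand-ins for a real checkpoint.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(50, 100))    # (n_neurons, n_features)
W_out = rng.normal(size=(100, 50))   # (n_features, n_neurons)

# Elementwise product W_out.T * W_in gives, at [k, i], neuron k's
# contribution to the diagonal (feature i -> output i) weight.
contrib = W_out.T * W_in                 # shape (n_neurons, n_features)
per_feature_total = contrib.sum(axis=0)  # ~70% per feature for a trained CC model (Figure 4)
```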
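For Figure 5's symmetric $M$, one way to build "a random lower-triangular matrix mirrored across the diagonal"; Gaussian entries and a zeroed diagonal are our assumptions:

```python
import numpy as np

def symmetric_mixing_matrix(n_features=100, sigma=0.02, seed=0):
    """Random strictly-lower-triangular matrix mirrored across the diagonal."""
    rng = np.random.default_rng(seed)
    lower = np.tril(rng.normal(scale=sigma, size=(n_features, n_features)), k=-1)
    return lower + lower.T  # symmetric; diagonal left at zero (an assumption)
```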
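For Figure 7's right panel, assuming "preserved" means high cosine similarity between a direction $v$ and its image $W_\text{out} W_\text{in} v$:

```python
import numpy as np

def direction_preservation(W_in, W_out):
    """Cosine similarity between each right singular vector v of
    P = W_out @ W_in and its image P @ v (1.0 = direction preserved)."""
    P = W_out @ W_in                      # (n_features, n_features), ReLU-free MLP
    _, _, Vt = np.linalg.svd(P)           # rows of Vt: right singular vectors
    images = Vt @ P.T                     # row k is P @ Vt[k]
    norms = np.linalg.norm(images, axis=1) + 1e-12  # avoid division by zero
    return np.einsum("ij,ij->i", images, Vt) / norms  # Vt rows are unit-norm
```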
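For Figure 10's naive + offset model, as we read the construction (the exact weight layout is our assumption: neuron $k$ copies feature $k$, and unrepresented features receive a constant $c$ in their $W_\text{out}$ rows):

```python
import numpy as np

def naive_plus_offset(n_features=100, n_neurons=50, c=0.0):
    """Naive solution plus a constant offset c on unrepresented features.
    c = 0 recovers the plain naive baseline; c is tuned empirically per p."""
    W_in = np.zeros((n_neurons, n_features))
    W_in[:, :n_neurons] = np.eye(n_neurons)   # neuron k reads feature k
    W_out = np.zeros((n_features, n_neurons))
    W_out[:n_neurons] = np.eye(n_neurons)     # represented features: pass-through
    W_out[n_neurons:] = c                     # unrepresented features: offset c
    return W_in, W_out

def forward(x, W_in, W_out):
    """x: (n_features,) input; returns all 100 model outputs."""
    return W_out @ np.maximum(W_in @ x, 0.0)
```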

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
