
“Re: Recent Anthropic Safety Research” by Eliezer Yudkowsky

9 min • August 6, 2025

A reporter asked me for my off-the-record take on recent safety research from Anthropic. After I drafted an off-the-record reply, I realized that I was actually fine with it being on the record, so:

Since I never expected any of the current alignment technology to work in the limit of superintelligence, the only news to me is about when and how early dangers begin to materialize. Even taking Anthropic's results completely at face value would change not at all my own sense of how dangerous machine superintelligence would be, because what Anthropic says they found was already very solidly predicted to appear at one future point or another. I suppose people who were previously performing great skepticism about how none of this had ever been seen in ~Real Life~, ought in principle to now obligingly update, though of course most people in the AI industry won't. Maybe political leaders [...]

---

First published:
August 6th, 2025

Source:
https://www.lesswrong.com/posts/oDX5vcDTEei8WuoBx/re-recent-anthropic-safety-research

---

Narrated by TYPE III AUDIO.

Latest episodes

“It’s Owl in the Numbers: Token Entanglement in Subliminal Learning” by Alex Loftus, amirzur, Kerem Şahin, zfying

August 7 | 11 min

“No, Rationalism Is Not a Cult” by Liam Robins

August 7 | 20 min

“Interview with Kelsey Piper on Self-Censorship and the Vibe Shift” by Zack_M_Davis

August 7 | 26 min

“Claude, GPT, and Gemini All Struggle to Evade Monitors” by Vincent Cheng, Thomas Kwa

August 7 | 12 min

“The Problem” by Rob Bensinger, tanagrabeast, yams, So8res, Eliezer Yudkowsky, Gretta Duleba

August 5 | 50 min

“Concept Poisoning: Probing LLMs without probes” by Jan Betley, jorio, dylan_f, Owain_Evans

August 5 | 33 min

“Narrow finetuning is different” by cloud, Stewy Slocum

August 5 | 7 min

“On Altman’s Interview With Theo Von” by Zvi

August 5 | 17 min

“Interview with Steven Byrnes on Brain-like AGI, Foom & Doom, and Solving Technical Alignment” by Liron, Steven Byrnes

August 5 | 154 min

“Towards Alignment Auditing as a Numbers-Go-Up Science” by Sam Marks

August 4 | 18 min

“Alcohol is so bad for society that you should probably stop drinking” by KatWoods

August 4 | 16 min

“Permanent Disempowerment is the Baseline” by Vladimir_Nesov

August 4 | 11 min

“Should we aim for flourishing over mere survival? The Better Futures series.” by wdmacaskill

August 4 | 9 min

“Saying Goodbye” by sapphire

August 4 | 9 min

“Emotions Make Sense” by DaystarEld

August 3 | 36 min

“Whence the Inkhaven Residency?” by Ben Pace

August 2 | 5 min

“Many prediction markets would be better off as batched auctions” by William Howard

August 2 | 9 min

“How many species has humanity driven extinct?” by Raemon

August 2 | 1 min

“SB-1047 Documentary: The Post-Mortem” by Michaël Trazzi

August 2 | 10 min

“Podcast: Lincoln Quirk from Wave” by Elizabeth

August 2 | 2 min

“The Dark Arts As A Scaffolding Skill For Rationality” by Screwtape

August 1 | 11 min

“Steve Petersen funding” by abramdemski

August 1 | 1 min

“Two Kinds of Do Overs” by jefftk

August 1 | 4 min

“Red-Thing-Ism” by J Bostock

August 1 | 6 min

“Do Not Render Your Counterfactuals” by AlphaAndOmega

August 1 | 10 min

“Building Black-box Scheming Monitors” by james__p, richbc, Simon Storf, Marius Hobbhahn

July 31 | 24 min

“Follow-up to ‘My Empathy Is Rarely Kind’” by johnswentworth

July 31 | 4 min

“I am worried about near-term non-LLM AI developments” by testingthewaters

July 31 | 11 min

“Childhood and Education: College Admissions” by Zvi

July 31 | 34 min

“Optimizing The Final Output Can Obfuscate CoT (Research Note)” by lukemarks, jacob_drori, cloud, TurnTrout

July 30 | 12 min

“China proposes new global AI cooperation organisation” by Matrice Jacobine

July 30 | 2 min

“My Empathy Is Rarely Kind” by johnswentworth

July 30 | 6 min

“The many paths to permanent disempowerment even with shutdownable AIs (MATS project summary for feedback)” by GideonF

July 30 | 18 min

“Spilling the Tea” by Zvi

July 29 | 23 min

“I wrote a song parody” by CronoDAS

July 29 | 3 min

“Low P(x-risk) as the Bailey for Low P(doom)” by Vladimir_Nesov

July 29 | 5 min

“About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong” by bohaska

July 29 | 7 min

“Procrastination Drill” by silentbob

July 29 | 5 min

“Teaching kids to swim” by Steven Byrnes

July 29 | 5 min

“Recursions on LessOnline 2025” by Error

July 29 | 33 min

“Simplex Progress Report - July 2025” by Adam Shai, Paul Riechers, hrbigelow, Eric Alt, mntss

July 29 | 33 min

“Optimally Combining Probe Monitors and Black Box Monitors” by Tim Hua, jamesbaskerville, BionicD0LPH1N, Mia Hopman, Aryan Bhatt, Tyler Tracy

July 28 | 13 min

“AI Companion Piece” by Zvi

July 28 | 29 min

“This Is Not Life” by samhealy

July 28 | 44 min

“Sydney Bing Wikipedia Article: Sydney (Microsoft Prometheus)” by jdp

July 28 | 14 min

“Maya’s Escape” by Bridgett Kay

July 27 | 20 min

[Linkpost] “The Purpose of a System is what it Rewards” by robotelvis

July 27 | 3 min

“my experience on glp-1s as a thin person” by AnnaJo

July 26 | 17 min

“Anthropic Faces Potentially ‘Business-Ending’ Copyright Lawsuit” by garrison

July 26 | 14 min

“HPMOR: The (Probably) Untold Lore” by Gretta Duleba, Eliezer Yudkowsky

July 25 | 68 min

“We Built a Tool to Protect Your Dataset From Simple Scrapers” by TurnTrout, Edward Turner, Dipika Khullar

July 25 | 6 min

[Linkpost] “Reasoning-Finetuning Repurposes Latent Representations in Base Models” by Jake Ward, lccqqqqq, Neel Nanda

July 25 | 6 min

“Building and evaluating alignment auditing agents” by Sam Marks, Sam Bowman, Euan Ong, Johannes Treutlein, evhub

July 24 | 11 min

“The Whole Check” by JustisMills

July 24 | 7 min

“‘Behaviorist’ RL reward functions lead to scheming” by Steven Byrnes

July 24 | 21 min

[Linkpost] “A brief perspective from an IMO coordinator” by DirectedEvolution

July 23 | 2 min

“Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning” by kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan, Neel Nanda

July 23 | 12 min

“On ‘ChatGPT Psychosis’ and LLM Sycophancy” by jdp

July 23 | 30 min

“Google and OpenAI Get 2025 IMO Gold” by Zvi

July 23 | 58 min

“Unfaithful chain-of-thought as nudged reasoning” by Paul Bogdan, Uzay Macar, Arthur Conmy, Neel Nanda

July 23 | 20 min

“Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data” by cloud, mle, Owain_Evans

July 22 | 10 min

“Directly Try Solving Alignment for 5 weeks” by Kabir Kumar

July 22 | 14 min

[Linkpost] “Why Reality Has A Well-Known Math Bias” by Linch

July 22 | 3 min

“Do ‘adult developmental stages’ theories have any pre-theoretical motivation?” by Said Achmiz

July 22 | 6 min

“Monthly Roundup #32: July 2025” by Zvi

July 21 | 75 min

“If Anyone Builds It, Everyone Dies: Call for Translators (for Supplementary Materials)” by yams

July 21 | 3 min

“Detecting High-Stakes Interactions with Activation Probes” by Arrrlex, williambankes, Urja Pawar, Phil Bland, David Scott Krueger (formerly: capybaralet), Dmitrii Krasheninnikov

July 21 | 11 min

“HRT in Menopause: A candidate for a case study of epistemology in epidemiology, statistics & medicine” by foodforthought

July 21 | 7 min

[Linkpost] “GDM also claims IMO gold medal” by Yair Halberstadt

July 21 | 1 min

“[Fiction] Our Trial” by Nina Panickssery

July 21 | 7 min

“LLMs Can’t See Pixels or Characters” by Brendan Long

July 21 | 9 min

“Plato’s Trolley” by dr_s

July 21 | 12 min

“Your AI Safety org could get EU funding up to €9.08M. Here’s how (+ free personalized support)” by SamuelK

July 20 | 6 min

“Shallow Water is Dangerous Too” by jefftk

July 20 | 3 min

“Make More Grayspaces” by Duncan Sabien (Inactive)

July 20 | 23 min

[Linkpost] “AI Gets IMO Gold Medal: via general-purpose RL, not via narrow, task specific methodology” by Mikhail Samin

July 19 | 4 min

“A night-watchman ASI as a first step toward a great future” by Eric Neyman

July 19 | 21 min

“Love stays loved (formerly ‘Skin’)” by Swimmer963 (Miranda Dixon-Luinenburg)

July 18 | 51 min

“Why it’s hard to make settings for high-stakes control research” by Buck

July 18 | 7 min

“On METR’s AI Coding RCT” by Zvi

July 18 | 21 min

“Trying the Obvious Thing” by PranavG, Gabriel Alfour

July 18 | 7 min

“Video and transcript of talk on ‘Can goodness compete?’” by Joe Carlsmith

July 17 | 67 min

“On being sort of back and sort of new here” by Loki zen

July 17 | 5 min

“Comment on ‘Four Layers of Intellectual Conversation’” by Zack_M_Davis

July 17 | 10 min

“Selective Generalization: Improving Capabilities While Maintaining Alignment” by ariana_azarbal, Matthew A. Clarke, jorio, Cailley Factor, cloud

July 17 | 18 min

“Bodydouble / Thinking Assistant matchmaking” by Raemon

July 17 | 4 min

“Kimi K2” by Zvi

July 16 | 27 min

“Grok 4 Various Things” by Zvi

July 16 | 74 min

“Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety” by Tomek Korbak, Mikita Balesni, Vlad Mikulik, Rohin Shah

July 15 | 2 min

“Do confident short timelines make sense?” by TsviBT, abramdemski

July 15 | 131 min

[Linkpost] “LLM-induced craziness and base rates” by Kaj_Sotala

July 15 | 4 min

[Linkpost] “Bernie Sanders (I-VT) mentions AI loss of control risk in Gizmodo interview” by Matrice Jacobine

July 14 | 3 min

“Recent Redwood Research project proposals” by ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, Joey Yudelson

July 14 | 8 min

“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda

July 14 | 11 min

“Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings” by Casey Barkan, Sid Black, Oliver Sourbut

July 14 | 24 min

“Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance” by Senthooran Rajamanoharan, Neel Nanda

July 14 | 19 min

“Worse Than MechaHitler” by Zvi

July 14 | 46 min

“How Does Time Horizon Vary Across Domains?” by Thomas Kwa

July 14 | 37 min

“xAI’s Grok 4 has no meaningful safety guardrails” by eleventhsavi0r

July 14 | 11 min

“Stop and check! The parable of the prince and the dog” by Dumbledore’s Army

July 14 | 3 min

“OpenAI Model Differentiation 101” by Zvi

July 14 | 21 min

“10x more training compute = 5x greater task length (kind of)” by Expertium

July 14 | 5 min

“Three Missing Cakes, or One Turbulent Critic?” by Benquo

July 14 | 5 min

“You can get LLMs to say almost anything you want” by Kaj_Sotala

July 13 | 24 min

“against that one rationalist mashal about japanese fifth-columnists” by Fraser

July 13 | 6 min

“Surprises and learnings from almost two months of Leo Panickssery” by Nina Panickssery

July 13 | 12 min

“Vitalik’s Response to AI 2027” by Daniel Kokotajlo

July 12 | 24 min

“the jackpot age” by thiccythot

July 12 | 13 min

“Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” by habryka

July 11 | 12 min

[Linkpost] “Guide to Redwood’s writing” by Julian Stastny

July 11 | 1 min

“So You Think You’ve Awoken ChatGPT” by JustisMills

July 11 | 18 min

[Linkpost] “Open Global Investment as a Governance Model for AGI” by Nick Bostrom

July 10 | 2 min

“what makes Claude 3 Opus misaligned” by janus

July 10 | 9 min

“Lessons from the Iraq War about AI policy” by Buck

July 10 | 8 min

“Generalized Hangriness: A Standard Rationalist Stance Toward Emotions” by johnswentworth

July 10 | 12 min

“Evaluating and monitoring for AI scheming” by Vika, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, Rohin Shah

July 10 | 11 min

“White Box Control at UK AISI - Update on Sandbagging Investigations” by Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney

July 10 | 41 min

“80,000 Hours is producing AI in Context — a new YouTube channel. Our first video, about the AI 2027 scenario, is up!” by chanamessinger

July 10 | 6 min

[Linkpost] “No, We’re Not Getting Meaningful Oversight of AI” by Davidmanheim

July 10 | 2 min

“What’s worse, spies or schemers?” by Buck, Julian Stastny

July 9 | 10 min

“Applying right-wing frames to AGI (geo)politics” by Richard_Ngo

July 9 | 6 min

“No, Grok, No” by Zvi

July 9 | 39 min

“A deep critique of AI 2027’s bad timeline models” by titotal

July 9 | 73 min

“Subway Particle Levels Aren’t That High” by jefftk

July 9 | 3 min

“An Opinionated Guide to Using Anki Correctly” by Luise

July 9 | 54 min

“Why Do Some Language Models Fake Alignment While Others Don’t?” by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger

July 8 | 11 min

“Balsa Update: Springtime in DC” by Zvi

July 8 | 20 min

[Linkpost] “A Theory of Structural Independence” by Matthias G. Mayer

July 8 | 3 min

“On Alpha School” by Zvi

July 8 | 24 min

“You Can’t Objectively Compare Seven Bees to One Human” by J Bostock

July 8 | 7 min

“Literature Review: Risks of MDMA” by Elizabeth

July 7 | 8 min

“45 - Samuel Albanie on DeepMind’s AGI Safety Approach” by DanielFilan

July 7 | 77 min

“On the functional self of LLMs” by eggsyntax

July 7 | 23 min

“Shutdown Resistance in Reasoning Models” by benwr, JeremySchlatter, Jeffrey Ladish

July 6 | 18 min

“The Cult of Pain” by Martin Sustrik

July 5 | 6 min

[Linkpost] “Claude is a Ravenclaw” by Adam Newgas

July 5 | 2 min

“‘Buckle up bucko, this ain’t over till it’s over.’” by Raemon

July 5 | 6 min

“How much novel security-critical infrastructure do you need during the singularity?” by Buck

July 5 | 10 min

“‘AI for societal uplift’ as a path to victory” by Raymond Douglas

July 4 | 4 min

“Two proposed projects on abstract analogies for scheming” by Julian Stastny

July 4 | 7 min

“Outlive: A Critical Review” by MichaelDickens

July 4 | 60 min

“Authors Have a Responsibility to Communicate Clearly” by TurnTrout

July 4 | 11 min

[Linkpost] “MIRI Newsletter #123” by Harlan, Rob Bensinger

July 4 | 4 min

[Linkpost] “Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals” by Marius Hobbhahn

July 3 | 3 min

“Call for suggestions - AI safety course” by boazbarak

July 3 | 3 min

[Linkpost] “IABIED: Advertisement design competition” by yams

July 3 | 2 min

“Congress Asks Better Questions” by Zvi

July 3 | 30 min

“Curing PMS with Hair Loss Pills” by David Lorell

July 2 | 16 min

“AI Task Length Horizons in Offensive Cybersecurity” by Sean Peters

July 2 | 24 min

“Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild” by Adam Karvonen, Sam Marks

July 2 | 8 min

“There are two fundamentally different constraints on schemers” by Buck

July 2 | 7 min

“‘What’s my goal?’” by Raemon

July 2 | 4 min

“A Simple Explanation of AGI Risk” by TurnTrout

July 2 | 10 min

“AI Moratorium Stripped From BBB” by Zvi

July 1 | 10 min

“Scientific Discovery in the Age of Artificial Intelligence” by Jessica Rumbelow

July 1 | 20 min

“SLT for AI Safety” by Jesse Hoogland

July 1 | 8 min

“The best simple argument for Pausing AI?” by Gary Marcus

July 1 | 2 min

“SAE on activation differences” by Santiago Aranguri, jacob_drori, Neel Nanda

July 1 | 11 min

“What We Learned Trying to Diff Base and Chat Models (And Why It Matters)” by Clément Dumas, Julian Minder, Neel Nanda

June 30 | 20 min

“If you want to be vegan but you worry about health effects of no meat, consider being vegan except for mussels/oysters” by KatWoods

June 30 | 1 min

[Linkpost] “Project Vend: Can Claude run a small shop?” by Gunnar_Zarncke

June 30 | 1 min

“Paradigms for computation” by Cole Wyeth

June 30 | 20 min

“life lessons from poker” by thiccythot

June 30 | 9 min

“Circuits in Superposition 2: Now with Less Wrong Math” by Linda Linsefors, Lucius Bushnaq

June 30 | 38 min

“I underestimated safety research speedups from safe AI” by Dan Braun

June 30 | 6 min

“Conciseness Manifesto” by Vasyl Dotsenko

June 29 | 0 min

“Support for bedrock liberal principles seems to be in pretty bad shape these days” by Max H

June 29 | 7 min

[Linkpost] “A Depressed Shrink Tries Shrooms” by AlphaAndOmega

June 29 | 1 min

[Linkpost] “[Paper] Stochastic Parameter Decomposition” by Lee Sharkey, Lucius Bushnaq, Dan Braun

June 28 | 2 min

“Childhood and Education #11: The Art of Learning” by Zvi

June 28 | 24 min

“Proposal for making credible commitments to AIs.” by Cleo Nardo

June 28 | 5 min

“Epoch: What is Epoch?” by Zach Stein-Perlman

June 27 | 15 min

“Recent and forecasted rates of software and hardware progress” by elifland

June 27 | 2 min

“Jankily controlling superintelligence” by ryan_greenblatt

June 27 | 14 min

“Help the AI 2027 team make an online AGI wargame” by Jonas V

June 27 | 2 min

“A Guide For LLM-Assisted Web Research” by nikos, dschwarz, Lawrence Phillips, FutureSearch

June 27 | 15 min

“If Not Now, When?” by Yair Halberstadt

June 27 | 2 min

“A case for courage, when speaking of AI danger” by So8res

June 27 | 10 min

“The Industrial Explosion” by rosehadshar, Tom Davidson

June 26 | 32 min

“Summary of John Halstead’s Book-Length Report on Existential Risks From Climate Change” by Bentham’s Bulldog

June 26 | 45 min

“Tech for Thinking” by sarahconstantin

June 26 | 4 min

“Lurking in the Noise” by J Bostock

June 25 | 8 min

[Linkpost] “New Paper: Ambiguous Online Learning” by Vanessa Kosoy

June 25 | 3 min

“Melatonin Self-Experiment Results” by silentbob

June 25 | 20 min

“What does 10x-ing effective compute get you?” by ryan_greenblatt

June 25 | 21 min

“A regime-change power-vacuum conjecture about group belief” by TsviBT

June 25 | 6 min

“Analyzing A Critique Of The AI 2027 Timeline Forecasts” by Zvi

June 24 | 55 min

“Why ‘training against scheming’ is hard” by Marius Hobbhahn

June 24 | 24 min

“My pitch for the AI Village” by Daniel Kokotajlo

June 24 | 13 min

“Situational Awareness: A One-Year Retrospective” by Nathan Delisle

June 24 | 33 min

“Compressed Computation is (probably) not Computation in Superposition” by Jai Bhagat, Sara Molas Medina, Giorgi Giglemiani, StefanHex

June 23 | 22 min

“‘It isn’t magic’” by Ben (Berlin)

June 23 | 4 min

“Foom & Doom 1: ‘Brain in a box in a basement’” by Steven Byrnes

June 23 | 59 min

“Foom & Doom 2: Technical alignment is hard” by Steven Byrnes

June 23 | 57 min

“Comparing risk from internally-deployed AI to insider and outsider threats from humans” by Buck

June 23 | 5 min

“Clarifying ‘wisdom’: Foundational topics for aligned AIs to prioritize before irreversible decisions” by Anthony DiGiovanni

June 23 | 27 min

“Racial Dating Preferences and Sexual Racism” by koreindian

June 23 | 64 min

“The Sixteen Kinds of Intimacy” by Ruby

June 22 | 10 min

“Consider chilling out in 2028” by Valentine

June 21 | 24 min

“the sillk pajamas effect” by thiccythot

June 21 | 10 min

“Genomic emancipation” by TsviBT

June 21 | 113 min

“Making deals with early schemers” by Julian Stastny, Olli Järviniemi, Buck

June 21 | 28 min

“AI #121 Part 2: The OpenAI Files” by Zvi

June 21 | 78 min

“Musings on AI Companies of 2025-2026 (Jun 2025)” by Vladimir_Nesov

June 21 | 7 min

“Agentic Misalignment: How LLMs Could be Insider Threats” by Aengus Lynch, Benjamin Wright, Ethan Perez, evhub

June 20 | 12 min

“Did the Army Poison a Bunch of Women in Minnesota?” by rba

June 20 | 10 min

“X explains Z% of the variance in Y” by Leon Lang

June 20 | 19 min

“AI safety techniques leveraging distillation” by ryan_greenblatt

June 19 | 21 min

“Sparsely-connected cross-layer transcoders: preliminary findings” by jacob_drori

June 19 | 33 min

“New Endorsements for ‘If Anyone Builds It, Everyone Dies’” by Malo

June 18 | 9 min

“Fictional Thinking vs Real Thinking” by johnswentworth

June 18 | 8 min

“I made a card game to reduce cognitive biases and logical fallacies but I’m not sure what DV to test in a study on its effectiveness.” by Brad Dunn

June 18 | 9 min

“Prover-Estimator Debate: A New Scalable Oversight Protocol” by Jonah Brown-Cohen, Geoffrey Irving

June 17 | 11 min

“Ok, AI Can Write Pretty Good Fiction Now” by JustisMills

June 17 | 13 min

“Debate experiments at The Curve, LessOnline and Manifest” by Nathan Young

June 17 | 10 min

“Why we’re still doing normal school” by juliawise

June 17 | 5 min

“Endometriosis is an incredibly interesting disease” by Abhishaike Mahajan

June 17 | 35 min

“Estrogen: A trip report” by cube_flipper

June 17 | 51 min

“Intelligence Is Not Magic, But Your Threshold For ‘Magic’ Is Pretty Low” by Expertium

June 17 | 3 min

“Some reprogenetics-related projects you could help with” by TsviBT

June 17 | 8 min

“RTFB: The RAISE Act” by Zvi

June 17 | 16 min

“Model Organisms for Emergent Misalignment” by Anna Soligo, Edward Turner, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda

June 17 | 12 min

“Convergent Linear Representations of Emergent Misalignment” by Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda

June 17 | 19 min

“Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models” by James Chua, Owain_Evans

June 17 | 19 min

[Linkpost] “the void” by nostalgebraist

June 11 | 1 min

“Expectation = intention = setpoint” by jimmy

June 11 | 22 min

“Give Me a Reason(ing Model)” by Zvi

June 10 | 12 min

“Mech interp is not pre-paradigmatic” by Lee Sharkey

June 10 | 30 min

“The True Goal Fallacy” by adamShimi

June 10 | 13 min

“Ghiblification for Privacy” by jefftk

June 10 | 2 min

June 10 | 0 min

“When is it important that open-weight models aren’t released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.” by ryan_greenblatt

June 9 | 17 min

“Administering immunotherapy in the morning seems to really, really matter. Why?” by Abhishaike Mahajan

June 9 | 23 min

[Linkpost] “METR: Recent frontier models are reward hacking” by Daniel Kokotajlo

June 9 | 1 min

“Levels of Doom: Eutopia, Disempowerment, Extinction” by Vladimir_Nesov

June 9 | 4 min

“AI companies’ eval reports mostly don’t support their claims” by Zach Stein-Perlman

June 9 | 8 min

“Busking with Kids” by jefftk

June 9 | 3 min

“Emergent Misalignment on a Budget” by Valerio Pepe

June 9 | 17 min

“Letting Kids Be Outside” by jefftk

June 8 | 8 min

“On working 80%” by adrische

June 8 | 5 min

“Solo Park Play at Three” by jefftk

June 7 | 2 min

“The Mirror Trap” by Cameron Berg

June 7 | 9 min

“LLM in-context learning as (approximating) Solomonoff induction” by Cole Wyeth

June 6 | 9 min

“Opus 4.1 Is An Incremental Improvement” by Zvi