[Linkpost] “AI Gets IMO Gold Medal: via general-purpose RL, not via narrow, task specific methodology” by Mikhail Samin

4 min • 19 July 2025

This is a link post.

I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world's most prestigious math competition—the International Math Olympiad (IMO).

We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5-hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.

Why is this a big deal? First, IMO problems demand a new level of sustained creative thinking compared to past benchmarks. In reasoning time horizon, we’ve now progressed from GSM8K (~0.1 min for top humans) → MATH benchmark (~1 min) → AIME (~10 mins) → IMO (~100 mins).

Second, IMO submissions are hard-to-verify, multi-page proofs. Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can [...]

---

First published:
July 19th, 2025

Source:
https://www.lesswrong.com/posts/RcBqeJ8GHM2LygQK3/ai-gets-imo-gold-medal-via-general-purpose-rl-not-via-narrow

Linkpost URL:
https://x.com/alexwei_/status/1946477742855532918

---

Narrated by TYPE III AUDIO.

---

Images from the article:

https://github.com/aw31/openai-imo-2025-proofs/blob/main/problem_1.txt

