In 2025, seven-month-old startup Axiom solved all 12 of the problems Putnam exam (scoring 8/12 in the time limit) a prestigious undergraduate math exam. The 12/12 score is better than the top undergraduates (110/120) and the closest AI system that reported a result (DeepSeek 103/120), although it is unclear what the people and other systems would have scored with more time. Nonetheless, the Putnam exam is legendary for its difficulty, with the median score typically being 0 or 1 points. Taken by itself, this seems like a minor feather in the cap of AI; one of a long series of accomplishments by AI systems in elite competitions with humans, starting with Deep Blue beating Kasparov.
Fast forward to mid-2026, and Claude Code is eating the world. In 2024 Anthropicâs bet on code and enterprise looked like a more pragmatic niche play vs. OpenAIâs better models and massive consume scale. Today, Amodeiâs all in bet on acceleration via code (images and video be damned) seems prescient.
Despite Anthropicâs growing momentum, however, Axiom CEO Carina Hong sees coding ability as a necessary but not sufficient milestone on the path to AGI. Code arguably pushes the jagged frontier to the point of super intelligence in some domains outside of coding, but there are surprising gaps (link) that Carina believes will bottleneck AI progress. (Stats on math benchmarks).
The informal bottleneck
âVerified AIâ sounds like eating broccoli (footnote: I actually love broccoli, but then again, I also believe strongly in Test Driven Development, so ÂŻ\(ă)/ÂŻ ) and paying taxes, but to Axiom it means something very different. âVerification to me is about scaling brilliance, compounding brilliance,â Carina told us.
It actually took a while for me to understand what she means by this. It sounded like marketing-speak to me, until it clicked. Carina emphasizes an story about legendary mathematician Srinivasa Ramanujan to illustrate the point. When G.H. Hardy finally persuaded Ramanujan to formally prove theorems instead of relying on his (formidable) intuition, it reportedly improved his own capabilities. This is presumably because formally proving things forced Ramanujan to articulate the details in a way that open up new lines of thinking, etc. This is one part of âcompounding.â
But formally proving things also allowed others to benefit from his intuition: the proofs are way of communicating an intuition and persuading others that the intuition is correct. This is scaling (more people use the result) and compounding (people can learn from and build on his work).
This is the analogy that Carina wants us to focus on.
Verified Generation
There are two ways that Verified AI shows up: in training and in inference.
But a quick detour: to a first approximation, âFormal Verificationâ means using type checkers (like for TypeScript, C++ or Rust, but more capable) to verify mathematical proofs that are meticulously specified using a language like Lean (footnote: Formal verification also includes model checking (TLA+, SPIN), SMT-based tools (Dafny, F*, Why3), and refinement-type systems (Liquid Haskell) â many of which donât look much like âtype checking a proofâ from the userâs perspective even when thereâs a similar logical core underneath. It also gets applied to software and hardware correctness, not only pure mathematics.). It takes a lot of work to translate an âinformalâ proof (albeit one that most people would not remotely call âinformalâ) in to a Lean proof (footnote: This is an understatement. Most theorems remain informal because formalization is so hard to do. There has been a great deal of effort to formalize the most important proofs, with mixed results)
You can imagine how this would be (very) useful during Reinforcement Learning: instead of relying on best guesses based on statistics (GRPO, RLHF, etc.), you can just verify the proof is correct using a Lean verifier. This is obviously a much stronger reward signal, akin to compiling code and testing it (which is what is typically done with RL on coding).
The catch: LLM are not (currently) very good at proving things with Lean.
Enter Axiom: While they have not officially reported benchmark numbers besides the 12/12 Putnam result, Carina reports that they have achieved a very impressive 99% (187/189) ProofGen on the Verina benchmark. This benchmark is to generate code and proof of correctness for a series of problems. For context, OpenAI o3 (the last known OpenAI run) achieved 4.9% on this benchmark.
Based on the sparse benchmarking, itâs hard to say what the frontier labs are currently doing, but Carina suggests that they still are not training to generate Lean proofs directly, rather relying on informal proofs.
Time will tell if the frontier labsâ current approaches will close this gap.
Scaling and compounding
Carinaâs Ramanujan analogy is pretty direct. Better proofs â better Lean generation â better RL. A stronger signal means higher sample efficiency and higher maximum performance. Great!
Scaling is pretty clear too: once I have proved something in Lean, the quality of the output is basically (footnote: one might argue that its a bit lower because the proof is in distribution for the LLM) as high as if it came from a human, so my high quality training set has grown in a way that an informal rollout corpus cannot. I can trust my Lean proofs.
Compounding is also clear: now all of future inference and training can build upon those proofs.
On the other hand, a model trained only using statistical signals like GRPO during RL lacks the sample efficiency, maximum performance and compounding corpus that a system that uses formal verification benefits from.
All roads lead to verification
Broccoli and taxes notwithstanding, âverificationâ has shown up in a lot of conversations recently. In the in physical system control:
âI think [verifiability] is probably the hardest problem right now, because the as the models get better, it can be harder and harder to find the faults on the system. And so the problem of doing proper eval to find those faults, that problem also keeps getting harder as the models get better.â -
In theoretical physics:
ââŠnow that weâre in this regime where you can just get ChatGPT to tackle thousands of questions at the same time, it will return proofs for a significant fraction of them. Now actually the onus is back on the humans to verify all the outputs. And so, yeah, as that becomes a bottleneck, I think formalizing math and automating verification will become more valuable.â -
Verification is, in fact, the key differences between AI for science and AI for computation: in science you to have to actually test (verify) your hypothesis by performing physical experiments. Lab in the loop systems like Radical AI and Lila build around exactly this premise (we have recorded episodes with both of these teams and will release them soon!)
And yes, formally verifying critical systems such as flight control, nuclear power plants and pacemakers is a growing focus as the software and hardware that run them becomes more complex.
Carina believes so strongly that AGI requires verified generation that she makes the unqualified claim that âWe do not believe there is any other possible future.â
Expensive to produce, cheap to verify
Lean proofs are hard generate, but they can be easily shown to be correct or incorrect. But how do you know that the proof you created maps correctly to the problem you care about? As Carina puts it: âAnything that can be specified can be proven. Humans are bad at specifying everything we want.â
Are we now in the specification business? Check out the episode to hear Carinaâs take, as well as:
* Why hardware verification is a killer app
* Details on the AXLE open API and recently released Discovery toolkit
* The Erdos debacle
* The OpenAI GPT-f diaspora
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Fler avsnitt av Latent Space: The AI Engineer Podcast
Visa alla avsnitt av Latent Space: The AI Engineer PodcastLatent Space: The AI Engineer Podcast med Latent.Space finns tillgÀnglig pÄ flera plattformar. Informationen pÄ denna sida kommer frÄn offentliga podd-flöden.
