LessWrong (30+ Karma)

“Claude 4 You: The Quest for Mundane Utility” by Zvi

38 min • 27 May 2025

How good are Claude Opus 4 and Claude Sonnet 4?

They’re good models, sir.

If you don’t care about price or speed, Opus is probably the best model available today.

If you do care somewhat, Sonnet 4 is probably best in its class for many purposes, and deserves the 4 label because of its agentic aspects but isn’t a big leap over 3.7 for other purposes. I have been using 90%+ Opus so I can’t speak to this directly. There are some signs of some amount of ‘small model smell’ where Sonnet 4 has focused on common cases at the expense of rarer ones. That's what Opus is for.

That's all as of when I hit post. Things do escalate quickly these days, although I would not include Grok in this loop until proven otherwise, it's a three horse race and if you told me [...]

---

Outline:

(01:17) On Your Marks

(05:32) Standard Silly Benchmarks

(11:09) API Upgrades

(12:45) Coding Time Horizon

(13:47) The Key Missing Feature is Memory

(14:52) Early Reactions

(26:12) Opus 4 Has the Opus Nature

(32:27) Unprompted Attention

(35:09) Max Subscription

(36:24) In Summary

---

First published:
May 26th, 2025

Source:
https://www.lesswrong.com/posts/cQizFzEvZ8esKJGST/claude-4-you-the-quest-for-mundane-utility

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Question and response about comparing weights of feathers versus bricks.
Meme showing major AI companies claiming
Bar graph comparing AI model performance scores, ranging from 30% to 100%.
Two identical cartoon astronauts, each balancing a brown horse overhead.
Line graph
Benchmark comparison table showing Claude 4 models versus other AI models.
Screenshot of a classic riddle about a surgeon and patient, with explanation underneath. The riddle is designed to reveal unconscious stereotypes about medical professions, and is followed by an analysis of why the puzzle is effective at demonstrating implicit biases.
Four maps showing different representations of Europe and land masses, colored shapes. This appears to be a comparison of cartographic or visualization styles: two maps of Europe in the top panels (using colored polygons to represent countries) and two simplified landmass shapes in the bottom panels (one in gray, one in green).
Bingo card showing common AI company release announcement clichés and patterns.
Benchmark comparison chart showing performance and costs of AI coding models.
Yellow peace sign or victory hand gesture emoji

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
