Evaluating New LLMs: Challenges and Considerations for Developers

https://simonwillison.net/2025/Nov/24/claude-opus/#atom-everything (simonwillison.net)

Submitted by alonkatz • 11/29/2025

🤖 AI Summary (85% confidence)

The article covers the release of Anthropic's Claude Opus 4.5, noting its improvements over previous models and the challenge of evaluating new LLMs. The author reflects on how hard it is to identify concrete capability gains between a new model and existing ones.

Key insight: New model releases should include concrete examples of tasks they can solve that previous models could not.
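
One way to picture the key insight is a minimal sketch, assuming you already have pass/fail results for the same set of tasks from an older and a newer model. The function name `newly_solved` and the task names are hypothetical, chosen only for illustration; nothing here comes from the article or from any vendor's evaluation tooling.

```python
def newly_solved(old_results: dict[str, bool], new_results: dict[str, bool]) -> list[str]:
    """Return tasks that the new model passes and the old model failed.

    Only tasks present in both result sets are compared, so a task the old
    model was never run on is not counted as a new capability.
    """
    return sorted(
        task
        for task, passed in new_results.items()
        if passed and task in old_results and not old_results[task]
    )


# Hypothetical pass/fail outcomes; the task names are made up for illustration.
old = {"refactor-module": True, "fix-flaky-test": False, "migrate-schema": False}
new = {"refactor-module": True, "fix-flaky-test": True, "migrate-schema": False}

print(newly_solved(old, new))  # ['fix-flaky-test']
```

A list like this is the kind of concrete evidence the key insight asks for: specific tasks where the outcome changed, rather than an aggregate score.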

Technique: Prompt Injection Robustness

Claude Opus 4.5 • Claude Sonnet 4.5 • Gemini 3 Pro • GPT-5.1-Codex-Max
