Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult
Analysis of the Claude Opus 4.5 LLM release and the growing difficulty of evaluating incremental improvements between AI models.
SimonWillison.net is the long-running blog of Simon Willison, a software engineer, open-source creator, and co-creator of the Django web framework. He writes about Python, Django, Datasette, AI tooling, prompt engineering, search, databases, APIs, data journalism, and practical software architecture. The blog includes detailed notes from experiments, conference talks, and real projects. Readers will find clear explanations of topics such as LLM workflows, SQL patterns, data publishing, scraping, deployment, caching, and modern developer tooling. Simon also publishes frequent micro-posts and TIL entries that document small discoveries and tricks from day-to-day engineering work. The tone is practical and research-oriented, making the site a valuable resource for anyone interested in serious engineering and open data.
213 articles from this blog
Release notes for sqlite-utils 3.39, featuring bug fixes for plugin installation with uv and new functionality for custom SQL functions.
Announcing sqlite-utils 4.0a1, a Python library and CLI for SQLite, detailing minor backwards-incompatible changes before the stable release.
Analysis of how engineering management trends shift with business cycles, highlighting core skills that remain constant.
Armin Ronacher discusses challenges in AI agent design, including abstraction issues, testing difficulties, and API synchronization problems.
Olmo 3 is a new fully open-source large language model from AI2, featuring training data, code, and unique interpretability for reasoning traces.
Explains dependency cooldowns, a strategy to reduce supply chain attack risk by delaying automatic dependency updates.
Analysis of Google's new Nano Banana Pro image generation model, covering its advanced features, API pricing, and real-world testing results.
Explores how LLMs could enable malware to find personal secrets for blackmail, moving beyond simple ransomware attacks.
OpenAI releases GPT-5.1-Codex-Max, a new AI model focused on agentic coding tasks, featuring advanced context compaction for long-running work.
Simon Willison explains his automated workflow using SQL, Datasette, and Observable to generate a Substack newsletter from his blog content.
Analysis of a major Cloudflare outage caused by a database permissions change and software panic, quoting CEO Matthew Prince.
New release of the llm-gemini plugin adds support for nested Pydantic schemas, YouTube URL attachments, and the latest Gemini 3 Pro model.
MacWhisper's new Automatic Speaker Recognition feature, powered by NVIDIA Parakeet, accurately identifies speakers in audio transcripts.
Google Antigravity is a new AI-powered IDE that integrates with Gemini models for agentic coding, featuring browser testing and automated documentation.
Ethan Mollick reflects on AI's rapid evolution from chatbots to digital coworkers, highlighting the changing role of human oversight.
A hands-on review of Google's new Gemini 3 Pro AI model, covering its features, benchmarks, pricing, and testing its multimodal capabilities.
Discusses the future of small open-source libraries in the age of LLMs, questioning their relevance when AI can generate the specific code a project needs.
Explores Andrej Karpathy's concept of Software 2.0, where AI writes programs through objectives and gradient descent, focusing on task verifiability.
Release of llm-anthropic plugin 0.22 with support for Claude's structured outputs and web search tool integration.