Benchmarks articles

12/4/2025 • EN

#AI horizons 25-11 – Kimi K2 Thinking and the New AI Balance of Power

Analysis of China's Kimi K2 Thinking AI model, a low-cost, open-weight model challenging US dominance in reasoning and agentic tasks.

artificial intelligence Benchmarks Model Training Open Weight Models Reasoning Models

Daniele Grandini

11/7/2025 • EN

Kimi K2 Thinking

Moonshot AI's Kimi K2 Thinking is a 1-trillion-parameter open-weight model optimized for multi-step reasoning and long-running tool calls.

AI Agents Benchmarks large language models Model Quantization Tool Use

Simon Willison

11/6/2025 • EN

Quoting Nathan Lambert

Analysis of the rising prominence of Chinese AI labs like DeepSeek and Kimi in the global AI landscape and their rapid technological advancements.

ai development Benchmarks Chinese AI Labs Foundation Models Machine Learning

Simon Willison

10/15/2025 • EN

AI Agent Benchmark Compendium

A comprehensive overview of more than 50 modern AI agent benchmarks, categorized into function calling, reasoning, coding, and computer interaction tasks.

AI Agents Benchmarks Function Calling LLM Evaluation Tool Use

Philipp Schmid

8/9/2025 • EN

The Worst AI Metric

Critique of the 'how many r's in strawberry' test as a poor benchmark for AI intelligence, arguing it measures irrelevant trivia.

ai artificial intelligence Benchmarks Language Models metrics

Daniel Miessler

6/17/2025 • EN

Language model benchmarks only tell half a story

Explains why standard language model benchmarks are insufficient and how to build custom benchmarks for specific application needs.

Benchmarks Dev Proxy Language Models Ollama OpenAI API

Waldek Mastykarz

1/21/2025 • EN

Notes on ‘AI Engineering’ (Chip Huyen) chapter 3

Summarizes key challenges and methods for evaluating open-ended responses from large language models and foundation models, based on Chip Huyen's book.

AI Evaluation Benchmarks Foundation Models Language Model Metrics llm

Alex Strick van Linschoten

9/10/2020 • EN

Ordered Dictionaries

Compares Python's OrderedDict vs standard dict performance, explaining when and why OrderedDict is still useful.

Benchmarks Collections Dictionaries Ordereddict Python

Sebastian Witowski

9/9/2017 • EN

Faster and safer Haskell - benchmarks for the accumulating parameter

Benchmarks comparing Haskell list-length implementations, showing how strict tail recursion with an accumulating parameter improves performance and memory safety.

Accumulating Parameter Benchmarks Haskell recursion Tail Recursion

Marcelo Lazaroni