Serious Data From Testing LLMs
A data-driven analysis of LLM performance on a simple retrieval task, highlighting the need for evidence-based AI testing.
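A minimal sketch of the kind of retrieval test that analysis describes: plant a known fact (a "needle") at varying positions inside filler text and measure how often the model retrieves it. The `ask` function here is a hypothetical stand-in for whatever chat-completion client you use; the prompt wording is an assumption, not taken from the article.

```python
def make_prompt(needle: str, filler: str, position: float) -> str:
    """Insert the needle at a relative position (0.0-1.0) within the filler text."""
    cut = int(len(filler) * position)
    context = filler[:cut] + " " + needle + " " + filler[cut:]
    return f"{context}\n\nQuestion: What is the magic number? Answer with the number only."

def retrieval_accuracy(ask, needle_value: str, filler: str, trials: int = 10) -> float:
    """Fraction of trials where the model retrieves the planted fact."""
    hits = 0
    for i in range(trials):
        position = i / max(trials - 1, 1)  # sweep the needle through the context
        prompt = make_prompt(f"The magic number is {needle_value}.", filler, position)
        answer = ask(prompt)               # hypothetical model call
        hits += needle_value in answer
    return hits / trials
```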
A defense of systematic AI evaluation (evals) in development, arguing they are essential for measuring application quality and improving models.
A summary of a practical session on analyzing and improving LLM applications by identifying failure modes through data clustering and iterative testing.
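A minimal sketch of the clustering step that session describes: embed logged failure traces and group them so recurring failure modes surface as clusters you can inspect and name by hand. This uses sentence-transformers and scikit-learn as assumed tooling; the embedding model name is illustrative, not the session's choice.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_failures(failure_traces: list[str], n_clusters: int = 5) -> dict[int, list[str]]:
    """Group similar failure traces; inspect each cluster manually to label the failure mode."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    embeddings = embedder.encode(failure_traces)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for trace, label in zip(failure_traces, labels):
        clusters.setdefault(int(label), []).append(trace)
    return clusters
```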
Summarizes key challenges and methods for evaluating open-ended responses from large language models and foundation models, based on Chip Huyen's book.
Explores a method that uses a "judging AI" (such as o1-preview) to evaluate how well other AI models perform on tasks relative to human capability.
The author judges a Weights & Biases hackathon focused on building LLM evaluation tools, discussing key design considerations and project highlights.
A survey of using LLMs as evaluators (LLM-as-Judge) for assessing AI model outputs, covering techniques, use cases, and critiques.
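A minimal sketch of the LLM-as-Judge pattern the survey covers: a judge model grades a candidate answer against a rubric and returns a numeric score. The `complete` function is a hypothetical stand-in for your LLM client, and the rubric and 1-5 scale are assumptions for illustration, not taken from the survey.

```python
JUDGE_TEMPLATE = """You are grading an AI assistant's answer.

Question: {question}
Answer: {answer}

Score the answer from 1 (useless) to 5 (excellent) for correctness
and helpfulness. Reply with the score only."""

def judge(complete, question: str, answer: str) -> int:
    """Ask the judge model for a 1-5 score; return -1 if the reply is unparseable."""
    reply = complete(JUDGE_TEMPLATE.format(question=question, answer=answer))
    try:
        return int(reply.strip()[0])
    except (ValueError, IndexError):
        return -1
```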