What are AI Evals?
Explains AI evals: automated checks for non-deterministic AI outputs that use LLMs to score results against expectations rather than exact matches.
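To make that idea concrete, here is a minimal sketch of such a check, assuming the OpenAI Python client; the judge model, prompt wording, and `grade()` helper are illustrative, not taken from any of the articles below.

```python
# Minimal LLM-as-judge sketch: score an output against an expectation
# instead of exact string matching. Assumes the OpenAI Python client;
# the model name and prompt are placeholder choices.
from openai import OpenAI

client = OpenAI()

def grade(output: str, expectation: str) -> bool:
    """Ask a judge model whether `output` satisfies `expectation`."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Answer only PASS or FAIL."},
            {"role": "user",
             "content": f"Expectation: {expectation}\n\nOutput: {output}\n\n"
                        "Does the output satisfy the expectation?"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")

# The wording differs from the expectation, but the meaning matches,
# so an exact-match check would fail while the judge should pass it.
print(grade("The capital of France is Paris.",
            "Mentions that Paris is France's capital"))
```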
Analysis of the Claude Opus 4.5 release and the growing difficulty of evaluating incremental improvements between AI models.
A guide to building product evaluations for LLMs using three steps: labeling data, aligning evaluators, and running experiments.
A comprehensive overview of over 50 modern AI agent benchmarks, categorized into function calling, reasoning, coding, and computer interaction tasks.
Explores four main methods for evaluating Large Language Models (LLMs), including code examples for implementing each approach from scratch.
A guide to the four main methods for evaluating Large Language Models, including code examples and practical implementation details.
Explores challenges and methods for evaluating question-answering AI systems when processing long documents like technical manuals or novels.
Argues that effective AI product evaluation requires a scientific, process-driven approach, not just adding LLM-as-judge tools.
Introduces AlignEval, an app for building and automating LLM evaluators, making the process easier and more data-driven.
A technical guide on using Google's Vertex AI Gen AI Evaluation Service with Gemini to evaluate open LLM models like Llama 3.1.
The author judges a Weights & Biases hackathon focused on building LLM evaluation tools, discussing key considerations and project highlights.
A guide to evaluating Large Language Models (LLMs) using the Evaluation Harness framework and optimized serving tools like Hugging Face TGI and vLLM.
A survey of using LLMs as evaluators (LLM-as-Judge) for assessing AI model outputs, covering techniques, use cases, and critiques.
A guide to simplifying LLM evaluation workflows using clear metrics, chain-of-thought, and few-shot prompts, inspired by real-world examples.
A framework for building data flywheels to dynamically improve LLM applications through continuous evaluation, monitoring, and feedback loops.
Introduces MixEval, a cost-effective LLM benchmark with high correlation to Chatbot Arena, for evaluating open-source language models.
A developer compares 8 LLMs on a custom retrieval task over medical transcripts, analyzing performance on questions ranging from simple to complex.
A guide to effective and ineffective evaluation methods for LLMs on tasks like classification, summarization, and translation, including practical metrics.
A tutorial on evaluating Large Language Models using Hugging Face's Lighteval library on Amazon SageMaker, focusing on benchmarks like TruthfulQA.
A technical paper exploring the causes, measurement, and mitigation strategies for hallucinations in Large Language Models (LLMs).