What happens if AI labs train for pelicans riding bicycles?
A humorous look at AI model benchmarking using the challenge of generating an SVG of a pelican riding a bicycle, and the risks of labs 'gaming' the test.
A developer shares insights and practical tips from a week of experimenting with local LLMs, including model recommendations and iterative improvement patterns.
A technical guide on fine-tuning and evaluating open-source Large Language Models (LLMs) using Amazon SageMaker and Hugging Face libraries.
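For flavor, a minimal sketch of what launching such a fine-tuning job can look like with the SageMaker Hugging Face estimator; the entry point, instance type, framework versions, model ID, and S3 path below are illustrative placeholders, not values from the guide.

```python
# Sketch: launching a Hugging Face fine-tuning job on SageMaker.
# All names (train.py, versions, model ID, bucket path) are illustrative placeholders.
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

estimator = HuggingFace(
    entry_point="train.py",        # your fine-tuning script
    source_dir="./scripts",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={
        "model_id": "tiiuae/falcon-7b",  # placeholder model
        "epochs": 3,
        "per_device_train_batch_size": 4,
    },
)

# Start training against a dataset previously uploaded to S3 (placeholder path).
estimator.fit({"training": "s3://my-bucket/llm-finetune/train"})
```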
A guide to creating confidence intervals for evaluating machine learning models, covering multiple methods to quantify performance uncertainty.
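As a concrete illustration of one such method (not code from the guide), here is a short bootstrap sketch that resamples a held-out test set to estimate a 95% confidence interval around accuracy; the labels and predictions are synthetic placeholders.

```python
# Sketch: 95% bootstrap confidence interval for test-set accuracy.
# y_true / y_pred stand in for a model's held-out labels and predictions.
import numpy as np

rng = np.random.default_rng(seed=0)
y_true = rng.integers(0, 2, size=500)                           # dummy binary labels
y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)   # ~85% accurate predictions

n_bootstrap = 2000
accuracies = []
for _ in range(n_bootstrap):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
    accuracies.append(np.mean(y_true[idx] == y_pred[idx]))

lower, upper = np.percentile(accuracies, [2.5, 97.5])
print(f"accuracy = {np.mean(y_true == y_pred):.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```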
Explains the difference between the .update() and .forward() methods in the TorchMetrics library for evaluating PyTorch models.
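A minimal sketch of that distinction, assuming a standard TorchMetrics classification metric: .update() only accumulates state and returns nothing, while calling the metric object (which runs .forward()) both accumulates state and returns the value for the current batch.

```python
# Sketch: .update() accumulates state silently; calling the metric
# (which invokes .forward()) updates state AND returns the batch value.
import torch
import torchmetrics

metric = torchmetrics.Accuracy(task="multiclass", num_classes=3)

preds = torch.tensor([0, 2, 1, 2])
target = torch.tensor([0, 1, 1, 2])

metric.update(preds, target)       # accumulates internal state, returns None
batch_acc = metric(preds, target)  # forward(): accumulates AND returns batch accuracy
print(batch_acc)                   # value for this batch only

print(metric.compute())            # value over everything accumulated so far
metric.reset()                     # clear state, e.g. between epochs
```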
Final part of a series proposing a research agenda for ML monitoring, focusing on data management challenges such as metric computation and real-time service-level indicator (SLI) tracking.
A technical exploration of Mean Squared Error, breaking it down into bias and variance terms to understand model performance and the role of irreducible uncertainty.
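For reference, the textbook bias-variance decomposition of expected squared error, assuming targets y = f(x) + ε with zero-mean noise of variance σ²; the σ² term is the irreducible uncertainty the article refers to.

```latex
% Bias-variance decomposition of expected squared error at a point x,
% for y = f(x) + \varepsilon with E[\varepsilon] = 0 and Var(\varepsilon) = \sigma^2,
% where \hat{f} is fit on a randomly drawn training set.
\[
\mathbb{E}\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible uncertainty}}
\]
```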
A guide to model evaluation, selection, and algorithm comparison in machine learning to ensure models generalize well to new data.
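As a rough sketch of what algorithm comparison looks like in practice (illustrative scikit-learn code, not the guide's own), two candidate models compared with 5-fold cross-validation on a toy dataset:

```python
# Sketch: comparing two candidate algorithms with 5-fold cross-validation.
# The dataset and models are illustrative placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```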