Statistics articles

12/1/2016 • EN

Linear regression in the wild

A data scientist shares a technical interview task on linear regression, covering data cleaning, model fitting, and assumption validation.

Data Science, Linear Regression, Python, Scikit Learn, statistics

Yoel Zeldes

9/27/2016 • EN

Large quadratic forms

Explores computational challenges of large quadratic forms in genomics, focusing on eigenvalue approximations for high-dimensional statistical tests like SKAT.

Eigenvalues, Genomics, linear algebra, Quadratic Forms, statistics

Thomas Lumley

9/9/2016 • EN

Who wants to be a coder

Analyzing the relationship between age and desired job roles among new coders using the 2016 Kaggle survey data.

data analysis, Kaggle, programming, Python, statistics

Yoel Zeldes

9/6/2016 • EN

On permuting all the things

Uses R code to generate the permutations of the digits 2, 2, 5, 5, 9, 9 and analyzes the resulting numbers for divisibility by 11 and primality.

combinatorics, prime numbers, Probability, R, statistics

Thomas Lumley

8/14/2016 • EN

Simulations and modes of convergence

Discusses why simulation summaries should focus on quantiles and robust statistics rather than moments when evaluating asymptotic approximations.

Asymptotics, Convergence, Maximum Likelihood, simulation, statistics

Thomas Lumley

7/28/2016 • EN

One scoRe years

The author reflects on R's rise in programming language rankings and its unexpected adoption across diverse fields over 20 years.

data analysis, programming languages, R, Software Rankings, statistics

Thomas Lumley

6/4/2016 • EN

Computing the (simplest) sandwich estimator incrementally

Explains how to compute the Huber/White sandwich estimator incrementally in R's biglm package for large-scale linear regression.

Incremental Computation, Linear Regression, R, Sandwich Estimator, statistics

Thomas Lumley

3/20/2016 • EN

The conservative Bonferroni correction

Explores the surprising effectiveness and conservative nature of the Bonferroni correction for multiple hypothesis testing, even with many tests.

Bonferroni Correction, Confidence Intervals, Multiple Testing, statistics, Type I Error

Thomas Lumley

3/15/2016 • EN

Data science intro for math/phys background

A guide for academics with math/physics backgrounds transitioning into data science, covering skills, learning paths, and practical advice.

Data Science, data visualization, Machine Learning, Python, statistics

Piotr Migdał

1/20/2016 • EN

Is it that time of day?

A data analysis of a radio station's song rotation patterns using vector math and statistical methods to test anecdotal claims about repetitive playtimes.

data analysis, data visualization, statistics, Time Series, Vector Analysis

Thomas Lumley

1/13/2016 • EN

What does ‘design-consistent’ even mean?

Explores the statistical concept of 'design consistency' in survey sampling, comparing it to model consistency and discussing asymptotic theory.

Asymptotics, Design Consistency, Estimation, Model Consistency, statistics

Thomas Lumley

12/14/2015 • EN

A simple probability problem

Analyzing a classic probability problem involving dice rolls, its historical context with Newton and Pepys, and the mathematical intuition behind it.

Binomial Distribution, data analysis, mathematics, Probability, statistics

Thomas Lumley

9/22/2015 • EN

NZ Flag Referendum pseudorandom numbers

Analyzes the pseudorandom number generator defined in NZ Flag Referendum law, comparing it to the Wichmann-Hill algorithm and noting a potential flaw.

algorithm, Legislation, Pseudorandom Number Generator, statistics, Wichmann Hill

Thomas Lumley

9/14/2015 • EN

Good reasons for assuming a spherical cow

Explores valid reasons for using simplified assumptions like 'spherical cows' in statistical modeling and theoretical work.

Assumptions, Computational Methods, Modeling, statistics, Theory

Thomas Lumley

8/29/2015 • EN

Net Reclassification Index: surprisingly weird.

A technical critique of the Net Reclassification Index (NRI), a statistical measure for evaluating prediction model improvements, highlighting its surprising biases.

Biostatistics, classification, Net Reclassification Index, prediction models, statistics

Thomas Lumley

6/20/2015 • EN

A much-needed gap

Critiques the use of Shapiro-Wilk normality tests on large, complex survey data such as NHANES, explaining why doing so is statistically inappropriate.

data analysis, Normality Testing, Sampling Methodology, Shapiro Wilk Test, statistics

Thomas Lumley

5/3/2015 • EN

What’s the right proof of the Continuous Mapping Theorem?

Explores different proofs of the Continuous Mapping Theorem in probability theory, discussing their merits and pedagogical value.

Asymptotics, Continuous Mapping Theorem, Convergence In Distribution, Probability Theory, statistics

Thomas Lumley

3/29/2015 • EN

Reading citations is easier than most people think

The article debunks common misinterpretations of the Dunning-Kruger effect by analyzing the original study's data and findings.

data analysis, research methodology, scientific studies, statistics

Dan Luu

3/7/2015 • EN

What does measurability mean?

A philosophical and technical exploration of the practical meaning of measurability in mathematical statistics, questioning its necessity for real-world data analysis.

Asymptotic Theory, Mathematical Proofs, Measurability, Probability Theory, statistics

Thomas Lumley