Thomas Lumley

Thomas Lumley writes thoughtful, in-depth articles on statistics, data analysis, and statistical modeling. His blog explores topics like survey methods, regression, simulations, and inference with a rigorous yet reflective approach.

https://notstatschat.rbind.io

1/25/2026

statistics data analysis statistical modeling applied mathematics research methods

Articles from this Blog

215 articles from this blog

9/10/2019 • EN

(What’s up with the brackets?)

Explains why parentheses cause R code assignments to print their values, covering invisibility flags and the behavior of the `(` function.

programming assignment R

9/5/2019 • EN

A package for multiple-response data

Introducing the 'rimu' R package for manipulating and analyzing multiple-response data, with examples using ethnicity survey data.

Data Manipulation R R Package

7/16/2019 • EN

Adding new functions to the survey package

A technical guide on extending the R survey package to support instrumental variable regression (ivreg) with complex survey data.

R Statistical Computing R Package

6/26/2019 • EN

Denominator degrees of freedom in svyglm

A technical note on calculating denominator degrees of freedom in survey-weighted generalized linear models (svyglm) for complex sample designs.

Statistical Modeling Survey Analysis Variance Estimation

6/20/2019 • EN

Wald, score, LRT: the picture

Explains the relationship between Wald, score, and likelihood ratio tests in statistical modeling using visual diagrams and R code examples.

R Programming Hypothesis Testing Statistical Inference

6/16/2019 • EN

Analysing the mouse microbiome autism data

A statistical re-analysis of a published study on the mouse microbiome and autism, examining data and p-values from behavioral experiments.

data analysis statistics R

6/11/2019 • EN

Confidence intervals: not a very strong property

A discussion of the limitations of confidence intervals, using examples from small-area sampling to illustrate how weak a property they provide.

data analysis statistics sampling

5/24/2019 • EN

Mean People Tweet

Analyzing tweet sentiment towards public figures using R, word embeddings, and logistic regression models to measure online negativity.

twitter api Logistic Regression Natural Language Processing

4/30/2019 • EN

Local asymptotic minimax, and nearly-true models

Explores statistical efficiency of estimators in nearly-true regression models under two-phase sampling, focusing on local asymptotic minimax theory.

Regression Models Statistical Inference Asymptotic Theory

4/21/2019 • EN

Handling ‘plausible values’ in surveys

A technical guide on handling 'plausible values' in survey data analysis using R, including code for the survey package.

R Statistical Analysis Survey Data

4/19/2019 • EN

Progress on linear mixed models for surveys

Explores challenges in applying weighted penalized least squares to linear mixed models for survey data, highlighting estimation issues.

Statistical Modeling Survey Sampling Variance Estimation

3/4/2019 • EN

Normal horizontiles

A technical analysis verifying a statistical calculation from an XKCD comic, involving normal distribution probabilities and R code.

statistics Integration Probability

3/1/2019 • EN

Displaying bus punctuality

A technical analysis of bus punctuality using Auckland Transport API data, with R code for data processing and visualization.

api data analysis statistics

2/18/2019 • EN

Absolutely no warranty?

A comparison of warranty disclaimers in statistical software licenses, focusing on R, SAS, Stata, and SPSS, and their implications for users.

open source R Statistical Software

2/9/2019 • EN

What have I got against the Shapiro-Wilk test?

A critique of the Shapiro-Wilk normality test, arguing that it is often misused given the Central Limit Theorem and that normality is rarely the scientifically relevant question.

Statistical Testing Normal Distribution Shapiro Wilk Test

2/1/2019 • EN

Recognising when you don’t know

Explores the challenge of machine learning models recognizing 'unknown' inputs, using mushroom classification as an example.

Machine Learning classification Xgboost

1/26/2019 • EN

Two quick survey items

Explores optimal sampling design for logistic regression in case-control studies, analyzing Neyman allocation and two-phase sampling variances.

Logistic Regression Statistical Sampling Influence Functions

1/18/2019 • EN

Another way to see why mixed models in survey data are hard:

Explores the statistical challenges of applying linear mixed models to complex survey data with multi-stage sampling, focusing on weighting issues.

Linear Regression R Statistical Computing

1/11/2019 • EN

The Ihaka Lectures 3: Rise of the Machine Learners

Announcement of a lecture series on machine learning, covering topics such as Weka, deep learning, algorithmic fairness, and sparse supervised learning.

Machine Learning statistics Data Science