A bus-watching bot
A developer explains their Twitter bot that monitors and visualizes Auckland bus delays using GTFS data and R packages.
Thomas Lumley writes thoughtful, in-depth articles on statistics, data analysis, and statistical modeling. His blog explores topics like survey methods, regression, simulations, and inference with a rigorous yet reflective approach.
215 articles from this blog
A developer discusses the trade-offs between writing simple, clear code and optimizing for performance, using a real-world example of inefficient vector growth in R.
The author discusses a potential bug or design quirk in the 'leaps' R package related to how forward/backward selection interacts with its exhaustive search preprocessing.
Analyzes the Monty Hall problem, exploring learning strategies and optimal decisions based on observed game history and host behavior.
Critiques the classic iris dataset as a misleading example in modern machine learning education and explores its original scientific purpose.
Explores computational challenges of large quadratic forms in genomics, focusing on eigenvalue approximations for high-dimensional statistical tests like SKAT.
Uses R code to generate permutations of the digits (2,2,5,5,9,9) and analyze their divisibility by 11 and primality.
Explores Bayesian vs. Frequentist approaches to the multiple comparisons problem in statistical inference and data analysis.
Discusses why simulation summaries should focus on quantiles and robust statistics rather than moments when evaluating asymptotic approximations.
The author reflects on R's rise in programming language rankings and its unexpected adoption across diverse fields over 20 years.
Explores various mathematical proofs for the Central Limit Theorem, comparing approaches like characteristic functions, the Lindeberg trick, entropy, and moments.
Explains how to compute the Huber/White sandwich estimator incrementally in R's biglm package for large-scale linear regression.
Explores why modern neural networks succeed where older ones failed, emphasizing the critical role of massive computational power and data size.
Explores the surprising science behind cheap gas-sensitive resistors and their ability to detect molecules like acetone, bridging chemistry and electronics.
Explores the surprising effectiveness and conservative nature of the Bonferroni correction for multiple hypothesis testing, even with many tests.
Explores Hutchinson's randomized trace estimator for efficiently approximating the trace of large matrices, with practical improvements.
Explains linear splines, their mathematical basis, and two practical parametrizations for regression, comparing them to higher-degree splines.
Explains the Stochastic SVD algorithm, a probabilistic method for fast, approximate matrix decomposition using random projections.
A data analysis of a radio station's song rotation patterns using vector math and statistical methods to test anecdotal claims about repetitive playtimes.
A statistical analysis comparing large and small model estimators, focusing on efficiency and misspecification testing in regression contexts.