Do predictive models need to be causal?
Explores whether predictive statistical models require causal relationships to be useful, using examples from data sampling and real-world scenarios.
Thomas Lumley writes thoughtful, in-depth articles on statistics, data analysis, and statistical modeling. His blog explores topics like survey methods, regression, simulations, and inference with a rigorous yet reflective approach.
215 articles from this blog
A technical discussion comparing two classes of multiparameter tests in survey statistics, focusing on the Rao-Scott tests and intrinsically-weighted tests for regression models.
Explores a robust location estimator (Tukey's shorth) through simulation, examining its asymptotic normality and efficiency compared to the mean and median.
Discusses handling class imbalance in predictive modeling, using medical and zebra analogies to explain adjusting for prior probabilities and error costs.
Explains that svyglm uses robust standard errors, detailing the statistical theory and variance estimation for survey data.
A lecture on the foundational statistical concept of orderings and ordinal data, exploring their analysis and complications in fields like health research.
Explores the limitations of using large language models as substitutes for human opinion polling, highlighting issues of representation and demographic weighting.
Explains why AIC comparisons between discrete and continuous statistical models are invalid, using examples with binomial and Normal distributions.
Explains the statistical concept of included-variable bias in regression models, challenging the traditional 'omitted-variable bias' framing.
A technical explanation of the Two-Stage Least Squares (2SLS) method for causal inference in regression, covering its derivation and variance estimation.
A technical analysis in R that classifies iris images from a dataset using PCA and LDA.
Explores techniques for generating identical random number streams across different statistical models, focusing on coupling simulations for Bayesian adaptive trials.
Explores the challenges of analyzing ordinal data, focusing on transformation invariance and the limitations of statistical comparisons.
Analyzes four datasets with high collinearity between predictors, demonstrating statistical diagnostics and modeling approaches using R.
A statistical analysis of a classic 1986 dataset, demonstrating how modern displays make hidden structures visible without complex methods.
Compares Satterthwaite, Liu, and leading-term approximations for tail probabilities of weighted sums of chi-squared variables in high-dimensional genomic data.
A technical analysis using R and the DHBins package to visualize New Zealand's National Land Transport Plan expenditure data via hexmaps.
Announcing the 2024 Ihaka Lectures series, featuring talks on literate programming, data journalism, and using R in government.
Explores missing likelihood-ratio tests in survey regression models, comparing Wald, score, and Rao-Scott tests with sample vs. population scaling.
Explores challenges and algorithms for weighted sampling without replacement in R, focusing on achieving specified marginal probabilities.