Statistics articles

6/10/2021 • EN

Causal inference 4: Causal Diagrams, Markov Factorization, Structural Equation Models

Explores the relationship between causal and statistical models, focusing on causal diagrams, Markov factorization, and structural equation models.

Causal Diagrams, Causal Inference, Markov Factorization, statistics, Structural Equation Models

Ferenc Huszár

5/2/2021 • EN

Generalisability, prediction, and causation

Explores the distinction between using regression models for causal inference versus predictive inference, and the role of generalizability in prediction.

Causal Inference, Data Science, Machine Learning, Predictive Modeling, statistics

Thomas Lumley

2/11/2021 • EN

Co-linearity

A statistical analysis of multicollinearity in regression models, discussing its impact on coefficient interpretation and prediction.

data analysis, Modeling, Multicollinearity, Regression, statistics

Thomas Lumley

11/5/2020 • EN

Neyman Allocation, only exact

Explains Neyman allocation for optimal stratified sampling and its exact integer solution, linking it to US Electoral College apportionment.

Allocation, Integer Programming, optimization, sampling, statistics

Thomas Lumley

9/20/2020 • EN

Simple Anomaly Detection Using Plain SQL

A guide to implementing a simple anomaly detection system using only SQL and basic statistics, aimed at developers.

Anomaly Detection, data analysis, sql, statistics, Z Score

Haki Benita

8/4/2020 • EN

Weights in statistics

Explains the three main types of statistical weights (precision, frequency, sampling), their uses, and the software documentation challenges they create.

data analysis, Software Documentation, statistics, Survey Sampling, Weighted Least Squares

Thomas Lumley

4/3/2020 • EN

New in the survey package

Overview of new features in version 4.0 of the R survey package, focusing on improved contrast estimation and replicate handling.

data analysis, R, Replicate Variance, statistics, Svycontrast

Thomas Lumley

3/27/2020 • EN

Changing strata mid-stream

Explores the statistical challenges and potential bias when adjusting stratification variables during multi-wave sampling for population estimation.

Population, Regression, sampling, statistics, Stratification

Thomas Lumley

10/31/2019 • EN

The secular Bayesian: Using belief distributions without really believing

A data scientist's journey from dogmatic Bayesianism to a pragmatic, 'secular' use of Bayesian tools without requiring belief in the model's literal existence.

Bayesian Inference, Data Science, Machine Learning, Modeling, statistics

Ferenc Huszár

10/1/2019 • EN

Some things I don’t like about the Oxford-Munich Code of Conduct

A critique of the Oxford-Munich Code of Conduct for Data Scientists, focusing on its technical recommendations on sampling and data retention.

Code Of Conduct, Data Science, Ethics, sampling, statistics

Thomas Lumley

8/11/2019 • EN

NumPy Exercises Part 1

Explains the theory behind linear regression models, a fundamental machine learning algorithm for predicting continuous numerical values.

Linear Regression, Machine Learning, Numpy, Python, statistics

Stern Semasuka

6/20/2019 • EN

Updating Statistics on Secondary Replicas of the Availability Groups

A technical guide exploring workarounds to update SQL Server statistics on secondary replicas in Availability Groups, including scripts and methods.

Availability Groups, Database Administration, Secondary Replicas, SQL Server, statistics

Niko Neugebauer

6/16/2019 • EN

Analysing the mouse microbiome autism data

A statistical re-analysis of a published study on the mouse microbiome and autism, examining data and p-values from behavioral experiments.

Autism Research, data analysis, Microbiome, R, statistics

Thomas Lumley

6/11/2019 • EN

Confidence intervals: not a very strong property

A statistical analysis discussing the limitations of confidence intervals, using examples from small-area sampling to illustrate their weak properties.

Bayesian Inference, Confidence Intervals, data analysis, sampling, statistics

Thomas Lumley

4/30/2019 • EN

What does a Data Scientist really do?

A data scientist clarifies common misconceptions about the field, explaining that machine learning is only a small part of the job and advanced degrees aren't always required.

Career, data analysis, Datascience, Machine Learning, statistics

Eugene Yan

3/4/2019 • EN

Normal horizontiles

A technical analysis verifying a statistical calculation from an XKCD comic, involving normal distribution probabilities and R code.

Integration, Normal Distribution, Probability, R Programming, statistics

Thomas Lumley

3/1/2019 • EN

Displaying bus punctuality

A technical analysis of bus punctuality using Auckland Transport API data, with R code for data processing and visualization.

api, data analysis, R, statistics, Visualization

Thomas Lumley

1/29/2019 • EN

Half a dozen frequentist and Bayesian ways to measure the difference in means in two groups

A guide to six statistical methods (frequentist and Bayesian) for comparing group means, with R and Stan code examples.

Bayesian Inference, data analysis, Frequentist Inference, R, statistics

Andrew Heiss

1/11/2019 • EN

The Ihaka Lectures 3: Rise of the Machine Learners

Announcement for a lecture series on machine learning, covering topics like Weka, deep learning, algorithmic fairness, and sparse supervised learning.

Algorithmic Fairness, Data Science, Machine Learning, statistics, Supervised Learning

Thomas Lumley