Simple Anomaly Detection Using Plain SQL
A guide to implementing a simple anomaly detection system using only SQL and basic statistics, aimed at developers.
Explores the Bayesian equivalent of a two-sample t-test, questioning traditional assumptions and proposing a model using discrete distributions.
Explains the three main types of statistical weights (precision, frequency, sampling), their uses, and the software documentation challenges they create.
A data scientist shares three essential pre-project tasks—the one-pager, time-box, and breakdown—to avoid common pitfalls and ensure project success.
Overview of new features in version 4.0 of the R survey package, focusing on improved contrast estimation and replicate handling.
Explains how to use Monte Carlo analysis for product development, using TweetDeck screen capacity as a practical example.
A review of the best #TidyTuesday data visualization submissions from 2019, highlighting creative and insightful uses of R and ggplot2.
A guide on using PowerShell and a matrix/spreadsheet approach to visualize and audit Active Directory group memberships for IT administration.
Tips for using Google BigQuery's public datasets while managing and minimizing query costs, including using the free tier and setting budgets.
A guide to common SQL mistakes and optimization opportunities for developers and data professionals, covering integer division, UNION vs UNION ALL, and query performance.
Compares the runtime performance of pandas' crosstab, groupby, and pivot_table methods for data aggregation.
A statistical re-analysis of a published study on the mouse microbiome and autism, examining data and p-values from behavioral experiments.
A statistical analysis discussing the limitations of confidence intervals, using examples from small-area sampling to illustrate their weak properties.
A technical walkthrough of creating a word cloud visualization from highly-gilded Reddit comments using Python, spaCy, and BigQuery.
A data scientist clarifies common misconceptions about the field, explaining that machine learning is only a small part of the job and advanced degrees aren't always required.
An analysis of user-created Sankey diagrams from Reddit, visualizing personal Tinder match data and dating outcomes.
A tutorial on creating line graphs in R using the ggplot2 package's geom_line function, with examples using the built-in Orange dataset.
Blog author offers free 45-minute one-on-one R training sessions to 10 people, focusing on data analysis, visualization, and package development.
A developer explores investigative journalism, drawing parallels between source control diffs and uncovering truth in legal documents and online comments.
A technical analysis of bus punctuality using Auckland Transport API data, with R code for data processing and visualization.