Just Beat the Data Out of It
Analyzing patterns in Bob Ross Twitch chat data using n-gram frequency, percentiles, and spikiness scores to identify event-driven viewer reactions.
A tutorial on analyzing Seattle's Pronto CycleShare data using Python, Pandas, and the PyData stack for data science.
A data-driven critique of a popular Kenyan tech blog, analyzing its content focus using R programming and text mining techniques.
DataKind Singapore's Project Accelerator connects volunteer data scientists with nonprofits to solve data challenges, like analyzing water consumption data.
A scientist explains why Python is their preferred language for machine learning and data analysis, arguing for productivity over language wars.
A technical guide on analyzing Adobe Analytics Clickstream Data Feed using R, covering file structure, data verification, and initial processing.
Analyzes the historical and technical reasons behind R's controversial 'stringsAsFactors' default, explaining its origins and the problems it causes.
RSiteCatalyst v1.4.4 release notes detail a major bug fix for sparse data errors and minor updates to authentication messaging.
Critique of using Shapiro-Wilk normality tests on large, complex survey data like NHANES, explaining why it's statistically inappropriate.
Explains how to use SQL window functions and percentiles in Postgres for more meaningful data analysis than simple averages.
A guide to getting started with Structural Equation Modeling (SEM) in R using the Lavaan package, based on a user group presentation.
Interview with data scientist Jeroen Janssens about his background, work on data science at the command line, and his Data Science Toolbox project.
A guide to visualizing and diagnosing Generalized Linear Mixed Models (GLMMs) in R, based on a presentation and blog post by Jaime Ashander.
A technical tutorial on using the ELK Stack (Elasticsearch, Logstash, Kibana) to import and analyze open CSV data from DonorsChoose.
The article debunks common misinterpretations of the Dunning-Kruger effect by analyzing the original study's data and findings.
A tutorial explaining the internals of Principal Component Analysis (PCA) for dimensionality reduction in machine learning and data analysis.
A technical tutorial on sessionizing log data using the dplyr package in R, comparing it to a previous SQL-based approach.
Release notes for RSiteCatalyst v1.4.1, detailing bug fixes and new API functions for Adobe Analytics reporting in R.
A technical guide to Dixon's Q test for identifying outliers in small datasets, including its method, application, and criticisms.
A follow-up analysis of U.S. federal .gov domains, tracking changes in technology, security, and accessibility over three years.