Approximate Distinct Count
Explains the APPROX_COUNT_DISTINCT function for faster, memory-efficient distinct counts in SQL, comparing it to exact COUNT(DISTINCT).
A technical guide on extending the googleway package's google_distance() function in R to handle multiple inputs, cache API calls, and manage errors efficiently.
Final post in the GeoPAT 2 series, exploring advanced pattern-based spatial analysis methods and integration into custom workflows.
A guide to converting many .zip files to .gz format in parallel using a command-line one-liner for efficient disk usage.
A technical tutorial on using Python and pandas to process electricity data and load it into OmniSci (formerly MapD) for dashboard creation.
A summary of a two-day workshop introducing R programming, data processing, visualization, and spatial analysis for beginners in geography and GIS.
A technical guide on using Spark Streaming with Python and Kafka to filter and enrich real-time Twitter data for copyright infringement detection.
A technical guide to building a streaming data pipeline using Apache Spark Streaming, Python, and Apache Kafka for real-time processing.
Explores using Apache Spark on Amazon EMR to offload and improve ETL processes, comparing it to traditional Oracle-based solutions.
A technical guide on integrating Apache Kafka with Oracle Stream Analytics for real-time data processing and stream analytics.
A technical tutorial on building a data product using Python, Markov chains, and a dataset of science questions to generate random quiz questions.
A technical guide on processing millions of small text files using GNU Parallel and stream processing, without needing Hadoop or a database.
A follow-up article demonstrating a third method for sessionizing log data using R's data.table and magrittr packages.
A technical deep-dive into building a tag engine similar to Stack Overflow's, covering data processing, memory usage, and performance.
A guide to using the Unix command-line for efficient data science workflows, including data processing, exploration, and modeling.
A guide to using SQLite and Python's sqlite3 module to efficiently manage and query large datasets from text files.
A technical guide on using SQLite and Python's sqlite3 module to efficiently manage and query large datasets, replacing slow text file processing.
A guide to seven essential command-line tools (jq, csvkit, Rio, etc.) for data scientists to obtain, scrub, explore, and model data.
A tutorial on using Apache Hive to create tables and views from data loaded into a Hadoop cluster, continuing a multi-part series.
A practical guide introducing Hadoop's ecosystem and setting up a proof-of-concept cluster on Amazon EC2 using Cloudera for big data processing.