Data processing articles

10/13/2018 • EN

Approximate Distinct Count

Explains the APPROX_COUNT_DISTINCT function for faster, memory-efficient distinct counts in SQL, comparing it to exact COUNT(DISTINCT).

algorithm, Approximate Distinct Count, Big Data, data processing, Hyperloglog

Niko Neugebauer

10/9/2018 • EN

Processing complicated package outputs

A technical guide on extending the googleway package's google_distance() function in R to handle multiple inputs, cache API calls, and manage errors efficiently.

api, data processing, Function Optimization, Googleway, R

Zoe Locke

8/28/2018 • EN

Moving beyond pattern-based analysis: Additional applications of GeoPAT 2

Final post in the GeoPAT 2 series, exploring advanced pattern-based spatial analysis methods and integration into custom workflows.

data processing, Geospatial Analysis, R Programming, Software Tools, Spatial Patterns

Jakub Nowosad

6/18/2018 • EN

Parallel, Disk-Efficient .zip to .gz Conversion

A guide to converting many .zip files to .gz format in parallel using a command-line one-liner for efficient disk usage.

compression, data processing, Parallel Processing, Shell Scripting, unix

Randy Zwitch

2/23/2018 • EN

Getting Started With OmniSci, Part 2: Electricity Dataset

A technical tutorial on using Python and pandas to process electricity data and load it into OmniSci (formerly MapD) for dashboard creation.

data processing, data visualization, Omniscidb, Pandas, Python

Randy Zwitch

4/29/2017 • EN

Intro to R workshop

A summary of a two-day workshop introducing R programming, data processing, visualization, and spatial analysis for beginners in geography and GIS.

data processing, data visualization, R, Rstudio, Spatial Analysis

Jakub Nowosad

1/13/2017 • EN

Data Processing and Enrichment in Spark Streaming with Python and Kafka

A technical guide on using Spark Streaming with Python and Kafka to filter and enrich real-time Twitter data for copyright infringement detection.

data processing, docker, Kafka, Python, Spark Streaming

Robin Moffatt

1/12/2017 • EN

Getting Started with Spark Streaming, Python, and Kafka

A technical guide to building a streaming data pipeline using Apache Spark Streaming, Python, and Apache Kafka for real-time processing.

data processing, Kafka, Python, Spark Streaming, Stream Processing

Robin Moffatt

12/15/2016 • EN

ETL Offload with Spark and Amazon EMR - Part 1 - Introduction

Explores using Apache Spark on Amazon EMR to offload and improve ETL processes, comparing it to traditional Oracle-based solutions.

Amazon Emr, aws, data processing, Etl, Spark

Robin Moffatt

7/26/2016 • EN

Stream Analytics and Processing with Kafka and Oracle Stream Analytics

A technical guide on integrating Apache Kafka with Oracle Stream Analytics for real-time data processing and stream analytics.

Apache Kafka, data processing, Oracle Stream Analytics, Real Time Streaming, Stream Analytics

Robin Moffatt

2/15/2016 • EN

Building a Stupid Data Product, Part 1: The Data (Python)

A technical tutorial on building a data product using Python, Markov chains, and a dataset of science questions to generate random quiz questions.

Backend, data processing, github, Markov Chains, Python

Joel Grus

1/28/2016 • EN

A Million Text Files And A Single Laptop

A technical guide on processing millions of small text files using GNU Parallel and stream processing, without needing Hadoop or a database.

data processing, Gnu Parallel, Python, R, Stream Processing

Randy Zwitch

1/20/2015 • EN

Sessionizing Log Data Using data.table [Follow-up #2]

A follow-up article demonstrating a third method for sessionizing log data using R's data.table and magrittr packages.

data processing, Datatable, Log Analysis, R, Sessionization

Randy Zwitch

11/1/2014 • EN

The Stack Overflow Tag Engine – Part 1

A technical deep-dive into building a tag engine similar to Stack Overflow's, covering data processing, memory usage, and performance.

data processing, performance, Search Algorithm, Stackoverflow, Tag Engine

Matt Warren

12/7/2013 • EN

Lean, Mean Data Science Machine

A guide to using the Unix command-line for efficient data science workflows, including data processing, exploration, and modeling.

data processing, Data Science, Exploratory Data Analysis, Repl, Unix Command Line

Jeroen Janssens

11/3/2013 • EN

SQLite

A guide to using SQLite and Python's sqlite3 module to efficiently manage and query large datasets from text files.

data processing, database, performance, Python, sqlite

Sebastian Raschka

11/3/2013 • EN

SQLite

A technical guide on using SQLite and Python's sqlite3 module to efficiently manage and query large datasets, replacing slow text file processing.

data processing, database, performance, Python, sqlite

Sebastian Raschka

9/19/2013 • EN

7 Command-Line Tools for Data Science

A guide to seven essential command-line tools (jq, csvkit, Rio, etc.) for data scientists to obtain, scrub, explore, and model data.

api, Command Line Tools, data processing, Data Science, json

Jeroen Janssens

8/22/2013 • EN

Getting Started Using Hadoop, Part 4: Creating Tables With Hive

A tutorial on using Apache Hive to create tables and views from data loaded into a Hadoop cluster, continuing a multi-part series.

Big Data, data processing, Hadoop, Hive, sql

Randy Zwitch

4/18/2013 • EN

Getting Started Using Hadoop, Part 1: Intro

A practical guide introducing Hadoop's ecosystem and setting up a proof-of-concept cluster on Amazon EC2 using Cloudera for big data processing.

Amazon Ec2, Big Data, Cloudera, data processing, Hadoop

Randy Zwitch
