Microsoft's 2026 Global ML Building Footprints
Analysis of Microsoft's 2026 Global ML Building Footprints dataset, including technical setup and data exploration using DuckDB and QGIS.
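As a taste of the DuckDB side of that exploration, here is a minimal sketch; the file name `buildings.geojsonl` is a placeholder for a locally downloaded extract, and `ST_Read` relies on DuckDB's GDAL-backed spatial extension.

```python
import duckdb

# Minimal sketch of exploring a building-footprints extract with DuckDB.
# "buildings.geojsonl" is a placeholder path; point it at whatever local
# extract of the dataset you have.
con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")

# ST_Read uses GDAL under the hood, so it can ingest GeoJSON/GeoJSONSeq,
# GeoPackage, Shapefile, etc. into a table with a GEOMETRY column.
con.execute(
    "CREATE TABLE buildings AS SELECT * FROM ST_Read('buildings.geojsonl')"
)

# Basic profiling: footprint count and average polygon area
# (in the dataset's native CRS units).
print(con.execute("SELECT count(*) FROM buildings").fetchone())
print(con.execute("SELECT avg(ST_Area(geom)) FROM buildings").fetchone())
```

The same spatial extension can export the resulting table to a GDAL format such as GeoPackage, which makes a convenient hand-off to QGIS.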
A beginner-friendly introduction to PySpark, the Python API for Apache Spark, covering the fundamentals of big data processing.
Announces 9 new free and paid books added to the Big Book of R collection, covering data science, visualization, and package development.
A comprehensive 2025 guide to Apache Iceberg, covering its architecture, ecosystem, and practical use for data lakehouse management.
An introduction to Apache Parquet, a columnar storage file format for efficient data processing and analytics.
Explains the hierarchical structure of Parquet files, detailing how pages, row groups, and columns optimize storage and query performance.
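To make that hierarchy concrete, the following sketch (assuming pyarrow is installed; `example.parquet` is a placeholder file name) prints the row-group and column-chunk layout of a file.

```python
import pyarrow.parquet as pq

# Sketch: inspect the physical layout of a Parquet file.
# "example.parquet" is a placeholder for any local file.
meta = pq.ParquetFile("example.parquet").metadata

print(f"rows: {meta.num_rows}, "
      f"row groups: {meta.num_row_groups}, "
      f"columns: {meta.num_columns}")

# Each row group holds one column chunk per column; pages live inside
# the chunks and are what readers actually decompress.
for rg in range(meta.num_row_groups):
    group = meta.row_group(rg)
    for col in range(group.num_columns):
        chunk = group.column(col)
        print(rg, chunk.path_in_schema, chunk.compression,
              chunk.total_compressed_size, chunk.total_uncompressed_size)
```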
Explores why Parquet is the ideal columnar file format for optimizing storage and query performance in modern data lake and lakehouse architectures.
Final guide in a series covering performance tuning and best practices for optimizing Apache Parquet files in big data workflows.
An introduction to Apache Iceberg, a table format for data lakehouses, explaining its architecture and providing learning resources.
Interview with Suresh Srinivas on his career in big data, founding Hortonworks, scaling Uber's data platform, and leading the OpenMetadata project.
Compares partitioning techniques in Apache Hive and Apache Iceberg, highlighting Iceberg's advantages for query performance and data management.
A Java programming challenge to process one billion rows of temperature data, focusing on performance optimization and modern Java features.
A guide to configuring Apache Spark for use with the Apache Iceberg table format, covering packages, flags, and programmatic setup.
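As a rough illustration of that setup, the sketch below builds a SparkSession with Iceberg enabled; the runtime JAR coordinates, the catalog name `local`, and the warehouse path are illustrative assumptions to be matched to your own Spark/Scala versions and storage layout.

```python
from pyspark.sql import SparkSession

# Minimal programmatic setup sketch for Spark + Iceberg.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Pull the Iceberg Spark runtime (coordinates/version are illustrative).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    # Enable Iceberg's SQL extensions (MERGE INTO, CALL procedures, etc.).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a Hadoop-type catalog named "local" backed by a warehouse path.
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql(
    "CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, ts TIMESTAMP) USING iceberg"
)
```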
Argues that raw data is overvalued without proper context and conversion into meaningful information and knowledge.
Explains the APPROX_COUNT_DISTINCT function for faster, memory-efficient distinct counts in SQL, comparing it to exact COUNT(DISTINCT).
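A quick way to see the trade-off is to run both aggregates over the same synthetic data; the sketch below uses DuckDB from Python, where the function is spelled `approx_count_distinct` (other engines expose it as `APPROX_COUNT_DISTINCT`).

```python
import duckdb

# Sketch comparing an exact distinct count with the approximate
# (HyperLogLog-style) version on synthetic data.
con = duckdb.connect()
con.execute("""
    CREATE TABLE events AS
    SELECT (random() * 1000000)::BIGINT AS user_id
    FROM range(10000000)
""")

exact = con.execute(
    "SELECT count(DISTINCT user_id) FROM events").fetchone()[0]
approx = con.execute(
    "SELECT approx_count_distinct(user_id) FROM events").fetchone()[0]

# Relative error is typically small, while the approximate version
# avoids the memory cost of tracking every distinct value exactly.
print(exact, approx, abs(exact - approx) / exact)
```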
A method for faster generalized linear models on large datasets using a single database query and one Newton-Raphson iteration.
A personal reflection on the trade-offs between convenience and privacy in an era of AI, IoT, and pervasive data collection.
Explains improvements in joblib's compressed persistence for Python, focusing on reduced memory usage and single-file storage for large numpy arrays.
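A minimal sketch of that usage, assuming joblib and numpy are installed; the filename and zlib compression level are arbitrary illustrative choices.

```python
import numpy as np
import joblib

# Sketch of joblib's compressed, single-file persistence for objects that
# contain large numpy arrays.
model_like = {"weights": np.random.rand(1000, 1000), "bias": np.zeros(1000)}

# One file, compressed with zlib at level 3; arrays are written in chunks
# rather than duplicated wholesale in memory.
joblib.dump(model_like, "model.joblib", compress=("zlib", 3))

restored = joblib.load("model.joblib")
print(restored["weights"].shape)
```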
Technical guide on building a real-time Twitter sentiment analysis system using Apache Kafka and Storm.
Explains Lambda Architecture for Big Data, combining batch processing (Hadoop) and real-time stream processing (Spark, Storm) to handle large datasets.