Data Engineering articles

4/8/2024 • EN

Reflecting on my tenure at the City of Boston

A data engineer reflects on their 2-year career journey at the City of Boston, sharing lessons learned in data warehousing, ETL, and civic tech.

analytics Civic Tech Data Engineering ETL Pipelines

Jenna Jordan

4/4/2024 • EN

A Deep Intro to Apache Iceberg and Resources for Learning More

An introduction to Apache Iceberg, a table format for data lakehouses, explaining its architecture and providing learning resources.

Apache Iceberg Big Data Data Engineering Data Lakehouse Table Format

Alex Merced

4/4/2024 • EN

Understanding the Future of Apache Iceberg Catalogs

Explores the evolution of Apache Iceberg catalogs, focusing on the current REST Catalog and future proposals for server-side optimizations.

Apache Iceberg Catalog Data Engineering Data Lakehouse REST API

Alex Merced

4/1/2024 • EN

End-to-End Basic Data Engineering Tutorial (Spark, Dremio, Superset)

A hands-on tutorial on building a data lakehouse pipeline using Spark, Dremio, and Superset to move and analyze data.

Apache Superset Data Engineering Data Lakehouse Dremio Spark

Alex Merced

3/12/2024 • EN

Exploring the Flink SQL Gateway REST API

A guide to using Apache Flink's SQL Gateway REST API for submitting and managing SQL jobs, including practical examples with Postman and HTTPie.

Apache Flink Data Engineering REST API SQL Gateway Stream Processing

Robin Moffatt

2/22/2024 • EN

Checkpoint Chronicle - February 2024

Monthly roundup of articles and resources on data streaming, covering Flink, Kafka, Debezium, and streaming SQL developments.

Apache Kafka Data Engineering Event Streaming Flink Stream Processing

Robin Moffatt

2/16/2024 • EN

Catalogs in Flink SQL—A Primer

Explains the role and types of catalogs in Apache Flink SQL, comparing them to traditional RDBMS systems and highlighting their importance in data management.

Apache Flink Catalog Data Engineering Flink SQL SQL DDL

Robin Moffatt

2/13/2024 • EN

Datacast Episode 132: Big Data Engineering, Data Culture from First Principles, and Reimagined Metadata with Suresh Srinivas

Interview with Suresh Srinivas on his career in big data, founding Hortonworks, scaling Uber's data platform, and leading the OpenMetadata project.

Apache Hadoop Big Data Data Engineering metadata OpenMetadata

James Le

1/19/2024 • EN

Open Lakehouse Engineering/Apache Iceberg Lakehouse Engineering - A Directory of Resources

A comprehensive directory of resources for learning about and building Open Lakehouses using Apache Iceberg, Nessie, and Dremio.

Apache Iceberg Data Engineering Data Lakehouse Dremio Open Standards

Alex Merced

1/8/2024 • EN

Nessie - An Alternative to Hive & JDBC for Self-Managed Apache Iceberg Catalogs

Introduces Nessie as a self-managed catalog alternative to Hive & JDBC for Apache Iceberg, addressing limitations and new features.

Apache Iceberg Data Catalog Data Engineering Metadata Management Nessie

Alex Merced

11/14/2023 • EN

Can Debezium Lose Events?

Explores whether Debezium can lose database change events, explaining its at-least-once semantics and operational pitfalls like log retention.

change data capture Data Engineering Debezium Event Streaming Transaction Log

Gunnar Morling

11/14/2023 • EN

Checkpoint Chronicle - November 2023

Monthly roundup of data streaming trends, featuring Apache Iceberg, Kafka Streams, Flink deployments, and streaming SQL insights.

Apache Flink Apache Iceberg Apache Kafka Data Engineering Stream Processing

Robin Moffatt

11/2/2023 • EN

CDC Use Cases: 7 Ways to Put CDC to Work

Explores seven practical use cases for Change Data Capture (CDC) in data engineering, including analytics, caches, and microservices.

change data capture Data Engineering Database Integration Debezium Real Time Data

Gunnar Morling

10/2/2023 • EN

Learning Apache Flink S01E02: What is Flink?

An introductory overview of Apache Flink, explaining its core concepts as a distributed stream processing framework, its history, and primary use cases.

Apache Flink Big Data Data Engineering distributed systems Stream Processing

Robin Moffatt

9/21/2023 • EN

An Itch That Just Has to Be Scratched… (Or, Why Am I Joining Decodable?)

The author explains their move to Decodable to dive deeper into stream processing and Apache Flink, and to work with experts in the field.

Apache Flink Apache Kafka Data Engineering Stream Processing streaming

Robin Moffatt

9/10/2023 • EN

TWIL: September 10, 2023

A weekly tech learning digest covering Microsoft Fabric, AI topics, computer vision, Azure AI Document Intelligence, embeddings, and vector search.

Azure AI computer vision Data Engineering ETL Microsoft Fabric

André Vala

8/13/2023 • EN

Analytical Data Warehouses - an introduction

An introduction to analytical data warehouses, explaining their purpose, differences from transactional databases, and their role in team-based analytics.

analytics Data Engineering Data Warehousing database dbt

Jenna Jordan