Data Engineering articles

1/6/2025 • EN

RAG Isn’t a Modeling Problem. It’s a Data Engineering Problem.

Argues that RAG system failures stem from data engineering issues like fragmented data and governance, not from model or vector database choices.

Data Engineering Hybrid Search latency RAG Vector Databases

Alex Merced

12/19/2024 • EN

Overture Maps' Refreshed Global Geospatial Datasets

Overview of Overture Maps Foundation's updated global, open geospatial datasets, their partners, and data refresh strategy.

cloud storage Data Engineering Geospatial Data Open Data Git

Mark Litwintschik

12/19/2024 • EN

Checkpoint Chronicle - December 2024

Monthly roundup of news and resources in data streaming, stream processing, and the Apache Kafka ecosystem, curated by industry experts.

Apache Flink Apache Kafka Data Engineering Event Streaming Stream Processing

Robin Moffatt

12/11/2024 • EN

Exploring Flink CDC

An overview of Apache Flink CDC, its declarative pipeline feature, and how it simplifies data integration from databases like MySQL to sinks like Elasticsearch.

Apache Flink change data capture Data Engineering Flink CDC SQL

Robin Moffatt

11/4/2024 • EN

dbt Community Spotlight

A profile of a Senior Analytics Engineer specializing in dbt, data mesh architecture, and applying library science principles to modern data teams.

Analytics Engineering Data Engineering Data Governance Data Mesh dbt

Jenna Jordan

10/30/2024 • EN

Checkpoint Chronicle - October 2024

Monthly roundup of news and developments in data streaming, stream processing, and the data ecosystem, featuring Apache Flink, Kafka, and open-source tools.

Apache Flink Data Engineering Event Streaming Stream Processing Streaming SQL

Robin Moffatt

10/21/2024 • EN

All About Parquet Part 01 - An Introduction

An introduction to Apache Parquet, a columnar storage file format for efficient data processing and analytics.

Big Data Columnar Storage Data Engineering Data Format Parquet

Alex Merced

10/21/2024 • EN

All About Parquet Part 03 - Parquet File Structure | Pages, Row Groups, and Columns

Explains the hierarchical structure of Parquet files, detailing how pages, row groups, and columns optimize storage and query performance.

Big Data Columnar Storage Data Engineering File Format Parquet

Alex Merced

10/21/2024 • EN

All About Parquet Part 04 - Schema Evolution in Parquet

Explains how Parquet handles schema evolution, including adding/removing columns and changing data types, for data engineers.

Data Engineering Data Management File Format Parquet Schema Evolution

Alex Merced

10/21/2024 • EN

All About Parquet Part 08 - Reading and Writing Parquet Files in Python

A practical guide to reading and writing Parquet files in Python using PyArrow and FastParquet libraries.

Data Engineering Fastparquet Parquet Pyarrow Python

Alex Merced

10/19/2024 • EN

A Deep Dive Into GitHub Actions From Software Development to Data Engineering

Explores using GitHub Actions for software development CI/CD and advanced data engineering tasks like ETL pipelines and data orchestration.

automation CI/CD Data Engineering DevOps GitHub Actions

Alex Merced

10/18/2024 • EN

CDC Is a Feature Not a Product

A former Debezium lead argues that Change Data Capture (CDC) is a feature within larger data platforms, not a standalone product.

change data capture Data Engineering Debezium Kafka Real Time Data

Gunnar Morling

10/15/2024 • EN

Why Do I Need CDC?

Explores the core reasons for using Change Data Capture (CDC) to extract data from operational databases for analytics and other applications.

change data capture Data Engineering Data Integration database OLTP

Robin Moffatt

10/5/2024 • EN

Ultimate Directory of Apache Iceberg Resources

A comprehensive directory of Apache Iceberg resources, including tutorials, guides, and educational materials for data engineers and developers.

Apache Iceberg Data Engineering Data Lakehouse metadata Table Format

Alex Merced

8/27/2024 • EN

Adventures with Apache Flink and Delta Lake

A technical guide on configuring Apache Flink to write data to Delta Lake tables stored on S3, including required JARs and configuration steps.

Apache Flink Apache Spark Data Engineering Delta Lake Open Table Format

Robin Moffatt

8/26/2024 • EN

Data Engineering Duke Fall 2023-2024

Overview of a university-level Data Engineering course syllabus covering tools, pipelines, AI pair programming, and project-based learning for Fall 2024.

AI Pair Programming Cloud Platforms Data Engineering Data Pipelines Syllabus

Noah Gift