Hidden Pitfalls — Compaction and Partition Evolution in Apache Iceberg
Explores challenges and best practices for managing partition evolution and compaction in Apache Iceberg to maintain query performance.
Alex Merced — Developer and technical writer sharing in-depth insights on data engineering, Apache Iceberg, data lakehouse architectures, Python tooling, and modern analytics platforms, with a strong focus on practical, hands-on learning.
418 articles from this blog
Explains how to use Apache Iceberg's metadata tables to dynamically trigger data compaction based on file size, manifest health, and snapshot patterns.
A guide to scheduling compaction and snapshot expiration in Apache Iceberg tables based on workload patterns and infrastructure constraints.
Explains how to manage Apache Iceberg table metadata by expiring old snapshots and rewriting manifests to prevent performance and cost issues.
Explains how to use sorting and Z-order clustering in Apache Iceberg tables to optimize query performance and data layout.
Explains techniques for incremental, non-disruptive compaction in Apache Iceberg tables under continuous streaming data ingestion.
Explains data compaction using bin packing in Apache Iceberg to merge small files, improve query performance, and reduce metadata overhead.
Explains how Apache Iceberg tables degrade without optimization, covering small files, fragmented manifests, and performance impacts.
A guide on how to find, join, and organize community meetups focused on Apache Iceberg and modern data lakehouse architectures.
Explains batch processing fundamentals for data engineering, covering concepts, tools, and its ongoing relevance in data workflows.
Explains data lakes, their key characteristics, and how they differ from data warehouses in modern data architecture.
An introduction to data engineering concepts, focusing on data sources and ingestion strategies like batch vs. streaming.
Explains streaming data fundamentals, how streaming systems work, their use cases, and challenges compared to batch processing.
Explains the importance of data storage formats and compression for performance and cost in large-scale data engineering systems.
Explains core data engineering concepts, comparing ETL and ELT data pipeline strategies and their use cases.
Explains core data engineering concepts: metadata, data lineage, and governance, and their importance for scalable, compliant data systems.
An introduction to data warehousing concepts, covering architecture, components, and performance optimization for analytical workloads.
Explores the importance of data quality and validation in data engineering, covering key dimensions and tools for reliable pipelines.
An introduction to data modeling concepts, covering OLTP vs OLAP systems, normalization, and common schema designs for data engineering.
An introductory guide to data engineering, explaining its role, key concepts, and how it differs from data science in the modern data ecosystem.