Maintaining Apache Iceberg Tables: Compaction, Expiry, and Cleanup
Guide to maintaining Apache Iceberg tables with compaction, snapshot expiry, orphan cleanup, and manifest rewriting for optimal performance.
Alex Merced — Developer and technical writer sharing in-depth insights on data engineering, Apache Iceberg, data lakehouse architectures, Python tooling, and modern analytics platforms, with a strong focus on practical, hands-on learning.
501 articles from this blog
Explains row vs column storage layouts in databases, their I/O tradeoffs, compression benefits, and when to use each for query engines.
Explains how Apache Iceberg uses metadata for data skipping, enabling fast query performance by eliminating 90-99% of files before scanning.
Explores embedding Iceberg catalogs directly into storage, covering AWS S3 Tables and MinIO AIStor for simplified metadata management.
Explains concurrency control in databases, covering 2PL, MVCC, isolation levels, and OCC for handling simultaneous reads/writes.
Explains how Apache Iceberg enables partition evolution without rewriting data, solving a major data lake challenge.
Guide to using Apache Iceberg with Python libraries (PyIceberg, DuckDB, Polars) and MPP query engines like Dremio, Spark, and Trino.
Explains why table formats like Apache Iceberg and Delta Lake are essential for reliable data lakes, solving atomic commits, schema evolution, and time travel.
Explains how Apache Iceberg table writes work, including commit steps and ACID guarantees on object storage.
Explores three streaming architectures for Apache Iceberg: Spark Structured Streaming, Flink, and Kafka Connect, focusing on trade-offs between latency and table maintenance.
A technical deep dive comparing metadata structures of modern table formats like Apache Iceberg, Delta Lake, and Hudi for data lakes.
Explores query execution models: Volcano (row-at-a-time), vectorized (batch processing), and compiled code generation for CPU efficiency.
Explains Apache Iceberg metadata tables for querying table internals using SQL, covering snapshots, files, manifests, partitions, and practical use cases.
Explains lakehouse catalogs in Apache Iceberg, their role in metadata management, and how to choose between open source and managed options.
Strategies for migrating data to Apache Iceberg, including in-place, full rewrite, and shadow migration with zero downtime.
Explains distributed join strategies: shuffle, broadcast, and co-located joins, focusing on network costs in query engines.
Explains how Apache Arrow eliminates the serialization tax by providing a standardized in-memory columnar format for fast data movement.
Explains Apache Parquet's columnar architecture, dictionary encoding, and performance benefits for data analytics.
Apache Polaris is an open-source catalog service that unifies the Iceberg ecosystem by implementing the Iceberg REST API for vendor-neutral lakehouse metadata management.
Explains Apache Iceberg, a table format that replaces directory-based metadata with file-level tracking for scalable analytics on cloud storage.