How to Think Like a Data Engineer
A guide to the core principles and systems thinking required for data engineering, beyond just learning specific tools.
Explores the limitations of traditional pull queries in data systems and advocates for using materialized views and data duplication to improve performance.
A comprehensive guide to learning Apache Iceberg, data lakehouse architecture, and Agentic AI with curated tutorials, tools, and resources.
A technical guide on using Apache Iceberg with Apache Spark and Polaris for building and managing a data lakehouse, covering setup, operations, and optimization.
Overview of key proposals in Apache Iceberg v4, focusing on performance, metadata efficiency, and portability for modern data workloads.
A monthly roundup of 78 curated links on data engineering, architecture, AI, and tech trends, with top picks highlighted.
A monthly roundup of curated links and articles focused on data engineering, Apache Kafka, and data platform technologies.
A guide to scheduling compaction and snapshot expiration in Apache Iceberg tables based on workload patterns and infrastructure constraints.
A monthly roundup of data engineering links covering Apache Iceberg, Kafka, Debezium, Spark, and lakehouse architecture.
Explains how Apache Iceberg tables degrade without optimization, covering small files, fragmented manifests, and performance impacts.
Explains the importance of table maintenance in Apache Iceberg for data lakehouses, covering metadata and file management.
An analysis of DuckLake, a new open table format and catalog specification for data engineering, comparing it to existing solutions like Iceberg and Delta Lake.
A monthly roundup of curated links and articles covering data engineering, Kafka, stream processing, and AI, with top picks highlighted.
Explains streaming data fundamentals, how streaming systems work, their use cases, and challenges compared to batch processing.
Explains the data lakehouse architecture, a unified approach combining data lake scalability with warehouse management features like ACID transactions.
Explores the modern data stack, cloud platforms, and principles for building flexible, cloud-native data engineering architectures.
Explores how DevOps principles like CI/CD, infrastructure as code, and monitoring are applied to data engineering for reliable, scalable data pipelines.
Explains batch processing fundamentals for data engineering, covering concepts, tools, and its ongoing relevance in data workflows.
Explores core principles of scalable data engineering, including parallelism, minimizing data movement, and designing adaptable pipelines for growing data volumes.
Explains core data engineering concepts, comparing ETL and ELT data pipeline strategies and their use cases.