You Gotta Push If You Wanna Pull
Explores the limitations of traditional pull queries in data systems and advocates for using materialized views and data duplication to improve performance.
A comprehensive guide to learning Apache Iceberg, data lakehouse architecture, and Agentic AI with curated tutorials, tools, and resources.
A technical guide on using Apache Iceberg with Apache Spark and Polaris for building and managing a data lakehouse, covering setup, operations, and optimization.
Overview of key proposals in Apache Iceberg v4, focusing on performance, metadata efficiency, and portability for modern data workloads.
A guide to scheduling compaction and snapshot expiration in Apache Iceberg tables based on workload patterns and infrastructure constraints.
Explains how Apache Iceberg tables degrade without optimization, covering small files, fragmented manifests, and performance impacts.
Explores core principles of scalable data engineering, including parallelism, minimizing data movement, and designing adaptable pipelines for growing data volumes.
Explores workflow orchestration in data engineering, covering DAGs, tools, and best practices for managing complex data pipelines.
Explains the importance of data storage formats and compression for performance and cost in large-scale data engineering systems.
Explores how DevOps principles like CI/CD, infrastructure as code, and monitoring are applied to data engineering for reliable, scalable data pipelines.
An introductory guide to data engineering, explaining its role, key concepts, and how it differs from data science in the modern data ecosystem.
Explains batch processing fundamentals for data engineering, covering concepts, tools, and its ongoing relevance in data workflows.
An introduction to data modeling concepts, covering OLTP vs OLAP systems, normalization, and common schema designs for data engineering.
Explains streaming data fundamentals, how streaming systems work, their use cases, and challenges compared to batch processing.
Explains core data engineering concepts, comparing ETL and ELT data pipeline strategies and their use cases.
An introduction to data engineering concepts, focusing on data sources and ingestion strategies like batch vs. streaming.
Explains data lakes, their key characteristics, and how they differ from data warehouses in modern data architecture.
Explains core data engineering concepts: metadata, data lineage, and governance, and their importance for scalable, compliant data systems.
Explores the importance of data quality and validation in data engineering, covering key dimensions and tools for reliable pipelines.
An introduction to data warehousing concepts, covering architecture, components, and performance optimization for analytical workloads.