All About Parquet Part 03 - Parquet File Structure | Pages, Row Groups, and Columns
Explains the hierarchical structure of Parquet files, detailing how pages, row groups, and columns optimize storage and query performance.
Explains how Parquet handles schema evolution, including adding/removing columns and changing data types, for data engineers.
A practical guide to reading and writing Parquet files in Python using PyArrow and FastParquet libraries.
Explores using GitHub Actions for software development CI/CD and advanced data engineering tasks like ETL pipelines and data orchestration.
A former Debezium lead argues that Change Data Capture (CDC) is a feature within larger data platforms, not a standalone product.
Explores the core reasons for using Change Data Capture (CDC) to extract data from operational databases for analytics and other applications.
A comprehensive directory of Apache Iceberg resources, including tutorials, guides, and educational materials for data engineers and developers.
A technical guide on configuring Apache Flink to write data to Delta Lake tables stored on S3, including required JARs and configuration steps.
Overview of a university-level Data Engineering course syllabus covering tools, pipelines, AI pair programming, and project-based learning for Fall 2024.
A list of upcoming tech talks and events by Alex Merced, focusing on Apache Iceberg, data lakehouses, and data engineering topics.
A video course covering the fundamentals of lakehouse engineering using Apache Iceberg, Nessie, and Dremio for data management.
A data professional shares their curated list of data tech blogs and explains their return to using RSS feeds to stay current in the field.
Explains three key Apache Iceberg features for data engineers: hidden partitioning, partition evolution, and tool compatibility.
A data engineer reflects on their 2-year career journey at the City of Boston, sharing lessons learned in data warehousing, ETL, and civic tech.
An introduction to Apache Iceberg, a table format for data lakehouses, explaining its architecture and providing learning resources.
Explores the evolution of Apache Iceberg catalogs, focusing on the current REST Catalog and future proposals for server-side optimizations.
A hands-on tutorial on building a data lakehouse pipeline using Spark, Dremio, and Superset to move and analyze data.
A guide to using Apache Flink's SQL Gateway REST API for submitting and managing SQL jobs, including practical examples with Postman and HTTPie.
Monthly roundup of articles and resources on data streaming, covering Flink, Kafka, Debezium, and streaming SQL developments.
Explains the role and types of catalogs in Apache Flink SQL, comparing them to traditional RDBMS systems and highlighting their importance in data management.