All About Parquet Part 10 - Performance Tuning and Best Practices with Parquet
Final guide in a series covering performance tuning and best practices for optimizing Apache Parquet files in big data workflows.
Explores why Parquet is the ideal columnar file format for optimizing storage and query performance in modern data lake and lakehouse architectures.
Explains how Apache Iceberg brings ACID transaction guarantees to data lakes, enabling reliable data operations on open storage.
Explains the data lakehouse architecture, its layers (storage, table format, catalog, processing), and its advantages over traditional data warehouses.
Explores Apache Iceberg's advanced partitioning features, including hidden partitioning and transformations, for optimizing query performance in data lakes.
Explains three key Apache Iceberg features for data engineers: hidden partitioning, partition evolution, and tool compatibility.
Introduces Project Nessie, a version control system for data lakes that brings Git-like operations to managing and tracking changes in data assets.
Explains the data lakehouse concept, Dremio's role as a platform, and Apache Iceberg's function as a table format for modern data architectures.
A guide to configuring Apache Spark for use with the Apache Iceberg table format, covering packages, flags, and programmatic setup.
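The Spark-with-Iceberg setup that guide covers can be sketched roughly as follows. This is a minimal illustration, not the article's exact configuration: the runtime package version, catalog name (`my_catalog`), and warehouse path are all assumptions chosen for the example.

```python
# Sketch: configuring a SparkSession for the Apache Iceberg table format.
# Package version, catalog name, and warehouse path below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-example")
    # Pull in the Iceberg Spark runtime jar (match the version to your Spark/Scala build).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    # Enable Iceberg's SQL extensions (e.g. MERGE INTO and CALL procedures).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog named "my_catalog" backed by a Hadoop warehouse path.
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hadoop")
    .config("spark.sql.catalog.my_catalog.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)
```

The same settings can instead be passed as `--packages` and `--conf` flags to `spark-submit` or `spark-sql`; the programmatic form shown here keeps the configuration alongside the application code.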