All About Parquet Part 03 - Parquet File Structure | Pages, Row Groups, and Columns
Explains the hierarchical structure of Parquet files, detailing how pages, row groups, and columns optimize storage and query performance.
Final guide in a series covering performance tuning and best practices for optimizing Apache Parquet files in big data workflows.
Explores why Parquet is the ideal columnar file format for optimizing storage and query performance in modern data lake and lakehouse architectures.
A practical guide to reading and writing Parquet files in Python using PyArrow and FastParquet libraries.
Explores how metadata in Parquet files improves data efficiency and query performance, covering file, row group, and column-level metadata.
Explores compression algorithms in Parquet files, comparing Snappy, Gzip, Brotli, Zstandard, and LZO for storage and performance.
Explains how Parquet handles schema evolution, including adding/removing columns and changing data types, for data engineers.
A technical guide on processing Overture Maps' global land cover dataset, focusing on extracting and analyzing Australia's data using DuckDB and QGIS.
A tutorial on using PyArrow for data analytics in Python, covering core concepts, file I/O, and analytical operations.
Exploring Japan's building footprint data from the Flateau project, which converts 3D CityGML data into 2D Parquet files for analysis.
A benchmark analysis of DuckDB's performance on a massive 1.1 billion row NYC taxi dataset, comparing it to other database technologies.
A no-code tutorial on converting XLS/CSV files to Parquet format using Dremio, including setup via Docker.
A guide to building a cost-effective, high-performance, and self-service data lakehouse architecture, addressing common pitfalls and outlining key principles.
A guide comparing popular data compression codecs (zstd, brotli, lz4, gzip, snappy) for Parquet files, explaining their trade-offs for big data.
A quick guide on using DuckDB's SQL commands to efficiently convert CSV files to the Parquet format, including schema modifications.
How to handle mismatched Parquet file schemas when querying multiple files in DuckDB using the UNION_BY_NAME option.
Introduction to Apache Drill, a SQL engine for querying diverse data sources like files (CSV, JSON) and databases.