All About Parquet Part 06 - Encoding in Parquet | Optimizing for Storage
Explains encoding techniques in Parquet files, including dictionary, RLE, bit-packing, and delta encoding, to optimize storage and performance.
Alex Merced — Developer and technical writer sharing in-depth insights on data engineering, Apache Iceberg, data lakehouse architectures, Python tooling, and modern analytics platforms, with a strong focus on practical, hands-on learning.
Explores how metadata in Parquet files improves data efficiency and query performance, covering file, row group, and column-level metadata.
A practical guide to reading and writing Parquet files in Python using PyArrow and FastParquet libraries.
Explores why Parquet is the ideal columnar file format for optimizing storage and query performance in modern data lake and lakehouse architectures.
Final guide in a series covering performance tuning and best practices for optimizing Apache Parquet files in big data workflows.
Explores using GitHub Actions for software development CI/CD and advanced data engineering tasks like ETL pipelines and data orchestration.
Using GitHub Actions to trigger Airflow DAGs for orchestrating data pipelines across Spark, Dremio, and Snowflake.
A guide explaining dbt macros, their purpose, benefits, and how to use them to write reusable, standardized SQL code in data transformation projects.
Quarterly roundup of data lakehouse trends, table formats, and major industry news from Apache Iceberg to Delta Lake.
A tutorial on using PyArrow for data analytics in Python, covering core concepts, file I/O, and analytical operations.
A comprehensive guide to using Rust's built-in collection types, including vectors, arrays, hashmaps, and sets, with performance tips and examples.
Explains how to implement access control and security for Apache Iceberg tables at the file, engine, and catalog levels.
A guide to performing data operations using PySpark, Pandas, DuckDB, Polars, and DataFusion within a pre-configured Docker environment.
A comprehensive directory of Apache Iceberg resources, including tutorials, guides, and educational materials for data engineers and developers.
Explores how combining data lakehouse, virtualization, and mesh architectures with Dremio solves modern data scaling and silo challenges.
A comprehensive guide to building interactive data applications using the Streamlit framework, covering setup, visualization, ML integration, and deployment.
A comprehensive guide to Docker Compose, covering file structure, service configuration, networking, volumes, and best practices for multi-container applications.
A comprehensive guide to string handling in Rust, covering types, conversions, operations, and performance best practices.
An introductory guide to Rust, covering its key features like memory safety, ownership, and setup for developers new to the language.
A hands-on tutorial for building a Data Lakehouse on your laptop using Apache Iceberg, Spark, Nessie, Minio, and Dremio.