All About Parquet Part 08 - Reading and Writing Parquet Files in Python
A practical guide to reading and writing Parquet files in Python using the PyArrow and fastparquet libraries.
An introduction to Apache Parquet, a columnar storage file format for efficient data processing and analytics.
Explains the hierarchical structure of Parquet files, detailing how pages, row groups, and columns optimize storage and query performance.
Explains how Parquet handles schema evolution, including adding/removing columns and changing data types, for data engineers.
Explains encoding techniques in Parquet files, including dictionary, RLE, bit-packing, and delta encoding, to optimize storage and performance.
Explores why Parquet is the ideal columnar file format for optimizing storage and query performance in modern data lake and lakehouse architectures.
Explores how metadata in Parquet files improves data efficiency and query performance, covering file, row group, and column-level metadata.
Final guide in a series covering performance tuning and best practices for optimizing Apache Parquet files in big data workflows.
Explains Parquet's columnar storage model, detailing its efficiency for big data analytics through faster queries, better compression, and optimized aggregation.
Explores compression algorithms in Parquet files, comparing Snappy, Gzip, Brotli, Zstandard, and LZO for storage and performance.
A technical guide comparing spatial patterns in continuous raster data for overlapping regions using R, focusing on NDVI data analysis.
Using GitHub Actions to trigger Airflow DAGs for orchestrating data pipelines across Spark, Dremio, and Snowflake.
Explores using GitHub Actions for software development CI/CD and advanced data engineering tasks like ETL pipelines and data orchestration.
A practical guide to structuring Go projects, advocating for simplicity over rigid conventions and explaining when to use or avoid common directory patterns.
Podcast interview with Gorkem Ercan discussing Eclipse Foundation, AI/ML adoption in enterprises, CI/CD practices, and open source development.
Security audit results for vdirsyncer reveal four minor findings, including file permissions and error handling issues, with fixes implemented.
Explores the future of PostgreSQL, focusing on the power of extensions like pg_stat_statements, Citus, and pg_search to add new capabilities.
A guide explaining dbt macros, their purpose, benefits, and how to use them to write reusable, standardized SQL code in data transformation projects.
A summary of HashiConf 2024, covering major announcements like Terraform Stacks and the event's focus on Infrastructure and Security Lifecycle Management.
A discussion on the proposed behavior and limitations of the new <selectedoption> element for styling HTML select elements.