PySpark 101: Introduction to Big Data with Spark
A beginner-friendly introduction to using PySpark for big data processing with Apache Spark, covering the fundamentals.
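As a taste of those fundamentals, here is a minimal sketch of a PySpark session; the app name and sample data are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (the entry point for all PySpark work)
spark = SparkSession.builder.appName("pyspark-101").getOrCreate()

# A tiny in-memory DataFrame standing in for a large dataset
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Transformations are lazy; nothing runs until an action like show()
df.filter(F.col("age") > 30).select("name").show()

spark.stop()
```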
A comprehensive 2025 guide to Apache Iceberg, covering its architecture, ecosystem, and practical use for data lakehouse management.
Explores why Parquet is the ideal columnar file format for optimizing storage and query performance in modern data lake and lakehouse architectures.
An introduction to Apache Parquet, a columnar storage file format for efficient data processing and analytics.
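A quick sketch of the core benefit from PySpark, assuming an illustrative output path: because Parquet stores each column contiguously, readers can scan only the columns a query needs (column pruning).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "user_id")
df = df.withColumn("score", df.user_id % 100)

# Columnar storage: Parquet keeps each column together on disk
df.write.mode("overwrite").parquet("/tmp/events.parquet")

# Readers can request just the columns they need (column pruning)
spark.read.parquet("/tmp/events.parquet").select("score").show(3)
```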
Explains the hierarchical structure of Parquet files, detailing how pages, row groups, and columns optimize storage and query performance.
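That hierarchy is visible from the file footer. A small sketch using pyarrow, with a hypothetical local file name, that inspects row groups and the column chunks (and their pages) inside them without reading the data:

```python
import pyarrow.parquet as pq

# Inspect the footer of a Parquet file without reading the data pages
pf = pq.ParquetFile("events.parquet")  # hypothetical local file
meta = pf.metadata

print("row groups:", meta.num_row_groups)
print("columns   :", meta.num_columns)

# Each row group stores a chunk of every column; each chunk is split into pages
rg = meta.row_group(0)
col = rg.column(0)
print("rows in group 0   :", rg.num_rows)
print("col 0 compression :", col.compression)
print("col 0 encodings   :", col.encodings)
```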
Final guide in a series covering performance tuning and best practices for optimizing Apache Parquet files in big data workflows.
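Two of the most common tuning knobs are row group size and the compression codec. A hedged sketch with pyarrow; the 128k-row group and zstd codec are illustrative choices, not universal defaults:

```python
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000
table = pa.table({
    "user_id": list(range(n)),
    "score": [i % 100 for i in range(n)],
})

# Larger row groups favor scan throughput; smaller ones favor selective reads.
pq.write_table(
    table,
    "events_tuned.parquet",
    row_group_size=128_000,
    compression="zstd",
)
```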
An introduction to Apache Iceberg, a table format for data lakehouses, explaining its architecture and providing learning resources.
Compares partitioning techniques in Apache Hive and Apache Iceberg, highlighting Iceberg's advantages for query performance and data management.
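The key difference is Iceberg's hidden partitioning: tables partition on a transform of a regular column, so queries filter on the column itself rather than a Hive-style partition path. A sketch assuming a SparkSession with an Iceberg catalog named `local` already configured (see the configuration sketch under the next entry); the table name is hypothetical:

```python
# Hidden partitioning: partition by a transform of a column, not a separate
# partition column that queries must know about (unlike Hive-style paths).
spark.sql("""
    CREATE TABLE local.db.events (
        id      BIGINT,
        ts      TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Readers just filter on ts; Iceberg maps it to partitions internally.
spark.sql("SELECT * FROM local.db.events WHERE ts >= '2025-01-01'").show()
```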
A guide to configuring Apache Spark for use with the Apache Iceberg table format, covering packages, flags, and programmatic setup.
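A minimal programmatic setup sketch; the runtime jar coordinates must match your Spark/Scala build, the version shown is only an example, and `local` is an example catalog name:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-setup")
    # Pull the Iceberg Spark runtime; version/coordinates are illustrative
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1",
    )
    # Enable Iceberg's SQL extensions (MERGE INTO, ALTER TABLE ... etc.)
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register a Hadoop-backed catalog named "local" with a warehouse path
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)
```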
Argues that raw data is overvalued without proper context and conversion into meaningful information and knowledge.
Explains the APPROX_COUNT_DISTINCT function for faster, memory-efficient distinct counts in SQL, comparing it to exact COUNT(DISTINCT).
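The same trade-off is available in PySpark, shown here as an illustrative stand-in for the SQL function: the approximate version uses a HyperLogLog-style sketch with bounded memory and a tunable relative error (`rsd`).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("approx-distinct").getOrCreate()

df = spark.range(1_000_000).withColumn("user_id", F.col("id") % 50_000)

# Exact distinct count: precise, but can shuffle a lot of data
df.select(F.countDistinct("user_id")).show()

# Approximate: bounded memory, tunable relative standard deviation
df.select(F.approx_count_distinct("user_id", rsd=0.02)).show()
```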
Explains improvements in joblib's compressed persistence for Python, focusing on reduced memory usage and single-file storage for large numpy arrays.
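A small sketch of the feature in use, with illustrative file and object names; `compress` trades CPU time for file size, and the whole object lands in one file rather than one sidecar file per numpy array:

```python
import numpy as np
import joblib

model_state = {"weights": np.random.rand(1000, 1000), "bias": np.zeros(1000)}

# compress=3 is a moderate level; arrays are stored inside a single file
joblib.dump(model_state, "model_state.joblib", compress=3)

restored = joblib.load("model_state.joblib")
```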
Technical guide on building a real-time Twitter sentiment analysis system using Apache Kafka and Storm.
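The sentiment-scoring topology itself runs in Storm (on the JVM); as a hedged sketch of only the ingestion side, assuming the kafka-python client and illustrative broker and topic names:

```python
import json
from kafka import KafkaProducer  # kafka-python; an assumed client choice

# Push raw tweets onto a Kafka topic for the Storm topology to consume
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

tweet = {"user": "example", "text": "Loving the new release!"}
producer.send("tweets", tweet)
producer.flush()
```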
Explains Lambda Architecture for Big Data, combining batch processing (Hadoop) and real-time stream processing (Spark, Storm) to handle large datasets.
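A toy illustration of the serving layer's job under that architecture: merge a stale-but-complete batch view with a fresh-but-partial real-time view at query time. All names and numbers are made up for the sketch.

```python
# Batch view: recomputed periodically from the full dataset (e.g., Hadoop jobs)
batch_view = {"page_a": 10_000, "page_b": 7_500}

# Real-time view: incremented by the stream layer since the last batch run
realtime_view = {"page_a": 42, "page_c": 5}

def query(page: str) -> int:
    # The serving layer answers queries by combining both views
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("page_a"))  # 10042: batch result plus recent stream updates
```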