The Basics of Compaction — Bin Packing Your Data for Efficiency
Explains data compaction using bin packing in Apache Iceberg to merge small files, improve query performance, and reduce metadata overhead.
A hands-on tutorial on building a data lakehouse pipeline using Spark, Dremio, and Superset to move and analyze data.
A hands-on tutorial for setting up a Docker environment to experiment with the Apache Iceberg table format using Spark SQL.
Guide on configuring an external Apache Hive metastore with Azure SQL for use in an Azure Synapse Analytics Spark Pool, troubleshooting common connection errors.
Practical strategies for staying current in the fast-moving field of machine learning, including project experimentation and community engagement.
Notes from Spark+AI Summit 2020 covering application-specific talks on ML frameworks, data engineering, feature stores, and data quality from companies like Airbnb and Netflix.
Explores SQL-on-Hadoop engines like Apache Drill for analyzing ETL data processed with Spark on Amazon EMR, focusing on performance and flexibility.
Final summary of a project exploring ETL offload to Apache Spark on AWS EMR, evaluating cost and tech benefits for a cloud-based data platform.
Part 2 of a guide on developing ETL processes using Apache Spark, Jupyter Notebooks, and Docker on Amazon EMR.
Explores using Apache Spark on Amazon EMR to offload and improve ETL processes, comparing it to traditional Oracle-based solutions.
A data scientist reviews Martin Odersky's Functional Programming in Scala Coursera course, covering key learnings and its practical application.
A former PhD scientist shares his positive transition to data science freelancing, detailing the freedom and variety of his new career.
A deep-dive technical guide into Laravel Spark, an alpha-release tool for quickly building SaaS applications with Laravel.