Analysis-Ready OpenStreetMap
Exploring the Layercake project's analysis-ready OpenStreetMap data in Parquet format, including setup and performance on a high-end workstation.
Analysis of a new global building dataset (2.75B structures), detailing the data processing, technical setup, and tools used for exploration.
Explores workflow orchestration in data engineering, covering DAGs, tools, and best practices for managing complex data pipelines.
Explains core data engineering concepts, comparing ETL and ELT data pipeline strategies and their use cases.
A guide to building a data pipeline using DuckDB, covering data ingestion, transformation, and analytics with real-world environmental data.
A data engineer reflects on their 2-year career journey at the City of Boston, sharing lessons learned in data warehousing, ETL, and civic tech.
Explores three types of data change events in Change Data Capture (CDC): Full, Delta, and Id-only events, detailing their structure and use cases.
Explores a taxonomy of data change events in CDC, detailing Full, Delta, and Id-only events and their use cases.
An introduction to Data Vault modeling, a flexible data warehouse design method using Hubs, Links, and Satellites for scalable data integration.
A weekly tech learning digest covering Microsoft Fabric, AI topics, computer vision, Azure AI Document Intelligence, embeddings, and vector search.
A technical guide on using ClickHouse to export PostgreSQL data to Parquet format for faster loading into Google BigQuery.
Explains the evolution from ETL to ELT in data engineering, clarifying the role of modern tools like dbt in the transformation process.
A guide to using RAPIDS to accelerate ETL and data processing workflows within a Kubeflow environment by leveraging GPUs.
A technical guide on running RSQL for Redshift within an AWS Fargate container, including setup, configuration, and containerization steps.
Explains the differences between batch and streaming data processing, covering OLTP, OLAP, and ETL concepts for data engineers.
Analysis of an AWS serverless ETL pattern using EventBridge, Lambda, Fargate, and S3 to process CSV files into DynamoDB.
A former Application DBA shares advanced SQL and database optimization techniques for developers, focusing on performance and efficiency.
Explains why Apache Airflow jobs appear to run a day late due to its scheduling logic, contrasting it with cron jobs.
Explores a new feature in SQL Server 2019's SET STATISTICS IO output, revealing detailed I/O metrics for INSERT operations into target tables.
A technical deep dive into solving PostgreSQL disk space issues by optimizing a deduplication query, focusing on reducing sort key size.