Building pyarrow with CUDA support
A step-by-step guide to building the pyarrow Python library with CUDA support using Docker on Ubuntu for GPU data processing.
Randy Zwitch is a software engineer specializing in Python and data engineering. His blog features detailed tutorials on building and optimizing Python tools like PyArrow with GPU/CUDA support, Docker workflows, and high-performance data processing.
96 articles from this blog
A beginner's guide to using BenchmarkTools.jl for performance optimization in Julia, with practical examples and common pitfalls.
RSiteCatalyst v1.4.16 released with a community-contributed fix for using multiple classification columns in QueueDataWarehouse.
A talk on building a real-time telemetry data pipeline using StreamSets, Apache Kafka, and OmniSciDB for analytics.
A look back at a 2018 PyData talk on end-to-end GPU data science workflows using OmniSci and RAPIDS, highlighting concepts still relevant today.
A technical guide on fixing timestamp corruption in CSV data using pandas and uploading the corrected data to OmniSci using pymapd.
A technical guide on using lftp and cron to automatically mirror a large FTP dataset to a local VM for processing.
A technical guide on installing OmniSci (MapD) ODBC drivers and establishing a database connection within RStudio Server on an Azure Ubuntu VM.
Explores GPU-based data science workflows using MapD (now OmniSci) for high-performance analytics and machine learning without data transfer bottlenecks.
A guide to converting many .zip files to .gz format in parallel using a command-line one-liner for efficient disk usage.
A technical guide on loading multiple geospatial shapefiles into a Postgres/PostGIS database using shell commands and data preparation techniques.
A technical guide on integrating Adobe Analytics data into Microsoft Power BI dashboards using the RSiteCatalyst R package and its API.
A technical tutorial on using Python and pandas to process electricity data and load it into OmniSci (formerly MapD) for dashboard creation.
RSiteCatalyst v1.4.14 release notes detailing a single bug fix and encouraging community contributions via GitHub.
A tutorial on installing OmniSci (formerly MapD) using Docker and loading data for GPU-accelerated SQL analytics and visualization.
A tutorial on using Julia's CUDAnative.jl package to achieve 20x speedups by parallelizing haversine distance calculations on an NVIDIA GPU.
RSiteCatalyst v1.4.13 fixes an OAuth2 authentication bug reported by a community member. A minor, cumulative update.
Release notes for RSiteCatalyst versions 1.4.11 and 1.4.12, detailing new methods, bug fixes, and community contributions.
Guide to setting up and using Adobe Analytics' self-service Data Feed feature for customer-level analytics via FTP/SFTP/S3.
A guide to building a custom data science workstation for GPU computing and Docker, including specs, assembly, and a later upgrade.