Using SQL Workbench with Apache Hive
A tutorial on connecting to Apache Hive using the open-source SQL Workbench tool via JDBC, covering driver setup and connection configuration.
Randy Zwitch is a software engineer specializing in Python and data engineering. His blog features detailed tutorials on building and optimizing Python tools like PyArrow with GPU/CUDA support, Docker workflows, and high-performance data processing.
96 articles from this blog
A guide to using the RSiteCatalyst R package to access Adobe Analytics real-time reporting API for monitoring metrics like orders and revenue.
RSiteCatalyst v1.3 adds regex search, Realtime API support, and configurable request timing for the Adobe Analytics R package.
Final tutorial on analyzing airline data with Hadoop using Hive for SQL queries and Pig for scripting, covering setup and basic analytics.
A tutorial on manually creating dummy variables in R to handle categorical data with many levels, addressing a common randomForest package error.
A guide to automatically generating Adobe Analytics implementation documentation using the RSiteCatalyst R package and the Adobe API.
A guide to setting up a remote IPython Notebook server on Amazon EC2 for data science and analytics.
Two methods to add line numbers to cells in IPython/Jupyter Notebooks: a keyboard shortcut toggle and a permanent startup configuration.
RSiteCatalyst v1.2 is released on CRAN with bug fixes, dependency removal, and improved numeric type handling for the Adobe Analytics API.
A technical guide on using K-Means clustering in R to analyze and segment search keywords for understanding user intent in digital analytics.
A benchmark comparison of Julia, Python, R, and pqR on a Project Euler problem, exploring performance gains from JIT compilation.
RSiteCatalyst 1.1 is released with new API features, faster calls, and an extended timeout for Adobe Analytics data in R.
A tutorial on using Apache Hive to create tables and views from data loaded into a Hadoop cluster, continuing a multi-part series.
Explains how to use the Adobe Analytics API and R for statistical anomaly detection in time-series marketing data.
A guide to reading and writing tabular data in Julia using arrays, DataFrames, and ODBC database connections.
A technical guide on using Python, mrjob, and Amazon EMR for Hadoop Streaming to perform large-scale, parallel URL classification.
An introduction to the Julia programming language for scientific computing, covering installation, package management, and basic syntax comparisons.
A tutorial on loading the Airline Dataset into Hadoop's HDFS using the Hue File Browser interface.
Argues that true data science and innovation require deep mathematical understanding, not just push-button tools, and defends the value of skilled data scientists.
A tutorial on installing and configuring an 18-node Hadoop cluster on Amazon EC2 using Cloudera Manager.