Big Data articles

10/13/2018 • EN

Approximate Distinct Count

Explains the APPROX_COUNT_DISTINCT function for faster, memory-efficient distinct counts in SQL, comparing it to exact COUNT(DISTINCT).

algorithm, Approximate Distinct Count, Big Data, data processing, HyperLogLog

Niko Neugebauer

3/5/2018 • EN

Faster generalised linear models in largeish data

A method for faster generalized linear models on large datasets using a single database query and one Newton-Raphson iteration.

Big Data, Generalized Linear Models, optimization, R, Statistical Computing

Thomas Lumley

2/1/2018 • EN

Getting Started With OmniSci, Part 1: Docker Install and Loading Data

A tutorial on installing OmniSci (formerly MapD) using Docker and loading data for GPU-accelerated SQL analytics and visualization.

Big Data, data visualization, Docker, GPU Computing, SQL Analytics

Randy Zwitch

11/21/2017 • EN

Installing Oracle GoldenGate for Big Data 12.3.1 with Kafka Connect and Confluent Platform

A technical guide on installing and configuring Oracle GoldenGate for Big Data with Kafka Connect and Confluent Platform.

Apache Kafka, Big Data, Confluent Platform, Data Integration, Oracle GoldenGate

Robin Moffatt

9/20/2017 • EN

Apache Kafka™ talks at Oracle OpenWorld, JavaOne, and Oak Table World 2017

A list of 19 Apache Kafka-related technical sessions at the Oracle OpenWorld, JavaOne, and Oak Table World 2017 conferences.

Apache Kafka, Big Data, Data Pipeline, Microservices, Stream Processing

Robin Moffatt

12/20/2016 • EN

ETL Offload with Spark and Amazon EMR - Part 5 - Summary

Final summary of a project exploring ETL offload to Apache Spark on AWS EMR, evaluating cost and tech benefits for a cloud-based data platform.

Amazon EMR, AWS, Big Data, ETL, Spark

Robin Moffatt

10/17/2016 • EN

Is Privacy Dead?

A personal reflection on the trade-offs between convenience and privacy in an era of AI, IoT, and pervasive data collection.

artificial intelligence, Big Data, cybersecurity, Internet Of Things, privacy

Carlos Mendible

9/6/2016 • EN

Using logdump to Troubleshoot the Oracle GoldenGate for Big Data Kafka Handler

Troubleshooting Oracle GoldenGate for Big Data Kafka Handler errors using logdump and debug logs.

Big Data, Java, Kafka Connect, Oracle GoldenGate, Troubleshooting

Robin Moffatt

8/11/2016 • EN

An Introduction to Apache Drill

Introduction to Apache Drill, a SQL engine for querying diverse data sources like files (CSV, JSON) and databases.

Apache Drill, Big Data, Data Sources, Parquet, SQL Interface

Robin Moffatt

7/14/2016 • EN

Using R with Jupyter Notebooks and Oracle Big Data Discovery

Guide on using R within Jupyter Notebooks to analyze and manipulate datasets in Oracle Big Data Discovery, enabling advanced statistical workflows.

Big Data, Data Science, Jupyter Notebooks, Oracle Big Data Discovery, R

Robin Moffatt

5/20/2016 • EN

Better Python compressed persistence in joblib

Explains improvements in joblib's compressed persistence for Python, focusing on reduced memory usage and single-file storage for large numpy arrays.

Big Data, compression, Joblib, Persistence, Python

Gael Varoquaux

2/7/2016 • EN

Sentiment analysis of tweets

Technical guide on building a real-time Twitter sentiment analysis system using Apache Kafka and Storm.

Apache Kafka, Apache Storm, Big Data, Real Time Processing, Sentiment Analysis

Marçal Serrate

12/13/2015 • EN

Big Data: streams and lambdas

Explains Lambda Architecture for Big Data, combining batch processing (Hadoop) and real-time stream processing (Spark, Storm) to handle large datasets.

Batch Processing, Big Data, Hadoop, Lambda Architecture, Stream Processing

Marçal Serrate

10/28/2015 • EN

Forays into Kafka - Enabling Flexible Data Pipelines

Explores using Apache Kafka to create flexible, testable data pipelines, enabling multiple parallel consumers and safe experimentation.

Apache Kafka, Big Data, Data Integration, Data Pipelines, Stream Processing

Robin Moffatt

10/25/2015 • EN

Building Data Pipelines with Microsoft Azure Data Factory

A tutorial on building data pipelines using Microsoft Azure Data Factory, covering ingestion, transformation, and orchestration.

Azure Data Factory, Big Data, cloud computing, Data Pipelines, ETL

Rahul Rai

11/4/2014 • EN

Analytics with Kibana and Elasticsearch through Hadoop - part 3 - Visualising the data in Kibana

Part 3 of a series on visualizing data in Kibana after loading it from Hadoop via Elasticsearch, comparing Kibana 3 and 4.

Big Data, data visualization, Elasticsearch, Hadoop, Kibana

Robin Moffatt

11/3/2014 • EN

Analytics with Kibana and Elasticsearch through Hadoop - part 1 - Introduction

Introduction to using Kibana and Elasticsearch with Hadoop for visualizing and analyzing big data from web logs, tweets, and blog metadata.

Big Data, data visualization, Elasticsearch, Hadoop, Kibana

Robin Moffatt

8/22/2014 • EN

Hacking Academia: Data Science and the University

A reflection on the challenges of data science in academia, discussing the 'brain drain' of data skills and the need for systemic change.

Academia, Big Data, conference, Data Science, Research

Jake VanderPlas

6/12/2014 • EN

Five Hard-Won Lessons Using Hive

A data engineer shares five practical lessons and performance tips for working with Apache Hive, focusing on common pitfalls and optimizations.

Big Data, Data Engineering, Hadoop, Hive, SQL

Randy Zwitch

5/2/2014 • EN

MongoDB Connector for Hadoop with Authentication - Quick Tip

Fixing MongoDB Connector for Hadoop authentication errors by granting the clusterManager role to the user.

authentication, Big Data, Connector, Hadoop, MongoDB

Paul Done
