A guide to retrieval and processing of data from relational database systems using Apache Spark and JDBC with R and sparklyr

Introduction

The {sparklyr} package lets us connect to Apache Spark and use it for high-performance, highly parallelized, distributed computations. We can also use Spark’s capabilities to improve and streamline our data processing pipelines, as Spark supports reading from and writing to many popular sources, such as Parquet and ORC files, as well as most database systems via JDBC drivers.

In this post, we will explore using R to load data into Spark, and optionally into R, from relational database management systems …