Using Spark from R for performance with arbitrary code - Part 2 - Constructing functions by piping dplyr verbs

Introduction

In the first part of this series, we looked at how the sparklyr interface communicates with the Spark instance and what this means for performance with regard to arbitrarily defined R functions. We also examined how Apache Arrow can increase the performance of data transfers between the R session and the Spark instance.

In this second part, we will look at how to write R functions that can be executed directly by Spark without serialization overhead that …