**implyr**: A **dplyr** Backend for Apache Impala

with Ian Cook

useR!2017: **implyr**: A **dplyr** Backend for Apa...

Keywords: Tidyverse, dplyr, SQL, Apache Impala, Big Data
Webpages: https://CRAN.R-project.org/package=implyr
This talk introduces implyr, a new dplyr backend for Apache Impala (incubating). I compare the features and performance of implyr to those of dplyr backends for other distributed query engines, including sparklyr for Apache Spark's Spark SQL, bigrquery for Google BigQuery, and RPresto for Presto.
Impala is a massively parallel processing query engine that enables low-latency SQL queries on data stored in the Hadoop Distributed File System (HDFS), Apache HBase, Apache Kudu, and Amazon Simple Storage Service (S3). The distributed architecture of Impala enables fast interactive queries on petabyte-scale data, but it imposes limitations on the dplyr interface. For example, row ordering of a result set must be performed in the final phase of query processing. I describe the methods used to work around this and other limitations.
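The ordering limitation can be illustrated with a minimal sketch. It assumes the dplyr and dbplyr packages are installed and uses dbplyr's simulated Impala connection (`simulate_impala()`) rather than a live cluster or implyr itself; the table and column names (`carrier`, `dep_delay`) are hypothetical. The point is only that `arrange()` is placed as the last verb, so the `ORDER BY` clause lands at the end of the generated SQL.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table, backed by a simulated Impala connection (no cluster needed).
flights <- lazy_frame(
  carrier = "AA",
  dep_delay = 5,
  con = simulate_impala()
)

# Aggregate first, then sort: arrange() is the final step of the pipeline,
# so ordering is applied in the last phase of the query.
query <- flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))

# Render the SQL that dplyr would send to the engine.
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```

Placing `arrange()` earlier in the pipeline would not guarantee the order of the final result, since a distributed engine is free to reorder rows between intermediate steps.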
Finally, I discuss broader issues regarding the DBI-compatible interfaces that dplyr requires for underlying connectivity to database sources. implyr is designed to work with any DBI-compatible interface to Impala, such as the general packages odbc and RJDBC, whereas other dplyr database backends typically rely on one particular package or mode of connectivity.
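As a rough sketch of what driver-agnostic connectivity might look like, the helper below passes whichever DBI-compatible driver object it receives to `implyr::src_impala()`. The helper name `connect_impala` is hypothetical, and the driver name, host, port, and database in the usage comment are placeholder assumptions that depend on the local installation.

```r
# Hypothetical helper: accepts any DBI-compatible driver object and forwards
# the driver-specific connection arguments unchanged to implyr.
connect_impala <- function(drv, ...) {
  implyr::src_impala(drv = drv, ...)
}

# Usage with the odbc package (all connection values are placeholders):
# impala <- connect_impala(
#   odbc::odbc(),
#   driver   = "Cloudera ODBC Driver for Impala",
#   host     = "host",
#   port     = 21050,
#   database = "default"
# )
# flights <- dplyr::tbl(impala, "flights")
```

An RJDBC driver created with `RJDBC::JDBC()` could be passed the same way, with the JDBC URL and credentials supplied through `...`.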