# RevoScaleR Functions for Spark on Hadoop

The RevoScaleR package provides a set of portable, scalable, distributable data analysis functions. This page presents a curated list of functions that might be particularly interesting to Hadoop users. These functions can be called directly from the command line.

The RevoScaleR package supports two Hadoop compute contexts:

  • RxSpark (recommended), a distributed compute context in which computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. This provides up to a 7x performance boost compared to RxHadoopMR. For guidance, see How to use RevoScaleR on Spark.

  • RxHadoopMR (deprecated), a distributed compute context on a Hadoop cluster. This compute context can be used on a node (including an edge node) of a Cloudera or Hortonworks cluster with a RHEL operating system, or a client with an SSH connection to such a cluster. For guidance, see How to use RevoScaleR on Hadoop MapReduce.

On Hadoop Distributed File System (HDFS), the XDF file format stores data in a composite set of files rather than a single file.
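
As a minimal sketch of how these pieces fit together, the function below opens a Spark compute context and inspects a composite XDF data set stored in HDFS. The function name `inspect_hdfs_xdf` and the example path are hypothetical. The sketch assumes RevoScaleR is installed on a node that can reach the cluster, and that the default connection settings are adequate.

```r
# Sketch only: inspect a composite XDF data set on HDFS from a Spark compute context.
# Assumes RevoScaleR is installed and the node can reach the cluster.
inspect_hdfs_xdf <- function(xdf_dir) {
  # Start a persistent Spark session with default settings.
  RevoScaleR::rxSparkConnect()
  # Disconnect again when the function exits, even if an error occurs.
  on.exit(RevoScaleR::rxSparkDisconnect(), add = TRUE)

  # On HDFS, the XDF "file" is a directory holding a composite set of files.
  hdfs <- RevoScaleR::RxHdfsFileSystem()
  xdf <- RevoScaleR::RxXdfData(xdf_dir, fileSystem = hdfs)

  # Return summary and per-variable information for the data source.
  RevoScaleR::rxGetInfo(xdf, getVarInfo = TRUE)
}

# Hypothetical usage; the path is an assumption, not taken from this page:
# inspect_hdfs_xdf("/share/AirlineDemoSmallXdf")
```

Wrapping the calls in a function keeps the connect and disconnect steps paired, so a failed call does not leave a Spark session open.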

## Data Analysis Functions

#### Import and Export Functions


| Function Name | Description |
|---|---|
| rxDataStep | Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output data) from an .xdf file or a data frame. |
| RxXdfData | Creates an efficient XDF data source object. |
| RxTextData | Creates a comma-delimited text data source object. |
| rxGetInfo | Retrieves summary information from a data source or data frame. |
| rxGetVarInfo | Retrieves variable information from a data source or data frame. |
| rxGetVarNames | Retrieves variable names from a data source or data frame. |
| RxHdfsFileSystem | Creates an HDFS file system object. |

#### Manipulation, Cleansing, and Transformation Functions
| Function Name | Description |
|---|---|
| rxDataStep | Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output) from an .xdf file or a data frame. |
| rxFactors | Create or recode factor variables in a composite XDF file in HDFS. A new file must be written out. |

#### Analysis Functions for Descriptive Statistics and Cross-Tabulations
| Function Name | Description |
|---|---|
| rxQuantile | Computes approximate quantiles for .xdf files and data frames without sorting. |
| rxSummary | Basic summary statistics of data, including computations by group. Writing by-group computations to an .xdf file is not supported. |
| rxCrossTabs | Formula-based cross-tabulation of data. |
| rxCube | Alternative formula-based cross-tabulation designed for efficient representation, returning ‘cube’ results. Writing output to an .xdf file is not supported. |


#### Analysis, Learning, and Prediction Functions for Statistical Modeling
| Function Name | Description |
|---|---|
| rxLinMod | Fits a linear model to data. |
| rxLogit | Fits a logistic regression model to data. |
| rxGlm | Fits a generalized linear model to data. |
| rxCovCor | Calculates the covariance, correlation, or sum of squares / cross-product matrix for a set of variables. |
| rxDTree | Fits a classification or regression tree to data. |
| rxBTrees | Fits a classification or regression decision forest to data using a stochastic gradient boosting algorithm. |
| rxDForest | Fits a classification or regression decision forest to data. |
| rxPredict | Calculates predictions for fitted models. Output must be an XDF data source. |
| rxKmeans | Performs k-means clustering. |
| rxNaiveBayes | Fits a Naive Bayes classifier to an .xdf file or data frame, for small or large data, using a parallel external memory algorithm. |
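
The sketch below illustrates the constraint that `rxPredict` output must be an XDF data source: it fits a linear model on one composite XDF data set in HDFS and writes predictions for another to a third. The function name `fit_and_score`, the three directory arguments, and the column names `ArrDelay` and `DayOfWeek` are placeholders chosen for illustration. It assumes a Hadoop compute context is already active.

```r
# Sketch only: fit a model and score new data, writing predictions to XDF on HDFS.
# Assumes a Hadoop compute context (for example, RxSpark) is already active.
fit_and_score <- function(train_dir, score_dir, pred_dir) {
  hdfs <- RevoScaleR::RxHdfsFileSystem()
  train <- RevoScaleR::RxXdfData(train_dir, fileSystem = hdfs)
  score <- RevoScaleR::RxXdfData(score_dir, fileSystem = hdfs)
  # Predictions go to an XDF data source rather than a data frame.
  preds <- RevoScaleR::RxXdfData(pred_dir, fileSystem = hdfs)

  # ArrDelay and DayOfWeek are placeholder column names.
  model <- RevoScaleR::rxLinMod(ArrDelay ~ DayOfWeek, data = train)
  RevoScaleR::rxPredict(model, data = score, outData = preds, overwrite = TRUE)

  # Return the fitted model so the caller can inspect it with summary().
  model
}
```

The same pattern should carry over to the other modeling functions in the table, since each returns a model object that `rxPredict` accepts.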

## Compute Context Functions

| Function Name | Description |
|---|---|
| RxHadoopMR | Creates an in-data, file-based Hadoop compute context. |
| RxSpark | Creates an in-data, file-based Spark compute context. Computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. |
| rxSparkConnect | Creates a persistent Spark compute context. |
| rxSparkDisconnect | Disconnects a Spark session and returns to a local compute context. |
| rxInstalledPackages | Returns the list of installed packages for a compute context. |
| rxFindPackage | Returns the path to one or more packages for a compute context. |

## Data Source Functions

Not all data source types are available on all compute contexts. For the Hadoop compute contexts, the data sources listed below can be used.


| Function Name | Description |
|---|---|
| RxXdfData | Creates an efficient XDF data source object. |
| RxTextData | Creates a comma-delimited text data source object. |
| RxHiveData | Generates a Hive Data Source object. |
| RxParquetData | Generates a Parquet Data Source object. |
| rxSparkDataOps | Lists cached RxParquetData or RxHiveData data source objects. |
| rxSparkRemoveData | Removes cached RxParquetData or RxHiveData data source objects. |

## High Performance Computing and Distributed Computing Functions

The Hadoop compute context has a number of helpful functions used for high performance computing and distributed computing. Learn more about the entire set of functions in the Distributed Computing guide.

| Function Name | Description |
|---|---|
| rxExec | Run an arbitrary R function on nodes or cores of a cluster. |
| rxGetJobStatus | Get the status of a non-waiting distributed computing job. |
| rxGetJobResults | Get the return object(s) of a non-waiting distributed computing job. |
| rxGetJobOutput | Get the console output from a non-waiting distributed computing job. |
| rxGetJobs | Get the available distributed computing job information objects. |
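
As a rough illustration of `rxExec`, the sketch below runs a small simulation function several times in the active compute context. The names `run_simulations` and `mean_of_normals` are hypothetical, and the sketch assumes a distributed compute context has already been set.

```r
# Sketch only: run an arbitrary R function several times with rxExec.
# Assumes a distributed compute context (for example, RxSpark) is already active.
run_simulations <- function(n_runs = 4L) {
  # A plain R function; rxExec ships it to the cluster and runs it n_runs times.
  mean_of_normals <- function(n) mean(rnorm(n))

  # Extra named arguments (here, n) are passed through to the function.
  RevoScaleR::rxExec(mean_of_normals, n = 1000L, timesToRun = n_runs)
}
```

In a waiting compute context the call returns a list with one result per run. In a non-waiting context it returns a job information object instead, which can be passed to `rxGetJobStatus` and `rxGetJobResults`.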

## Hadoop Convenience Functions

RevoScaleR also provides some wrapper functions for accessing Hadoop/HDFS functionality via R. These functions require access to Hadoop, either locally or remotely via the RxHadoopMR or RxSpark compute contexts.

| Function Name | Description |
|---|---|
| rxHadoopCommand | Execute an arbitrary Hadoop command. Allows you to run basic Hadoop commands. |
| rxHadoopVersion | Return the current Hadoop version. |
| rxHadoopCopyFromClient | Copy a file from a remote client to the Hadoop cluster's local file system, and then to HDFS. |
| rxHadoopCopyFromLocal | Copy a file from the native file system to HDFS. Wraps the Hadoop fs -copyFromLocal command. |
| rxHadoopCopy | Copy a file in the Hadoop Distributed File System (HDFS). Wraps the Hadoop fs -cp command. |
| rxHadoopRemove | Remove a file in HDFS. Wraps the Hadoop fs -rm command. |
| rxHadoopListFiles | List files in an HDFS directory. Wraps the Hadoop fs -ls or fs -lsr command. |
| rxHadoopMakeDir | Make a directory in HDFS. Wraps the Hadoop fs -mkdir command. |
| rxHadoopMove | Move a file in HDFS. Wraps the Hadoop fs -mv command. |
| rxHadoopRemoveDir | Remove a directory in HDFS. Wraps the Hadoop fs -rmr command. |
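
The wrappers can be combined to stage data before an analysis. The sketch below creates an HDFS directory, copies a file into it from the native file system, and lists the result. The function name `stage_file_in_hdfs` and both paths in the usage comment are hypothetical. It assumes Hadoop is reachable from the current session.

```r
# Sketch only: stage a local file in HDFS using the convenience wrappers.
# Assumes Hadoop is accessible locally or through a Hadoop compute context.
stage_file_in_hdfs <- function(local_file, hdfs_dir) {
  # Wraps the Hadoop fs -mkdir command.
  RevoScaleR::rxHadoopMakeDir(hdfs_dir)
  # Wraps the Hadoop fs -copyFromLocal command.
  RevoScaleR::rxHadoopCopyFromLocal(local_file, hdfs_dir)
  # Wraps the Hadoop fs -ls command; lists what is now in the directory.
  RevoScaleR::rxHadoopListFiles(hdfs_dir)
}

# Hypothetical usage; both paths are assumptions:
# stage_file_in_hdfs("/tmp/AirlineDemoSmall.csv", "/share/airline")
```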