RevoScaleR package
The RevoScaleR library is a collection of portable, scalable, and distributable R functions for importing, transforming, and analyzing data at scale. You can use it for descriptive statistics, generalized linear models, k-means clustering, logistic regression, classification and regression trees, and decision forests.
Functions run on the RevoScaleR interpreter, which is built on open-source R and engineered to leverage the multithreaded and multinode architecture of the host platform.
Package details | Description |
---|---|
Current version: | 9.4 |
Built on: | R 3.5.2 |
Package distribution: | Machine Learning Server; R Client (Windows and Linux); R Server 9.1 and earlier; SQL Server 2016 and later (Windows only); Azure HDInsight; Azure Data Science Virtual Machines |
How to use RevoScaleR
The RevoScaleR library is found in Machine Learning Server and Microsoft R products. You can use any R IDE to write R scripts that call functions in RevoScaleR, but the script must run on a computer that has the interpreter and libraries installed.
RevoScaleR is often preloaded into tools that integrate with Machine Learning Server and R Client, which means you can call functions without having to load the library. If the library is not loaded, you can load RevoScaleR from the command line by typing library(RevoScaleR).
Run it locally
This is the default. RevoScaleR runs locally on all platforms, including R Client. On a standalone Linux or Windows system, data and operations are local to the machine. On Hadoop, a local compute context means that data and operations are local to the current execution environment (typically, an edge node).
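As a minimal sketch of the local default, the following lines load the library, set the local compute context explicitly, and inspect what is active. It assumes RevoScaleR is installed on the machine running the script; the variable name `cc` is only illustrative.

```r
library(RevoScaleR)

# "local" selects the sequential local compute context (RxLocalSeq).
# This is already the default, so setting it is only needed after a switch.
rxSetComputeContext("local")

# Inspect the compute context that is currently active.
cc <- rxGetComputeContext()
print(class(cc))
```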
Run in a remote compute context
RevoScaleR runs remotely on computers that have a server installation. In a remote compute context, the script running on a local R Client or Machine Learning Server shifts execution to a remote Machine Learning Server. For example, a script running on Windows might shift execution to a Spark cluster to process data there.
On distributed platforms, such as Hadoop processing frameworks (Spark and MapReduce), set the compute context to RxSpark or RxHadoopMR and give the cluster name. In this context, if you call a function that can run in parallel, the task is distributed across data nodes in the cluster, where the operation is co-located with the data.
On SQL Server, set the compute context to RxInSqlServer. There are two primary use cases for remote compute context:
- Call R functions in T-SQL script or stored procedures running on SQL Server.
- Call RevoScaleR functions in R script executing in a SQL Server compute context. In your script, you can set a compute context to shift execution of RevoScaleR operations to a remote SQL Server instance that has the RevoScaleR interpreter.
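The second use case might look roughly like the sketch below. The helper `summarizeInSqlServer` is hypothetical (it is not part of RevoScaleR), and the connection string, server, database, and table names in the commented call are placeholders. It assumes a reachable SQL Server instance with R integration enabled.

```r
library(RevoScaleR)

# Hypothetical helper (not part of RevoScaleR): summarizes a query result
# inside SQL Server, then restores the previously active compute context.
summarizeInSqlServer <- function(connStr, query) {
  # Compute context: run RevoScaleR operations on the SQL Server instance.
  sqlCompute <- RxInSqlServer(connectionString = connStr, wait = TRUE)

  # Data source: the rows returned by the query.
  sqlData <- RxSqlServerData(sqlQuery = query, connectionString = connStr)

  previous <- rxGetComputeContext()
  on.exit(rxSetComputeContext(previous), add = TRUE)
  rxSetComputeContext(sqlCompute)

  # Analytic function: summary statistics for every column.
  rxSummary(~ ., data = sqlData)
}

# Example call; all names in the arguments are placeholders.
# summarizeInSqlServer(
#   "Driver=SQL Server;Server=myServer;Database=myDatabase;Trusted_Connection=True",
#   "SELECT * FROM dbo.myTable"
# )
```

Restoring the earlier context in `on.exit` keeps later calls in the same session from unintentionally running on the server.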
Some functions in RevoScaleR are specific to particular compute contexts. A filtered list of functions includes the following:
Typical workflow
Whenever you want to perform an analysis using RevoScaleR functions, you should specify three distinct pieces of information:
- The analytic function, which specifies the analysis to be performed
- The compute context, which specifies where the computations should take place
- The data source, which is the data to be used
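The three pieces above can be sketched in a few lines. This example assumes a local installation and uses the AirlineDemoSmall.xdf sample file that ships in the RevoScaleR sample data directory; the variable names are illustrative.

```r
library(RevoScaleR)

# 1. Compute context: where the computation runs (local in this sketch).
rxSetComputeContext("local")

# 2. Data source: an XDF file from the sample data directory.
airXdf <- RxXdfData(
  file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf")
)

# 3. Analytic function: summary statistics for two variables.
airSummary <- rxSummary(~ ArrDelay + DayOfWeek, data = airXdf)
print(airSummary)
```

Changing only the compute context or only the data source leaves the analytic call unchanged, which is the point of keeping the three pieces separate.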
Functions by category
The library includes data transformation and manipulation, visualization, predictions, and statistical analysis functions. It also includes functions for controlling jobs, serializing data, and performing common utility tasks.
This section lists the functions by category to give you an idea of how each one is used. The table of contents lists functions in alphabetical order.
Note
Some function names begin with rx and others with Rx. The Rx function name prefix is used for class constructors for data sources and compute contexts.
1-Data source functions
Function name | Description |
---|---|
RxXdfData | Creates an efficient XDF data source object. |
RxTextData | Creates a comma-delimited text data source object. |
RxSasData | Creates a SAS data source object. |
RxSpssData | Creates an SPSS data source object. |
RxOdbcData | Creates an ODBC data source object. |
RxTeradata | Creates a Teradata data source object. |
RxSqlServerData | Creates a SQL Server data source object. |
2-Import and save-as
Function name | Description |
---|---|
rxImport * | Creates an .xdf file or data frame from a data source (for example, text, SAS, SPSS data files, ODBC or Teradata connection, or data frame). |
rxDataStep * | Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output data) from an .xdf file or a data frame. |
rxGetInfo * | Retrieves summary information from a data source or data frame. |
rxSetInfo * | Sets a file description in an .xdf file or a description attribute in a data frame. |
rxGetVarInfo | Retrieves variable information from a data source or data frame. |
rxSetVarInfo | Modifies variable information in an .xdf file or data frame. |
rxGetVarNames | Retrieves variable names from a data source or data frame. |
rxCreateColInfo | Generates a colInfo list from a data source. |
rxCompressXdf | Compresses an existing .xdf file, or a directory of .xdf files. |
rxIsOpen | Indicates whether a data source can be accessed. |
rxOpen | Opens a data source for reading. |
rxClose | Closes a data source. |
rxReadNext | Read data from a source. |
rxWriteNext | Writes the next chunk when moving data between RevoScaleR data sources. |
rxSetFileSystem | Specify a file system type for data for import. |
rxGetFileSystem | Retrieve the current file system type. |
rxHdfsFileSystem | Creates an HDFS file system object. |
rxNativeFileSystem | Creates a native file system object. |
rxSqlServerDropTable | Execute an SQL statement that drops a table. |
rxSqlServerTableExists | Execute an SQL statement that checks for a table's existence. |
* Signifies the most popular functions in this category.
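As an illustration of the import functions, the sketch below converts a comma-delimited sample file to .xdf and then inspects the result. It assumes the AirlineDemoSmall.csv sample file is present in the RevoScaleR sample data directory; the output path in the temporary directory is an arbitrary choice.

```r
library(RevoScaleR)

# Input: a sample CSV file shipped with RevoScaleR.
csvFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")

# Output: an .xdf file in the session's temporary directory.
xdfFile <- file.path(tempdir(), "AirlineDemoSmall.xdf")

# Import the text file; with outFile set, an XDF data source is returned.
airXdf <- rxImport(inData = csvFile, outFile = xdfFile, overwrite = TRUE)

# Summary information, including per-variable details.
print(rxGetInfo(airXdf, getVarInfo = TRUE))

# Variable names only.
varNames <- rxGetVarNames(airXdf)
print(varNames)
```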
3-Data transformation
Function name | Description |
---|---|
rxDataStep * | Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output) from an .xdf file or a data frame. |
rxFactors * | Recode a factor variable or convert non-factor variable into a factor in an .xdf file or data frame. |
rxGetFuzzyDist | Get fuzzy distances for a character vector. |
rxGetFuzzyKeys | Get fuzzy keys for a character vector. |
rxSplit | Splits an .xdf file or data frame into multiple .xdf files or data frames. |
rxSort | Multi-key sorting of the variables in an .xdf file or data frame. |
rxMerge | Merges two .xdf files or data frames using various merge types. |
rxExecuteSQLDDL | SQL Server R Services only. Runs an arbitrary SQL DDL command. |
* Signifies the most popular functions in this category.
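A short sketch of rxDataStep and rxSort on an in-memory data frame follows. It uses the built-in mtcars data set rather than an .xdf file so that it has no external dependencies; the new column name `wtKg` is illustrative.

```r
library(RevoScaleR)

# Keep cars with more than four cylinders and add a derived column.
# mtcars$wt is in units of 1000 lb, so multiply by 453.592 to get kilograms.
heavyCars <- rxDataStep(
  inData = mtcars,
  rowSelection = cyl > 4,
  transforms = list(wtKg = wt * 453.592)
)

# Sort by cylinder count (ascending), then by mpg (descending).
sortedCars <- rxSort(
  inData = heavyCars,
  sortByVars = c("cyl", "mpg"),
  decreasing = c(FALSE, TRUE)
)
print(head(sortedCars))
```

The same calls accept an .xdf data source for inData, with outFile set when the result should be written to disk instead of returned in memory.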
4-Graphing functions
Function name | Description |
---|---|
rxHistogram | Creates a histogram from data. |
rxLinePlot | Creates a line plot from data. |
rxLorenz | Computes a Lorenz curve that can be plotted. |
rxRocCurve | Computes and plots ROC curves from actual and predicted data. |
5-Descriptive statistics
Function name | Description |
---|---|
rxQuantile * | Computes approximate quantiles for .xdf files and data frames without sorting. |
rxSummary * | Basic summary statistics of data, including computations by group. Writing by-group computations to an .xdf file is not supported. |
rxCrossTabs * | Formula-based cross-tabulation of data. |
rxCube * | Alternative formula-based cross-tabulation designed for efficient representation returning cube results. Writing output to an .xdf file is not supported. |
rxMarginals | Marginal summaries of cross-tabulations. |
as.xtabs | Converts cross tabulation results to an xtabs object. |
rxChiSquaredTest | Performs Chi-squared Test on xtabs object. Used with small data sets and does not chunk data. |
rxFisherTest | Performs Fisher's Exact Test on xtabs object. Used with small data sets and does not chunk data. |
rxKendallCor | Computes Kendall's Tau Rank Correlation Coefficient using xtabs object. |
rxPairwiseCrossTab | Apply a function to pairwise combinations of rows and columns of an xtabs object. |
rxRiskRatio | Calculate the relative risk on a two-by-two xtabs object. |
rxOddsRatio | Calculate the odds ratio on a two-by-two xtabs object. |
* Signifies the most popular functions in this category.
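The following sketch shows by-group summaries and a cross-tabulation on a small data frame. It uses the built-in mtcars data set, with the grouping columns converted to factors first; the object names are illustrative.

```r
library(RevoScaleR)

# Convert the grouping columns to factors so they can act as categories.
cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

# Summary statistics of mpg within each level of cyl.
mpgByCyl <- rxSummary(mpg ~ cyl, data = cars)
print(mpgByCyl)

# Counts for every combination of cyl and gear.
cylByGear <- rxCrossTabs(~ cyl:gear, data = cars)
print(cylByGear)
```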
6-Prediction functions
Function name | Description |
---|---|
rxLinMod * | Fits a linear model to data. |
rxLogit * | Fits a logistic regression model to data. |
rxGlm * | Fits a generalized linear model to data. |
rxCovCor * | Calculate the covariance, correlation, or sum of squares / cross-product matrix for a set of variables. |
rxDTree * | Fits a classification or regression tree to data. |
rxBTrees * | Fits a classification or regression decision forest to data using a stochastic gradient boosting algorithm. |
rxDForest * | Fits a classification or regression decision forest to data. |
rxPredict * | Calculates predictions for fitted models. Output must be an XDF data source. |
rxKmeans * | Performs k-means clustering. |
rxNaiveBayes * | Performs Naive Bayes classification. |
rxCov | Calculate the covariance matrix for a set of variables. |
rxCor | Calculate the correlation matrix for a set of variables. |
rxSSCP | Calculate the sum of squares / cross-product matrix for a set of variables. |
rxRoc | Receiver Operating Characteristic (ROC) computations using actual and predicted values from binary classifier system. |
* Signifies the most popular functions in this category.
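As a minimal modeling sketch, the lines below fit a linear model with rxLinMod on the built-in mtcars data frame and print the estimated coefficients. The choice of predictors is arbitrary and only for illustration.

```r
library(RevoScaleR)

# Fit mpg as a linear function of weight and horsepower.
fit <- rxLinMod(mpg ~ wt + hp, data = mtcars)

# Model summary: coefficients, standard errors, and fit statistics.
print(summary(fit))

# Estimated coefficients: intercept, wt, and hp.
print(fit$coefficients)
```

The formula interface mirrors base R's lm, but the fit is computed chunk by chunk, so the same call can be pointed at an .xdf data source that does not fit in memory.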
7-Compute context functions
Function name | Description |
---|---|
RxComputeContext | Creates a compute context. |
rxSetComputeContext | Sets a compute context. |
rxGetComputeContext | Gets the current compute context. |
RxSpark | Creates an in-data, file-based Spark compute context. Computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. |
RxHadoopMR | Creates an in-data, file-based Hadoop compute context. |
RxInTeradata | Creates an in-database compute context for Teradata. |
RxInSqlServer | Creates an in-database compute context for SQL Server. |
RxLocalSeq | Creates a local compute context for rxExec using sequential computations. |
RxLocalParallel | Creates a local compute context for rxExec using the **parallel** package as backend. |
RxForeachDoPar | Creates a compute context for rxExec using the current foreach parallel backend. |
8-Distributed computing
These functions and many more can be used for high performance computing and distributed computing. Learn more about the entire set of functions in Distributed Computing.
Function name | Description |
---|---|
rxExec | Run an arbitrary R function on nodes or cores of a cluster. |
rxRngNewStream | Support for Parallel Random Number Generation. |
rxRngDelStream | Support for Parallel Random Number Generation. |
rxRngGetStream | Support for Parallel Random Number Generation. |
rxRngSetStream | Support for Parallel Random Number Generation. |
rxGetAvailableNodes | Get all the available nodes on a distributed compute context. |
rxGetNodeInfo | Get information on nodes specified for a distributed compute context. |
rxPingNodes | Test round trip from user through computation node(s) in a cluster or cloud. |
rxGetJobStatus | Get the status of a non-waiting distributed computing job. |
rxGetJobResults | Get the return object(s) of a non-waiting distributed computing job. |
rxGetJobOutput | Get the console output from a non-waiting distributed computing job. |
rxGetJobs | Get the available distributed computing job information objects. |
rxLocateFile | Get the first occurrence of a specified input file in a set of specified paths. |
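A small sketch of rxExec follows. It assumes a single machine and uses the RxLocalParallel compute context, so the tasks are spread across local cores rather than cluster nodes; the simulated task is arbitrary.

```r
library(RevoScaleR)

# Use the local parallel backend for rxExec.
rxSetComputeContext(RxLocalParallel())

# Run the same function four times; each run draws n random values
# and returns their mean. Results come back as a list, one per run.
results <- rxExec(function(n) mean(rnorm(n)), n = 1000, timesToRun = 4)
print(results)

# Return to the default sequential local context.
rxSetComputeContext("local")
```

On a distributed compute context, the same rxExec call would be dispatched to the nodes of the cluster instead of local cores.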
9-Utility functions
Some of the utility functions are operational in local compute context only. Check the documentation of individual functions to confirm.
Function name | Description |
---|---|
rxOptions | Gets or sets a specific option. |
rxGetOption | Retrieves a specific RevoScaleR option. |
rxGetEnableThreadPool | Gets the current state of the thread pool, which on Linux can be either persistent or on-demand. |
rxSetEnableThreadPool | Sets the thread pool state. |
rxStepControl | Constructs a variableSelection argument for rxLinMod. |
10-Package management
Function name | Description |
---|---|
rxInstallPackages | Installs a package. |
rxInstalledPackages | Returns the list of installed packages for a compute context. |
rxFindPackage | Returns the path to one or more packages for a compute context. |
rxRemovePackages | Removes installed packages from a compute context. |
rxSqlLibPaths | Gets the search path for the library trees for packages while executing inside SQL Server. |
Next steps
Add R packages to your computer by running setup:
Next, follow these tutorials for hands-on experience: