RevoScaleR (R package in SQL Server Machine Learning Services)
Applies to: SQL Server 2016 (13.x) and later versions
RevoScaleR is an R package from Microsoft that supports distributed computing, remote compute contexts, and high-performance data science algorithms. It also supports data import, data transformation, summarization, visualization, and analysis. The package is included in SQL Server Machine Learning Services and SQL Server 2016 R Services.
In contrast with base R functions, RevoScaleR operations can be performed against large datasets, in parallel, and on distributed file systems. Functions can operate over datasets that do not fit in memory by using chunking and by reassembling results when operations are complete.
RevoScaleR functions are denoted with a rx** or Rx prefix to make them easy to identify.
RevoScaleR serves as a platform for distributed data science. For example, you can use the RevoScaleR compute contexts and transformations with the state-of-the-art algorithms in MicrosoftML. You can also use rxExec to run base R functions in parallel.
Full reference documentation
The RevoScaleR package is distributed in multiple Microsoft products, but usage is the same whether you get the package in SQL Server or another product. Because the functions are the same, documentation for individual RevoScaleR functions is published to just one location under the R reference. Should any product-specific behaviors exist, discrepancies will be noted in the function help page.
Versions and platforms
The RevoScaleR package is based on R 3.4.3 and available only when you install one of the following Microsoft products or downloads:
Note
Full product release versions are Windows-only in SQL Server 2017. Both Windows and Linux are supported for RevoScaleR in SQL Server 2019.
Functions by category
This section lists the functions by category to give you an idea of how each one is used. You can also use the table of contents to find functions in alphabetical order.
1-Data source and compute
RevoScaleR includes functions for creating data sources and setting the location, or compute context, of where computations are performed. A data source object is a container that specifies a connection string together with the set of data that you want, defined either as a table, view, or query. Stored procedure calls are not supported. Functions relevant to SQL Server scenarios are listed in the table below.
SQL Server and R use different data types in some cases. For a list of mappings between SQL and R data types, see R-to-SQL data types.
Function | Description |
---|---|
RxInSqlServer | Create a SQL Server compute context object to push computations to a remote instance. Several RevoScaleR functions take compute context as an argument. |
rxGetComputeContext / rxSetComputeContext | Get or set the active compute context. |
RxSqlServerData | Create a data object based on a SQL Server query or table. |
RxOdbcData | Create a data source based on an ODBC connection. |
RxXdfData | Create a data source based on a local XDF file. XDF files are often used to offload in-memory data to disk. An XDF file can be useful when working with more data than can be transferred from the database in one batch, or more data than can fit in memory. For example, if you regularly move large amounts of data from a database to a local workstation, rather than query the database repeatedly for each R operation, you can use the XDF file as a kind of cache to save the data locally and then work with it in your R workspace. |
Tip
If you are new to the idea of data sources or compute contexts, we recommend that you start with the article Distributed computing.
Perform DDL statements
You can execute DDL statements from R, if you have the necessary permissions on the instance and database. The following functions use ODBC calls to execute DDL statements or retrieve the database schema.
Function | Description |
---|---|
rxSqlServerTableExists and rxSqlServerDropTable | Drop a SQL Server table, or check for the existence of a database table or object. |
rxExecuteSQLDDL | Execute a Data Definition Language (DDL) command that defines or manipulates database objects. This function cannot return data, and is used only to retrieve or modify the object schema or metadata. |
2-Data manipulation (ETL)
After you have created a data source object, you can use the object to load data into it, transform data, or write new data to the specified destination. Depending on the size of the data in the source, you can also define the batch size as part of the data source and move data in chunks.
Function | Description |
---|---|
rxOpen-methods | Check whether a data source is available, open or close a data source, read data from a source, write data to the target, and close a data source. |
rxImport | Move data from a data source into file storage or into a data frame. |
rxDataStep | Transform data while moving it between data sources. |
3-Graphing functions
Function name | Description |
---|---|
rxHistogram | Creates a histogram from data. |
rxLinePlot | Creates a line plot from data. |
rxLorenz | Computes a Lorenz curve, which can be plotted. |
rxRocCurve | Computes and plots ROC curves from actual and predicted data. |
4-Descriptive statistics
Function name | Description |
---|---|
rxQuantile * | Computes approximate quantiles for .xdf files and data frames without sorting. |
rxSummary * | Basic summary statistics of data, including computations by group. Writing by group computations to .xdf file not supported. |
rxCrossTabs * | Formula-based cross-tabulation of data. |
rxCube * | Alternative formula-based cross-tabulation designed for efficient representation returning cube results. Writing output to .xdf file not supported. |
rxMarginals | Marginal summaries of cross-tabulations. |
as.xtabs | Converts cross tabulation results to an xtabs object. |
rxChiSquaredTest | Performs Chi-squared Test on xtabs object. Used with small data sets and does not chunk data. |
rxFisherTest | Performs Fisher's Exact Test on xtabs object. Used with small data sets and does not chunk data. |
rxKendallCor | Computes Kendall's Tau Rank Correlation Coefficient using xtabs object. |
rxPairwiseCrossTab | Apply a function to pairwise combinations of rows and columns of an xtabs object. |
rxRiskRatio | Calculate the relative risk on a two-by-two xtabs object. |
rxOddsRatio | Calculate the odds ratio on a two-by-two xtabs object. |
* Signifies the most popular functions in this category.
5-Prediction functions
Function name | Description |
---|---|
rxLinMod * | Fits a linear model to data. |
rxLogit * | Fits a logistic regression model to data. |
rxGlm * | Fits a generalized linear model to data. |
rxCovCor * | Calculate the covariance, correlation, or sum of squares / cross-product matrix for a set of variables. |
rxDTree * | Fits a classification or regression tree to data. |
rxBTrees * | Fits a classification or regression decision forest to data using a stochastic gradient boosting algorithm. |
rxDForest * | Fits a classification or regression decision forest to data. |
rxPredict * | Calculates predictions for fitted models. Output must be an XDF data source. |
rxKmeans * | Performs k-means clustering. |
rxNaiveBayes * | Performs Naive Bayes classification. |
rxCov | Calculate the covariance matrix for a set of variables. |
rxCor | Calculate the correlation matrix for a set of variables. |
rxSSCP | Calculate the sum of squares / cross-product matrix for a set of variables. |
rxRoc | Receiver Operating Characteristic (ROC) computations using actual and predicted values from binary classifier system. |
* Signifies the most popular functions in this category.
How to work with RevoScaleR
Functions in RevoScaleR are callable in R code encapsulated in stored procedures. Most developers build RevoScaleR solutions locally, and then migrate finished R code to stored procedures as a deployment exercise.
When running locally, you typically run an R script from the command line, or from an R development environment, and specify a SQL Server compute context using one of the RevoScaleR functions. You can use the remote compute context for the entire code, or for individual functions. For example, you might want to offload model training to the server to use the latest data and avoid data movement.
When you are ready to encapsulate R script inside a stored procedure, sp_execute_external_script, we recommend rewriting the code as a single function that has clearly defined inputs and outputs.