Extend U-SQL scripts with R code in Azure Data Lake Analytics
Important
Azure Data Lake Analytics retired on 29 February 2024. Learn more with this announcement.
For data analytics, your organization can use Azure Synapse Analytics or Microsoft Fabric.
The following example illustrates the basic steps for deploying R code:
- Use the
REFERENCE ASSEMBLY
statement to enable R extensions for the U-SQL Script. - Use the
REDUCE
operation to partition the input data on a key. - The R extensions for U-SQL include a built-in reducer (
Extension.R.Reducer
) that runs R code on each vertex assigned to the reducer. - Usage of dedicated named data frames called
inputFromUSQL
andoutputToUSQL
respectively to pass data between U-SQL and R. Input and output DataFrame identifier names are fixed (that is, users can't change these predefined names of input and output DataFrame identifiers).
Embedding R code in the U-SQL script
You can inline the R code your U-SQL script by using the command parameter of the Extension.R.Reducer
. For example, you can declare the R script as a string variable and pass it as a parameter to the Reducer.
REFERENCE ASSEMBLY [ExtR];
DECLARE @myRScript = @"
inputFromUSQL$Species = as.factor(inputFromUSQL$Species)
lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
#do not return readonly columns and make sure that the column names are the same in usql and r cripts,
outputToUSQL=data.frame(summary(lm.fit)$coefficients)
colnames(outputToUSQL) <- c(""Estimate"", ""StdError"", ""tValue"", ""Pr"")
outputToUSQL
";
@RScriptOutput = REDUCE … USING new Extension.R.Reducer(command:@myRScript, ReturnType:"dataframe");
Keep the R code in a separate file and reference it the U-SQL script
The following example illustrates a more complex usage. In this case, the R code is deployed as a RESOURCE that is the U-SQL script.
Save this R code as a separate file.
load("my_model_LM_Iris.rda")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval="confidence"))
Use a U-SQL script to deploy that R script with the DEPLOY RESOURCE statement.
REFERENCE ASSEMBLY [ExtR];
DEPLOY RESOURCE @"/usqlext/samples/R/RinUSQL_PredictUsingLinearModelasDF.R";
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda";
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/LMPredictionsIris.txt";
DECLARE @PartitionCount int = 10;
@InputData =
EXTRACT
SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData =
SELECT
Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par,
SepalLength,
SepalWidth,
PetalLength,
PetalWidth
FROM @InputData;
// Predict Species
@RScriptOutput = REDUCE @ExtendedData ON Par
PRODUCE Par, fit double, lwr double, upr double
READONLY Par
USING new Extension.R.Reducer(scriptFile:"RinUSQL_PredictUsingLinearModelasDF.R", rReturnType:"dataframe", stringsAsFactors:false);
OUTPUT @RScriptOutput TO @OutputFilePredictions USING Outputters.Tsv();
How R Integrates with U-SQL
Datatypes
- String and numeric columns from U-SQL are converted as-is between R DataFrame and U-SQL [supported types:
double
,string
,bool
,integer
,byte
]. - The
Factor
datatype isn't supported in U-SQL. byte[]
must be serialized as a base64-encodedstring
.- U-SQL strings can be converted to factors in R code, once U-SQL create R input dataframe or by setting the reducer parameter
stringsAsFactors: true
.
Schemas
- U-SQL datasets can't have duplicate column names.
- U-SQL datasets column names must be strings.
- Column names must be the same in U-SQL and R scripts.
- Readonly column can't be part of the output dataframe. Because readonly columns are automatically injected back in the U-SQL table if it's a part of output schema of UDO.
Functional limitations
- The R Engine can't be instantiated twice in the same process.
- Currently, U-SQL doesn't support Combiner UDOs for prediction using partitioned models generated using Reducer UDOs. Users can declare the partitioned models as resource and use them in their R Script (see sample code
ExtR_PredictUsingLMRawStringReducer.usql
)
R Versions
Only R 3.2.2 is supported.
Standard R modules
base
boot
Class
Cluster
codetools
compiler
datasets
doParallel
doRSR
foreach
foreign
Graphics
grDevices
grid
iterators
KernSmooth
lattice
MASS
Matrix
Methods
mgcv
nlme
Nnet
Parallel
pkgXMLBuilder
RevoIOQ
revoIpe
RevoMods
RevoPemaR
RevoRpeConnector
RevoRsrConnector
RevoScaleR
RevoTreeView
RevoUtils
RevoUtilsMath
Rpart
RUnit
spatial
splines
Stats
stats4
survival
Tcltk
Tools
translations
utils
XML
Input and Output size limitations
Every vertex has a limited amount of memory assigned to it. Because the input and output DataFrames must exist in memory in the R code, the total size for the input and output can't exceed 500 MB.
Sample code
More sample code is available in your Data Lake Store account after you install the U-SQL Advanced Analytics extensions. The path for more sample code is: <your_account_address>/usqlext/samples/R
.
Deploying Custom R modules with U-SQL
First, create an R custom module and zip it and then upload the zipped R custom module file to your ADL store. In the example, we'll upload magittr_1.5.zip to the root of the default ADLS account for the ADLA account we're using. Once you upload the module to ADL store, declare it as use DEPLOY RESOURCE to make it available in your U-SQL script and call install.packages
to install it.
REFERENCE ASSEMBLY [ExtR];
DEPLOY RESOURCE @"/magrittr_1.5.zip";
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFileModelSummary string = @"/R/Output/CustomPackages.txt";
// R script to run
DECLARE @myRScript = @"
# install the magrittr package,
install.packages('magrittr_1.5.zip', repos = NULL),
# load the magrittr package,
require(magrittr),
# demonstrate use of the magrittr package,
2 %>% sqrt
";
@InputData =
EXTRACT SepalLength double,
SepalWidth double,
PetalLength double,
PetalWidth double,
Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData =
SELECT 0 AS Par,
*
FROM @InputData;
@RScriptOutput = REDUCE @ExtendedData ON Par
PRODUCE Par, RowId int, ROutput string
READONLY Par
USING new Extension.R.Reducer(command:@myRScript, rReturnType:"charactermatrix");
OUTPUT @RScriptOutput TO @OutputFileModelSummary USING Outputters.Tsv();