rxPredict: Predicted values and residuals for model objects built using RevoScaleR
rxPredict(modelObject, data = NULL, ...)

## S3 method for class 'default':
rxPredict(modelObject, data = NULL, outData = NULL,
    computeStdErrors = FALSE, interval = "none", confLevel = 0.95,
    computeResiduals = FALSE, type = c("response", "link"),
    writeModelVars = FALSE, extraVarsToWrite = NULL,
    removeMissings = FALSE, append = c("none", "rows"),
    overwrite = FALSE, checkFactorLevels = TRUE,
    predVarNames = NULL, residVarNames = NULL,
    intervalVarNames = NULL, stdErrorsVarNames = NULL,
    predNames = NULL,
    blocksPerRead = rxGetOption("blocksPerRead"),
    reportProgress = rxGetOption("reportProgress"),
    verbose = 0,
    xdfCompressionLevel = rxGetOption("xdfCompressionLevel"), ...)
modelObject: object returned from a RevoScaleR model fitting function. Valid values include objects returned by rxLinMod, rxLogit, rxGlm, and rxDForest. Objects with multiple dependent variables are not supported in rxPredict.
data: an RxXdfData data source object to be used for predictions. If not using a distributed compute context such as RxHadoopMR, a data frame or a character string specifying the input .xdf file can also be used.
outData: file or existing data frame to store predictions; can be the same as the input file or NULL. If not NULL, a character string specifying the output .xdf file, an RxXdfData object, an RxOdbcData data source, or an RxSqlServerData data source. outData can also be a delimited RxTextData data source if using a native file system and not appending.
computeStdErrors: logical value. If TRUE, the standard errors for each dependent variable are calculated.
interval: character string defining the type of interval calculation to perform. Supported values are "none", "confidence", and "prediction".
confLevel: numeric value representing the confidence level on the interval [0, 1].
computeResiduals: logical value. If TRUE, residuals are computed.
type: applies to rxGlm and rxLogit; sets the type of prediction. Valid values are "response" and "link". If type = "response", the predictions are on the scale of the response variable. For instance, for the binomial model, the predictions are in the range (0,1). If type = "link", the predictions are on the scale of the linear predictors. Thus for the binomial model, the predictions are log-odds.
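The difference between the two prediction scales can be illustrated with a small sketch. It uses base R's glm and predict on the built-in mtcars data as a stand-in for rxLogit and rxPredict; that substitution, the data set, and the variable names are assumptions made so the example runs without RevoScaleR.

```r
# Base R stand-in for a binomial model (not RevoScaleR): transmission type vs. weight.
fit <- glm(am ~ wt, data = mtcars, family = binomial())

# type = "link": predictions on the scale of the linear predictor (log-odds).
linkPred <- predict(fit, newdata = mtcars, type = "link")

# type = "response": predictions on the scale of the response (probabilities).
respPred <- predict(fit, newdata = mtcars, type = "response")

# The inverse logit (plogis) maps log-odds to probabilities, so the two
# sets of predictions agree once the link predictions are transformed.
stopifnot(isTRUE(all.equal(plogis(linkPred), respPred)))

# Probabilities stay strictly inside (0, 1); log-odds are unbounded.
range(respPred)
range(linkPred)
```

With rxPredict the same choice is made through the type argument, and the output column naming follows the predVarNames conventions described below rather than base R's.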
writeModelVars: logical value. If TRUE, and the output data set is different from the input data set, variables in the model will be written to the output data set in addition to the predictions (and residuals, standard errors, and confidence bounds, if requested). If variables from the input data set are transformed in the model, the transformed variables will also be included.
extraVarsToWrite: NULL or character vector of additional variable names from the input data or transforms to include in the outData. If writeModelVars is TRUE, model variables will be included as well.
removeMissings: logical value. If TRUE, rows with missing values are removed.
append: either "none" to create a new file or "rows" to append rows to an existing file. If outData exists and append is "none", the overwrite argument must be set to TRUE. You can append only to an RxTeradata data source. Ignored for data frames.
overwrite: logical value. If TRUE, an existing outData will be overwritten. overwrite is ignored if appending rows. Ignored for data frames.
checkFactorLevels: logical value. If TRUE, up to 1000 factor levels for the data will be verified against factor levels in the model. Setting to FALSE can speed up computations when the model uses many factors.
predVarNames: character vector specifying name(s) to give to the prediction results.
residVarNames: character vector specifying name(s) to give to the residual results.
intervalVarNames: NULL or a character vector defining low and high confidence interval variable names, respectively. If NULL, the strings "_Lower" and "_Upper" are appended to the dependent variable names to form the confidence interval variable names.
stdErrorsVarNames: NULL or a character vector defining variable names corresponding to the standard errors, if calculated. If NULL, the string "_StdErr" is appended to the dependent variable names to form the standard errors variable names.
predNames: character vector specifying name(s) to give to the prediction and residual results; if length is 2, the second name is used for residuals. This argument is deprecated; predVarNames and residVarNames should be used instead.
blocksPerRead: number of blocks to read for each chunk of data read from the data source. If the data and outData are the same file, blocksPerRead must be 1.
reportProgress: integer value with options:
0: no progress is reported.
1: the number of processed rows is printed and updated.
2: rows processed and timings are reported.
3: rows processed and all timings are reported.
verbose: integer value. If 0, no additional output is printed. If 1, additional summary information is printed.
xdfCompressionLevel: integer in the range of -1 to 9 indicating the compression level for the output data if written to an .xdf file. The higher the value, the greater the amount of compression, resulting in smaller files but a longer time to create them. If xdfCompressionLevel is set to 0, there will be no compression and files will be compatible with the 6.0 release of Revolution R Enterprise. If set to -1, a default level of compression will be used.
...: additional arguments to be passed directly to the Revolution Compute Engine.
rxPredict computes predicted values and/or residuals from an existing fitted model. The most common way to call rxPredict is rxPredict(modelObject, data, outData). Typically, all the other arguments are left at their defaults.
For rxLogit, the residuals are equivalent to those computed for glm with type set to "response", e.g., residuals(glmObj, type = "response").
If the data is the same data used to create the modelObject, the predicted values are the fitted values for the original model.
If the data specified is an .xdf file, the outData must be NULL or an .xdf file. If outData is an .xdf file, the computed data will be appended as columns. If outData is NULL, the computed columns will be appended to the data .xdf file.
If the data specified is a data frame, the outData must be NULL or a data frame. If outData is a data frame, a copy of the data frame with the new columns appended will be returned. If outData is NULL, a vector or list of the computed values will be returned.
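The data frame case described above can be sketched as follows. This is a minimal illustration, assuming RevoScaleR is installed; the model formula and the names irisLinMod, predOnly, and irisWithPred are chosen for the example, and the whole block is guarded so it is skipped when the package is unavailable.

```r
# Illustrative only: requires the RevoScaleR package.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # Fit a linear model on an in-memory data frame.
  irisLinMod <- rxLinMod(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

  # outData = NULL (the default): only the computed values are returned.
  predOnly <- rxPredict(modelObject = irisLinMod, data = iris)
  print(names(predOnly))

  # outData is a data frame: a copy of it is returned with the
  # computed columns appended.
  irisWithPred <- rxPredict(modelObject = irisLinMod, data = iris,
                            outData = iris)
  print(names(irisWithPred))
}
```

Passing a data frame as outData is convenient when the predictions need to sit next to the original columns; leaving it NULL avoids copying the input.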
If a transformation function is being used for the model estimation, the information variable .rxIsPrediction can be used to exclude computations for the dependent variable when running rxPredict. See rxTransform for an example.
If a data frame is specified as the input data, a data frame is returned. If a data frame is specified as the outData, variables containing the results are added to the data frame and it is returned. If outData is NULL, a data frame containing the predicted values (and residuals and standard errors, if requested) is returned.
If an .xdf file is specified as the input data, an RxXdfData data source object is returned that can be used in subsequent RevoScaleR analyses. If outData is an .xdf file, the RxXdfData data source represents the outData file. If outData is NULL, the predicted values (and, if requested, residuals) are appended to the original data file. The returned RxXdfData object represents this file.
Computing Standard Errors of Predicted Values
Use computeStdErrors to control whether or not prediction standard errors are computed.
Use interval to control whether confidence or prediction (tolerance) intervals are computed at the specified level (confLevel). These are sometimes referred to as narrow and wide intervals, respectively.
Use stdErrorsVarNames to name the standard errors output variable and intervalVarNames to specify the output variable names of the lower and upper confidence/tolerance intervals.
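To make the "narrow" versus "wide" distinction concrete, the following sketch uses base R's lm and predict rather than rxLinMod and rxPredict; that substitution is an assumption made so the example runs without RevoScaleR. Base R names the bounds lwr and upr, which differs from the rxPredict naming described above.

```r
# Base R stand-in (not RevoScaleR): a linear model on the built-in iris data.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
newRows <- iris[c(1, 51, 101), ]

# Confidence interval: uncertainty about the mean response (narrow).
confInt <- predict(fit, newdata = newRows, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about a single new observation (wide).
predInt <- predict(fit, newdata = newRows, interval = "prediction", level = 0.95)

confWidth <- confInt[, "upr"] - confInt[, "lwr"]
predWidth <- predInt[, "upr"] - predInt[, "lwr"]

# The prediction interval adds the residual variance, so it is always wider.
stopifnot(all(predWidth > confWidth))

# Standard errors of the fitted means, the base R analogue of computeStdErrors.
seFit <- predict(fit, newdata = newRows, se.fit = TRUE)$se.fit
print(cbind(confWidth, predWidth, seFit))
```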
In calculating the prediction standard errors, keep the following in mind:
* Standard errors are computationally intensive for large models, i.e., those involving a large number of model parameters.
* Cube regressions are not supported (cube = TRUE).
* Multiple dependent variables are currently not supported.
* For rxLogit, interval = "confidence" is supported (unlike predict.glm, which does not support confidence bounds), but interval = "prediction" is not supported.
If residuals are requested, and if there are missing values in the dependent variable, then all computed values (prediction, standard errors, confidence levels) will be assigned the value missing, and will be removed if removeMissings = TRUE.
If no residuals are requested, then missings in the dependent variable (which need not exist in the data) have no effect.
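The interaction between residuals and missing dependent values can be checked with a sketch like the one below. It assumes RevoScaleR is installed (the block is skipped otherwise); the model, the irisNA data frame, and the counter names are illustrative. No expected counts are shown because they depend on the behavior described above.

```r
# Illustrative only: requires the RevoScaleR package.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  mod <- rxLinMod(Sepal.Length ~ Sepal.Width, data = iris)

  # Blank out the dependent variable in the first three rows.
  irisNA <- iris
  irisNA$Sepal.Length[1:3] <- NA

  # No residuals requested: the dependent variable is not needed.
  noResid <- rxPredict(mod, data = irisNA)

  # Residuals requested: rows with a missing dependent value are affected.
  withResid <- rxPredict(mod, data = irisNA, computeResiduals = TRUE)

  # Compare how many predictions are missing in each case.
  nMissingNoResid <- sum(is.na(noResid$Sepal.Length_Pred))
  nMissingWithResid <- sum(is.na(withResid$Sepal.Length_Pred))
  print(c(noResid = nMissingNoResid, withResid = nMissingWithResid))
}
```

Adding removeMissings = TRUE to the second call would drop the affected rows instead of returning them with missing values.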
Microsoft Technical Support
# Load the built-in iris data set and predict sepal length
myIris <- iris
myIris[1:5,]
form <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
irisLinMod <- rxLinMod(form, data = myIris)
myIrisPred <- rxPredict(modelObject = irisLinMod, data = myIris)
myIris$SepalLengthPred <- myIrisPred$Sepal.Length_Pred
myIris[1:5,]

irisResiduals <- rxPredict(modelObject = irisLinMod, data = myIris,
    computeResiduals = TRUE)
names(irisResiduals)

# Use sample data to compare lm and glm results with rxPredict
sampleDataDir <- rxGetOption("sampleDataDir")
mortFile <- file.path(sampleDataDir, "mortDefaultSmall.xdf")
linModPredictFile <- file.path(tempdir(), "mortPredictLinMod.xdf")
logitPredictFile <- file.path(tempdir(), "mortPredictLogit.xdf")
mortDF <- rxDataStep(inData = mortFile)

# Compare residuals from rxLinMod with lm
linMod <- rxLinMod(creditScore ~ yearsEmploy, data = mortFile)
rxPredict(modelObject = linMod, data = mortFile,
    outData = linModPredictFile, computeResiduals = TRUE)
residDF <- rxDataStep(inData = linModPredictFile)
mortLM <- lm(creditScore ~ yearsEmploy, data = mortDF)
# Sum of differences should be very small
sum(mortLM$residuals - residDF$creditScore_Resid)

# Create logit model object and compute predictions and residuals
logitModObj <- rxLogit(default ~ creditScore, data = mortFile)
rxPredict(modelObject = logitModObj, data = mortFile,
    outData = logitPredictFile, computeResiduals = TRUE)
residDF <- rxDataStep(inData = logitPredictFile)
mortGLM <- glm(default ~ creditScore, data = mortDF, family = binomial())
# Maximum differences should be very small
max(abs(mortGLM$fitted.values - residDF$default_Pred))
max(abs(residuals(mortGLM, type = "response") - residDF$default_Resid))