# rxPredict: Predicted values and residuals for model objects built using RevoScaleR

## Description

Compute predicted values and residuals using the following objects: rxLinMod, rxGlm, rxLogit, rxDTree, rxBTrees, and rxDForest.

## Usage

```
rxPredict(modelObject, data = NULL, ...)
## S3 method for class `default':
rxPredict (modelObject, data = NULL, outData = NULL,
computeStdErrors = FALSE, interval = "none", confLevel = 0.95,
computeResiduals = FALSE, type = c("response", "link"),
writeModelVars = FALSE, extraVarsToWrite = NULL, removeMissings = FALSE,
append = c("none", "rows"), overwrite = FALSE, checkFactorLevels = TRUE,
predVarNames = NULL, residVarNames = NULL,
intervalVarNames = NULL, stdErrorsVarNames = NULL, predNames = NULL,
blocksPerRead = rxGetOption("blocksPerRead"),
reportProgress = rxGetOption("reportProgress"), verbose = 0,
xdfCompressionLevel = rxGetOption("xdfCompressionLevel"), ...)
```

## Arguments

`modelObject`

object returned from a RevoScaleR model fitting function. Valid values include `rxLinMod`

, `rxLogit`

, `rxGlm`

, `rxDTree`

, `rxBTrees`

, and `rxDForest`

. Objects with multiple dependent variables are not supported in rxPredict.

`data`

An RxXdfData data source object to be used for predictions. If not using a distributed compute context such as RxHadoopMR, a data frame, or a character string specifying the input .xdf file can also be used.

`outData`

file or existing data frame to store predictions; can be same as the input file or `NULL`

. If not `NULL`

, a character string specifying the output �.xdf� file, a RxXdfData object, a RxOdbcData data source, or a RxSqlServerData data source. `outData`

can also be a delimited RxTextData data source if using a native file system and not appending.

`computeStdErrors`

logical value. If `TRUE`

, the standard errors for each dependent variable are calculated.

`interval`

character string defining the type of interval calculation to perform. Supported values are `"none"`

, `"confidence"`

, and `"prediction"`

.

`confLevel`

numeric value representing the confidence level on the interval [0, 1].

`computeResiduals`

logical value. If `TRUE`

, residuals are computed.

`type`

Applies to rxGlm and rxLogit, used to set the type of prediction. Valid values are `"response"`

and `"link"`

. If `type = "response"`

, the predictions are on the scale of the response variable. For instance, for the binomial model, the predictions are in the range (0,1). If `type = "link"`

, the predictions are on the scale of the linear predictors. Thus for the binomial model, the predictions are of log-odds.

`writeModelVars`

logical value. If `TRUE`

, and the output data set is different from the input data set, variables in the model will be written to the output data set in addition to the predictions (and residuals, standard errors, and confidence bounds, if requested). If variables from the input data set are transformed in the model, the transformed variables will also be included.

`extraVarsToWrite`

`NULL`

or character vector of additional variables names from the input data or transforms to include in the `outData`

. If `writeModelVars`

is `TRUE`

, model variables will be included as well.

`removeMissings`

logical value. If `TRUE`

, rows with missing values are removed.

`append`

either `"none"`

to create a new files or `"rows"`

to append rows to an existing file. If `outData`

exists and `append`

is `"none"`

, the `overwrite`

argument must be set to `TRUE`

. You can append only to RxTeradata data source. Ignored for data frames.

`overwrite`

logical value. If `TRUE`

, an existing `outData`

will be overwritten. `overwrite`

is ignored if appending rows. Ignored for data frames.

`checkFactorLevels`

logical value. If `TRUE`

, up to 1000 factor levels for the data will be verified against factor levels in the model. Setting to `FALSE`

can speed up computations if using lots of factors.

`predVarNames`

character vector specifying name(s) to give to the prediction results

`residVarNames`

character vector specifying name(s) to give to the residual results.

`intervalVarNames`

`NULL`

or a character vector defining low and high confidence interval variable names, respectively. If `NULL`

, the strings `"_Lower"`

and `"_Upper"`

are appended to the dependent variable names to form the confidence interval variable names.

`stdErrorsVarNames`

`NULL`

or a character vector defining variable names corresponding to the standard errors, if calculated. If `NULL`

, the string `"_StdErr"`

is appended to the dependent variable names to form the standard errors variable names.

`predNames`

character vector specifying name(s) to give to the prediction and residual results; if length is 2, the second name is used for residuals. This argument is deprecated and `predVarNames`

and `residVarNames`

should be used instead.

`blocksPerRead`

number of blocks to read for each chunk of data read from the data source. If the `data`

and `outData`

are the same file, blocksPerRead must be 1.

`reportProgress`

integer value with options:

`0`

: no progress is reported.`1`

: the number of processed rows is printed and updated.`2`

: rows processed and timings are reported.`3`

: rows processed and all timings are reported.

`verbose`

integer value. If `0`

, no additional output is printed. If `1`

, additional summary information is printed.

`xdfCompressionLevel`

integer in the range of -1 to 9 indicating the compression level for the output data if written to an `.xdf`

file. The higher the value, the greater the amount of compression - resulting in smaller files but a longer time to create them. If `xdfCompressionLevel`

is set to 0, there will be no compression and files will be compatible with the 6.0 release of Revolution R Enterprise. If set to -1, a default level of compression will be used.

` ...`

additional arguments to be passed directly to the Revolution Compute Engine.

## Details

`rxPredict`

computes predicted values and/or residuals from an existing
model type. The most common way to call rxPredict is `rxPredict(modelObject, data, outData)`

. Typically, all the other arguments are left at their defaults.

For rxLogit, the residuals are equivalent to those
computed for `glm`

with `type`

set to `"response"`

, e.g.,
`residuals(glmObj, type="response")`

.

If the `data`

is the same data used to create the `modelObject`

, the
predicted values are the fitted values for the original model.

If the `data`

specified is an .xdf file, the `outData`

must
be `NULL`

or an .xdf file. If `outData`

is an .xdf
file, the computed data will be appended as columns. If `outData`

is
`NULL`

, the computed columns will be appended to the `data`

.xdf file.

If the `data`

specified is a data frame, the `outData`

must be
`NULL`

or a data frame. If `outData`

is a data frame, a copy of the
data frame with the new columns appended will be returned. If `outData`

is `NULL`

, a vector or list of the computed values will be returned.

If a transformation function is being used for the model estimation,
the information variable `.rxIsPrediction`

can be used to
exclude computations for the dependent variable when running
`rxPredict`

. See rxTransform for an example.

## Value

If a data frame is specified as the input `data`

, a data frame is returned.
If a data frame is specified as the `outData`

, variables containing the
results are added to the data frame and it is returned.
If `outData`

is `NULL`

, a data frame containing
the predicted values (and residuals and standard errors, if requested) is returned.

If an .xdf file is specified as the input `data`

, an RxXdfData
data source object is returned that can be used in subsequent RevoScaleR analyses.
If `outData`

is an .xdf file, the RxXdfData
data source represents the `outData`

file. If `outData`

is `NULL`

,
the predicted values (and, if requested, residuals) are appended to the original
`data`

file. The returned RxXdfData object represents this file.

## Computing Standard Errors of Predicted Values

Use `computeStdErrors`

to control whether or not prediction standard errors are computed.

Use `interval`

to control whether confidence or prediction (tolerance) intervals are computed at the specified level (`confLevel`

).
These are sometimes referred to as *narrow* and *wide* intervals, respectively.

Use `stdErrorsVarNames`

to name the standard errors output variable
and `intervalVarNames`

to specify the output variable names
of the lower and upper confidence/tolerance intervals.

In calculating the prediction standard errors, keep the following in mind:

* Prediction standard errors are available for both rxLinMod and rxLogit models.

* Standard errors are computationally intensive for large models, i.e., those involving a large number of model parameters.

*
rxLinMod and rxLogit must be called with `covCoef = TRUE`

because the variance-covariance
matrix of the coefficients must be available.

*
Cube regressions are not supported (`cube = TRUE`

).

* Multiple dependent variables are currently not supported.

*
For rxLogit, `interval = "confidence"`

is supported (unlike predict.glm,
which does not support confidence bounds), but `interval = "prediction"`

is not supported.

*
If residuals are requested, and if there are missing values in the dependent variable,
then all computed values (prediction, standard errors, confidence levels) will be
assigned the value missing, and will be removed if `removeMissings = TRUE`

.
If no residuals are requested, then missings in the dependent variable (which need not exist
in the data) have no effect.

## Author(s)

Microsoft Corporation `Microsoft Technical Support`

## See Also

rxLinMod, rxLogit, rxGlm, rxPredict.rxDTree, rxPredict.rxDForest, rxPredict.rxNaiveBayes.

## Examples

```
# Load the built-in iris data set and predict sepal length
myIris <- iris
myIris[1:5,]
form <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
irisLinMod <- rxLinMod(form, data = myIris)
myIrisPred <- rxPredict(modelObject = irisLinMod, data = myIris)
myIris$SepalLengthPred <- myIrisPred$Sepal.Length_Pred
myIris[1:5,]
irisResiduals<- rxPredict(modelObject = irisLinMod, data = myIris, computeResiduals = TRUE)
names(irisResiduals)
# Use sample data to compare lm and glm results with rxPredict
sampleDataDir <- rxGetOption("sampleDataDir")
mortFile <- file.path(sampleDataDir, "mortDefaultSmall.xdf")
linModPredictFile <- file.path(tempdir(), "mortPredictLinMod.xdf")
logitPredictFile <- file.path(tempdir(), "mortPredictLogit.xdf")
mortDF <- rxDataStep(inData = mortFile)
# Compare residuals from rxLinMod with lm
linMod <- rxLinMod(creditScore ~ yearsEmploy, data = mortFile)
rxPredict(modelObject = linMod, data = mortFile, outData = linModPredictFile,
computeResiduals = TRUE)
residDF <- rxDataStep(inData = linModPredictFile)
mortLM <- lm(creditScore ~ yearsEmploy, data = mortDF)
# Sum of differences should be very small
sum(mortLM$residuals - residDF$creditScore_Resid)
# Create logit model object and compute predictions and residuals
logitModObj <- rxLogit(default ~ creditScore, data = mortFile)
rxPredict(modelObject = logitModObj, data = mortFile,
outData = logitPredictFile, computeResiduals = TRUE)
residDF <- rxDataStep(inData = logitPredictFile)
mortGLM <- glm(default ~ creditScore, data = mortDF, family = binomial())
# maximum differences should be very small
max(abs(mortGLM$fitted.values - residDF$default_Pred))
max(abs(residuals(mortGLM, type = "response") - residDF$default_Resid))
```