Fitting Logistic Regression Models using Machine Learning Server
Important
This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?
Logistic regression is a standard tool for modeling data with a binary response variable. In R, you fit a logistic regression using the glm function, specifying a binomial family and the logit link function. In RevoScaleR, you can use rxGlm in the same way (see Fitting Generalized Linear Models) or you can fit a logistic regression using the optimized rxLogit function; because this function is specific to logistic regression, you need not specify a family or link function.
A Simple Logistic Regression Example
As an example, consider the kyphosis data set in the rpart package. This data set consists of 81 observations of four variables (Age, Number, Kyphosis, Start) in children following corrective spinal surgery; it is used as the initial example of glm in the White Book (see Additional Resources for more information. The variable Kyphosis reports the absence or presence of this deformity.
We can use rxLogit to model the probability that kyphosis is present as follows:
library(rpart)
rxLogit(Kyphosis ~ Age + Start + Number, data = kyphosis)
The following output is returned:
Logistic Regression Results for: Kyphosis ~ Age + Start + Number
Data: kyphosis
Dependent variable(s): Kyphosis
Total independent variables: 4
Number of valid observations: 81
Number of missing observations: 0
Coefficients:
Kyphosis
(Intercept) -2.03693354
Age 0.01093048
Start -0.20651005
Number 0.41060119
The same model can be fit with glm
(or rxGlm
) as follows:
glm(Kyphosis ~ Age + Start + Number, family = binomial, data = kyphosis)
Call: glm(formula = Kyphosis ~ Age + Start + Number, family = binomial, data = kyphosis)
Coefficients:
(Intercept) Age Start Number
-2.03693 0.01093 -0.20651 0.41060
Degrees of Freedom: 80 Total (i.e. Null); 77 Residual
Null Deviance: 83.23
Residual Deviance: 61.38 AIC: 69.38
Stepwise Logistic Regression
Stepwise logistic regression is an algorithm that helps you determine which variables are most important to a logistic model. You provide a minimal, or lower, model formula and a maximal, or upper, model formula, and using forward selection, backward elimination, or bidirectional search, the algorithm determines the model formula that provides the best fit based on an AIC or significance level selection criterion.
RevoScaleR provides an implementation of stepwise logistic regression that is not constrained by the use of "in-memory" algorithms. Stepwise linear regression in RevoScaleR is implemented by the functions rxLogit and rxStepControl.
Stepwise logistic regression begins with an initial model of some sort. We can look at the kyphosis data again and start with a simpler model: Kyphosis ~ Age:
initModel <- rxLogit(Kyphosis ~ Age, data=kyphosis)
initModel
Logistic Regression Results for: Kyphosis ~ Age
Data: kyphosis
Dependent variable(s): Kyphosis
Total independent variables: 2
Number of valid observations: 81
Number of missing observations: 0
Coefficients:
Kyphosis
(Intercept) -1.809351230
Age 0.005441758
We can specify a stepwise model using rxLogit and rxStepControl as follows:
KyphStepModel <- rxLogit(Kyphosis ~ Age,
data = kyphosis,
variableSelection = rxStepControl(method="stepwise",
scope = ~ Age + Start + Number ))
KyphStepModel
Logistic Regression Results for: Kyphosis ~ Age + Start + Number
Data: kyphosis
Dependent variable(s): Kyphosis
Total independent variables: 4
Number of valid observations: 81
Number of missing observations: 0
Coefficients:
Kyphosis
(Intercept) -2.03693354
Age 0.01093048
Start -0.20651005
Number 0.41060119
The methods for variable selection (forward, backward, and stepwise), the definition of model scope, and the available selection criteria are all the same as for stepwise linear regression; see "Stepwise Variable Selection" and the rxStepControl help file for more details.
Plotting Model Coefficients
The ability to save model coefficients using the argument keepStepCoefs = TRUE within the rxStepControl call and to plot them with the function rxStepPlot was described in great detail for stepwise rxLinMod in Fitting Linear Models using RevoScaleR. This functionality is also available for stepwise rxLogit objects.
Prediction
As described above for linear models, the objects returned by the RevoScaleR model-fitting functions do not include fitted values or residuals. We can obtain them, however, by calling rxPredict on our fitted model object, supplying the original data used to fit the model as the data to be used for prediction.
For example, consider the mortgage default example in Tutorial: Analyzing loan data with RevoScaleR. In that example, we used ten input data files to create the data set used to fit the model. But suppose instead we use nine input data files to create the training data set and use the remaining data set for prediction. We can do that as follows (again, remember to modify the first line for your own system):
# Logistic Regression Prediction
bigDataDir <- "C:/MRS/Data"
mortCsvDataName <- file.path(bigDataDir, "mortDefault", "mortDefault")
trainingDataFileName <- "mortDefaultTraining"
mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "")
targetDataFileName <- "mortDefault2009.xdf"
ageLevels <- as.character(c(0:40))
yearLevels <- as.character(c(2000:2009))
colInfo <- list(list(name = "houseAge", type = "factor",
levels = ageLevels), list(name = "year", type = "factor",
levels = yearLevels))
append= FALSE
for (i in 2000:2008)
{
importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
rxImport(inData = importFile, outFile = trainingDataFileName,
colInfo = colInfo, append = append)
append = TRUE
}
rxImport(inData = mortCsv2009, outFile = targetDataFileName,
colInfo = colInfo)
We can then fit a logistic regression model to the training data and predict with the prediction data set as follows:
logitObj <- rxLogit(default ~ year + creditScore + yearsEmploy + ccDebt,
data = trainingDataFileName, blocksPerRead = 2, verbose = 1,
reportProgress=2)
rxPredict(logitObj, data = targetDataFileName,
outData = targetDataFileName, computeResiduals = TRUE)
The
blocksPerRead
argument is ignored if run locally using R Client. Learn more...
To view the first 30 rows of the output data file, use rxGetInfo as follows:
rxGetInfo(targetDataFileName, numRows = 30)
Prediction Standard Errors and Confidence Intervals
You can use rxPredict to obtain prediction standard errors and confidence intervals for models fit with rxLogit in the same way as for those fit with rxLinMod. The original model must be fit with covCoef=TRUE:
# Prediction Standard Errors and Confidence Intervals
logitObj2 <- rxLogit(default ~ year + creditScore + yearsEmploy + ccDebt,
data = trainingDataFileName, blocksPerRead = 2, verbose = 1,
reportProgress=2, covCoef=TRUE)
The
blocksPerRead
argument is ignored if run locally using R Client. Learn more...
You then specify computeStdErr=TRUE
to obtain prediction standard errors; if this is TRUE, you can also specify interval="confidence"
to obtain a confidence interval:
rxPredict(logitObj2, data = targetDataFileName,
outData = targetDataFileName, computeStdErr = TRUE,
interval = "confidence", overwrite=TRUE)
The first ten lines of the file with predictions can be viewed as follows:
rxGetInfo(targetDataFileName, numRows=10)
File name: C:\Users\yourname\Documents\MRS\mortDefault2009.xdf
Number of observations: 1e+06
Number of variables: 10
Number of blocks: 2
Compression type: zlib
Data (10 rows starting with row 1):
creditScore houseAge yearsEmploy ccDebt year default default_Pred
1 617 20 8 4410 2009 0 6.620773e-06
2 623 11 7 5609 2009 0 4.610861e-05
3 758 17 4 7250 2009 0 4.259884e-04
4 687 22 5 3761 2009 0 3.770789e-06
5 663 15 6 6746 2009 0 2.312827e-04
6 676 10 2 7106 2009 0 1.092593e-03
7 721 23 2 2280 2009 0 8.515912e-07
8 680 18 7 2831 2009 0 6.011109e-07
9 734 9 5 3867 2009 0 3.144299e-06
10 688 16 8 6238 2009 0 5.350031e-05
default_StdErr default_Lower default_Upper
1 3.143695e-07 6.032422e-06 7.266507e-06
2 1.953612e-06 4.243427e-05 5.010109e-05
3 1.594783e-05 3.958500e-04 4.584203e-04
4 1.739047e-07 3.444893e-06 4.127516e-06
5 8.733193e-06 2.147838e-04 2.490486e-04
6 3.952975e-05 1.017797e-03 1.172880e-03
7 4.396314e-08 7.696409e-07 9.422675e-07
8 3.091885e-08 5.434655e-07 6.648706e-07
9 1.469334e-07 2.869109e-06 3.445883e-06
10 2.224102e-06 4.931401e-05 5.804197e-05
Using ROC Curves to Evaluate Estimated Binary Response Models
A receiver operating characteristic (ROC) curve can be used to visually assess binary response models. It plots the True Positive Rate (the number of correctly predicted TRUE responses divided by the actual number of TRUE responses) against the False Positive Rate (the number of incorrectly predicted TRUE responses divided by the actual number of FALSE responses), calculated at various thresholds. The True Positive Rate is the same as the sensitivity, and the False Positive Rate is equal to one minus the specificity.
Let’s start with a simple example. Suppose we have a data set with 10 observations. The variable actual contains the actual responses, or the ‘truth’. The variable badPred are the predicted responses from a very poor model. The variable goodPred contains the predicted responses from a great model.
# Using ROC Curves for Binary Response Models
sampleDF <- data.frame(
actual = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
badPred = c(.99, .99, .99, .99, .99, .01, .01, .01, .01, .01),
goodPred = c( .01, .01, .01, .01, .01,.99, .99, .99, .99, .99))
We can now call the rxRocCurve function to compute the sensitivity and specificity for the ‘bad’ predictions, and draw the ROC curve. The numBreaks argument indicates the number of breaks to use in determining the thresholds for computing the true and false positive rates.
rxRocCurve(actualVarName = "actual", predVarNames = "badPred",
data = sampleDF, numBreaks = 10, title = "ROC for Bad Predictions")
Since all of our predictions are wrong at every threshold, the ROC curve is a flat line at 0. The Area Under the Curve (AUC) summary statistic is 0.
At the other extreme, let’s draw an ROC curve for our great model:
rxRocCurve(actualVarName = "actual", predVarNames = "goodPred",
data = sampleDF, numBreaks = 10, title = "ROC for Great Predictions")
With perfect predictions, we see the True Positive Rate is 1 for all thresholds, and the AUC is 1. We’d expect a random guess ROC curve to lie along with white diagonal line.
Now let’s use actual model predictions in an ROC curve. We’ll use the small mortgage default sample data to estimate a logistic model and them compute predicted values:
The
blocksPerRead
argument is ignored if run locally using R Client. Learn more...
# Using mortDefaultSmall for predictions and an ROC curve
mortXdf <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall")
logitOut1 <- rxLogit(default ~ creditScore + yearsEmploy + ccDebt,
data = mortXdf, blocksPerRead = 5)
predFile <- "mortPred.xdf"
predOutXdf <- rxPredict(modelObject = logitOut1, data = mortXdf,
writeModelVars = TRUE, predVarNames = "Model1", outData = predFile)
Now, let’s estimate a different model (with 1 less independent variable), and add the predictions from that model to our output data set:
# Estimate a second model without ccDebt
logitOut2 <- rxLogit(default ~ creditScore + yearsEmploy,
data = predOutXdf, blocksPerRead = 5)
# Add preditions to prediction data file
predOutXdf <- rxPredict(modelObject = logitOut2, data = predOutXdf,
predVarNames = "Model2")
Now we can compute the sensitivity and specificity for both models, using rxRoc:
rocOut <- rxRoc(actualVarName = "default",
predVarNames = c("Model1", "Model2"),
data = predOutXdf)
rocOut
threshold predVarName sensitivity specificity
1 0.00 Model1 1.000000000 0.0000000
2 0.01 Model1 0.825902335 0.9197118
3 0.02 Model1 0.647558386 0.9567965
4 0.03 Model1 0.569002123 0.9721488
5 0.04 Model1 0.481953291 0.9797647
6 0.05 Model1 0.437367304 0.9845472
7 0.06 Model1 0.386411890 0.9877825
8 0.07 Model1 0.335456476 0.9900130
9 0.08 Model1 0.305732484 0.9916406
10 0.09 Model1 0.288747346 0.9930272
11 0.10 Model1 0.261146497 0.9940520
12 0.11 Model1 0.237791932 0.9947252
13 0.12 Model1 0.225053079 0.9953682
14 0.13 Model1 0.208067941 0.9959107
15 0.14 Model1 0.197452229 0.9963528
16 0.15 Model1 0.182590234 0.9967648
17 0.16 Model1 0.171974522 0.9971064
18 0.17 Model1 0.161358811 0.9973877
19 0.18 Model1 0.152866242 0.9975886
20 0.19 Model1 0.150743100 0.9978298
21 0.20 Model1 0.144373673 0.9980307
22 0.21 Model1 0.138004246 0.9982518
23 0.22 Model1 0.131634820 0.9984527
24 0.23 Model1 0.131634820 0.9986034
25 0.24 Model1 0.129511677 0.9987340
26 0.25 Model1 0.123142251 0.9987843
27 0.26 Model1 0.116772824 0.9988546
28 0.27 Model1 0.116772824 0.9989149
29 0.28 Model1 0.114649682 0.9989752
30 0.29 Model1 0.108280255 0.9990355
31 0.30 Model1 0.101910828 0.9991158
32 0.31 Model1 0.099787686 0.9991661
33 0.32 Model1 0.091295117 0.9992264
34 0.33 Model1 0.087048832 0.9992866
35 0.34 Model1 0.082802548 0.9993469
36 0.35 Model1 0.080679406 0.9994072
37 0.36 Model1 0.074309979 0.9994474
38 0.37 Model1 0.072186837 0.9994675
39 0.38 Model1 0.070063694 0.9995077
40 0.39 Model1 0.067940552 0.9995780
41 0.40 Model1 0.063694268 0.9995881
42 0.41 Model1 0.063694268 0.9996182
43 0.42 Model1 0.063694268 0.9996684
44 0.43 Model1 0.055201699 0.9996986
45 0.44 Model1 0.050955414 0.9997287
46 0.45 Model1 0.048832272 0.9997790
47 0.46 Model1 0.046709130 0.9997790
48 0.47 Model1 0.042462845 0.9997991
49 0.48 Model1 0.040339703 0.9998091
50 0.49 Model1 0.040339703 0.9998292
51 0.50 Model1 0.040339703 0.9998493
52 0.51 Model1 0.040339703 0.9998794
53 0.52 Model1 0.033970276 0.9998895
54 0.53 Model1 0.031847134 0.9999096
55 0.54 Model1 0.031847134 0.9999196
56 0.55 Model1 0.031847134 0.9999297
57 0.58 Model1 0.029723992 0.9999397
58 0.59 Model1 0.027600849 0.9999498
59 0.60 Model1 0.023354565 0.9999598
60 0.61 Model1 0.016985138 0.9999598
61 0.63 Model1 0.014861996 0.9999598
62 0.65 Model1 0.014861996 0.9999799
63 0.70 Model1 0.014861996 0.9999900
64 0.72 Model1 0.012738854 0.9999900
65 0.74 Model1 0.010615711 0.9999900
66 0.78 Model1 0.010615711 1.0000000
67 0.80 Model1 0.008492569 1.0000000
68 0.83 Model1 0.006369427 1.0000000
69 0.89 Model1 0.004246285 1.0000000
70 0.91 Model1 0.000000000 1.0000000
71 0.00 Model2 1.000000000 0.0000000
72 0.01 Model2 0.108280255 0.9612776
73 0.02 Model2 0.000000000 0.9994474
74 0.03 Model2 0.000000000 1.0000000
With the removeDups argument set to its default of TRUE, rows containing duplicate entries for sensitivity and specificity were removed from the returned data frame. In this case, it results in many fewer rows for Model2 than Model1. We can use the rxRoc plot method to render our ROC curve using the computed results.
plot(rocOut)
The resulting plot shows that the second model is much closer to the “random” diagonal line than the first model.