Generalized Linear Models using RevoScaleR
Important
This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?
Generalized linear models (GLM) are a framework for a wide range of analyses. They relax the assumptions for a standard linear model in two ways. First, a functional form can be specified for the conditional mean of the predictor, referred to as the “link” function. Second, you can specify a distribution for the response variable. The rxGlm function in RevoScaleR provides the ability to estimate generalized linear models on large data sets.
The following family/link combinations are implemented in C++ for performance enhancements: binomial/logit, gamma/log, poisson/log, and Tweedie. Other family/link combinations use a combination of C++ and R code. Any valid R family object that can be used with glm() can be used with rxGlm(), including user-defined. The following table shows all of the supported family/link combinations (in addition to user-defined):
Family | Default Link Function | Other Available Link Functions |
---|---|---|
binomial | "logit" | "probit", "cauchit", "log", "cloglog" |
gaussian | "identity" | "log", "inverse" |
Gamma | "inverse" | "identity", "log" |
inverse.gaussian | "1/mu^2" | "inverse", "identity", "log" |
poisson | "log" | "identity", "sqrt" |
quasi | "identity" with variance = "constant" | "logit", "probit", "cloglog", "inverse", "log", "1/mu^2", "sqrt" |
quasibinomial | "logit" | Same as binomial, but dispersion parameter not fixed at one |
quasipoisson | "log" | Same as poisson, but dispersion parameter not fixed at one |
rxTweedie | requires arguments instead of link function |
A Simple Example Using the Poisson Family
The Poisson family is used to estimate models of count data. Examples from the literature include the following types of response variables:
- Number of drinks on a Saturday night
- Number of bacterial colonies in a Petri dish
- Number of children born to married women
- Number of credit cards a person has
We’ll start with a simple example from Kabacoff’s R in Action book, using data provided with the robust R package. The data are from a placebo-controlled clinical trial of 59 epileptics. Patients with partial seizures were enrolled in a randomized clinical trial of the anti-epileptic drug, progabide. Counts of epileptic seizures were recorded during the trial. The data set also includes a baseline 8-week seizure count and the age of the patient.
To access this data, first make sure the robust package is installed, then use the data command to load the data frame:
# A Simple Example Using the Poisson Family
if ("robust" %in% .packages()){
data(breslow.dat, package = "robust")
First, let’s get some basic information on the data set, then draw a histogram of the sumY variable, containing the total count of seizures during the trial.
rxGetInfo(breslow.dat, getVarInfo = TRUE)
rxHistogram( ~sumY, numBreaks = 25, data = breslow.dat)
The data set has 59 observations, and 12 variables. The variables of interest are Base, Age, Trt, and sumY.
Data frame: breslow.dat
Number of observations: 59
Number of variables: 12
Variable information:
Var 1: ID, Type: integer, Low/High: (101, 238)
Var 2: Y1, Type: integer, Low/High: (0, 102)
Var 3: Y2, Type: integer, Low/High: (0, 65)
Var 4: Y3, Type: integer, Low/High: (0, 76)
Var 5: Y4, Type: integer, Low/High: (0, 63)
Var 6: Base, Type: integer, Low/High: (6, 151)
Var 7: Age, Type: integer, Low/High: (18, 42)
Var 8: Trt
2 factor levels: placebo progabide
Var 9: Ysum, Type: integer, Low/High: (0, 302)
Var 10: sumY, Type: integer, Low/High: (0, 302)
Var 11: Age10, Type: numeric, Low/High: (1.8000, 4.2000)
Var 12: Base4, Type: numeric, Low/High: (1.5000, 37.7500)
To estimate a model with sumY as the response variable and the Base number of seizures, Age, and the treatment as explanatory variables, we can use rxGlm. A benefit to using rxGlm is that the code scales for use with a much bigger data set.
myGlm <- rxGlm(sumY~ Base + Age + Trt, dropFirst = TRUE,
data = breslow.dat, family = poisson())
summary(myGlm)
Results in:
Call:
rxGlm(formula = sumY ~ Base + Age + Trt, data = breslow.dat,
family = poisson(), dropFirst = TRUE)
Generalized Linear Model Results for: sumY ~ Base + Age + Trt
Data: breslow.dat
Dependent variable(s): sumY
Total independent variables: 5 (Including number dropped: 1)
Number of valid observations: 59
Number of missing observations: 0
Family-link: poisson-log
Residual deviance: 559.4437 (on 55 degrees of freedom)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9488259 0.1356191 14.370 2.22e-16 ***
Base 0.0226517 0.0005093 44.476 2.22e-16 ***
Age 0.0227401 0.0040240 5.651 5.85e-07 ***
Trt=placebo Dropped Dropped Dropped Dropped
Trt=progabide -0.1527009 0.0478051 -3.194 0.00232 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Condition number of final variance-covariance matrix: 3.3382
Number of iterations: 4
To interpret the coefficients, it is sometimes useful to transform them back to the original scale of the dependent variable. In this case:
exp(coef(myGlm))
(Intercept) Base Age Trt=placebo Trt=progabide
7.0204403 1.0229102 1.0230007 NA 0.8583864
This suggests that, controlling for the base number of seizures and age, people taking progabide during the trial had 85% of the expected number seizures compared with people who didn’t.
A common method of checking for overdispersion is to calculate the ratio of the residual deviance with the degrees of freedom. This should be about 1 to fit the assumptions of the model.
myGlm$deviance/myGlm$df[2]
[1] 10.1717
We can see that the ratio is well above one.
The quasi-poisson family can be used to handle over-dispersion. In this case, instead of assuming that the variance and mean are one, a relationship is estimated from the data:
myGlm1 <- rxGlm(sumY ~ Base + Age + Trt, dropFirst = TRUE,
data = breslow.dat, family = quasipoisson())
summary(myGlm1)
} # End of if for robust package
Call:
rxGlm(formula = sumY ~ Base + Age + Trt, data = breslow.dat,
family = quasipoisson(), dropFirst = TRUE)
Generalized Linear Model Results for: sumY ~ Base + Age + Trt
Data: breslow.dat
Dependent variable(s): sumY
Total independent variables: 5 (Including number dropped: 1)
Number of valid observations: 59
Number of missing observations: 0
Family-link: quasipoisson-log
Residual deviance: 559.4437 (on 55 degrees of freedom)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.948826 0.465091 4.190 0.000102 ***
Base 0.022652 0.001747 12.969 2.22e-16 ***
Age 0.022740 0.013800 1.648 0.105085
Trt=placebo Dropped Dropped Dropped Dropped
Trt=progabide -0.152701 0.163943 -0.931 0.355702
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 11.76075)
Condition number of final variance-covariance matrix: 3.3382
Number of iterations: 4
Notice that the coefficients are the same as when using the poisson family, but that the standard errors are larger. The effect of the treatment is no longer significant.
An Example Using the Gamma Family
The Gamma family is used with data containing positive values with a positive skew. A classic example is estimating the value of auto insurance claims. Using the sample claims.xdf data set:
# An Example Using the Gamma Family
claimsXdf <- file.path(rxGetOption("sampleDataDir"),"claims.xdf")
claimsGlm <- rxGlm(cost ~ age + car.age + type, family = Gamma,
dropFirst = TRUE, data = claimsXdf)
summary(claimsGlm)
Call:
rxGlm(formula = cost ~ age + car.age + type, data = claimsXdf,
family = Gamma, dropFirst = TRUE)
Generalized Linear Model Results for: cost ~ age + car.age + type
File name:
C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\claims.xdf
Dependent variable(s): cost
Total independent variables: 17 (Including number dropped: 3)
Number of valid observations: 123
Number of missing observations: 5
Family-link: Gamma-inverse
Residual deviance: 15.6397 (on 109 degrees of freedom)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0032807 0.0005126 6.400 4.02e-09 ***
age=17-20 Dropped Dropped Dropped Dropped
age=21-24 0.0006593 0.0005471 1.205 0.230843
age=25-29 0.0003911 0.0005183 0.755 0.452114
age=30-34 0.0012388 0.0005982 2.071 0.040720 *
age=35-39 0.0017152 0.0006514 2.633 0.009685 **
age=40-49 0.0012649 0.0006007 2.106 0.037516 *
age=50-59 0.0002863 0.0005087 0.563 0.574771
age=60+ 0.0013006 0.0006041 2.153 0.033519 *
car.age=0-3 Dropped Dropped Dropped Dropped
car.age=4-7 0.0003444 0.0003535 0.974 0.332120
car.age=8-9 0.0011005 0.0004161 2.645 0.009375 **
car.age=10+ 0.0034437 0.0006107 5.639 1.36e-07 ***
type=A Dropped Dropped Dropped Dropped
type=B -0.0004443 0.0004880 -0.911 0.364508
type=C -0.0004189 0.0004912 -0.853 0.395668
type=D -0.0016209 0.0004344 -3.732 0.000304 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.1785316)
Condition number of final variance-covariance matrix: 12.4648
Number of iterations: 4
But, these estimates are conditional on the fact that a claim was made.
An Example Using the Tweedie Family
The Tweedie family of distributions provides flexible models for estimation. The power parameter var.power determines the shape of the distribution, with familiar models as special cases: if var.power is set to 0, Tweedie is a normal distribution; when set to 1, it is Poisson; when 2, it is Gamma; when 3, it is inverse Gaussian. If var.power is between 1 and 2, it is a compound Poisson distribution and is appropriate for positive data that also contains exact zeros, for example, insurance claims data, rainfall data, or fish-catch data. If var.power is greater than 2, it is appropriate for positive data.
In this example, we use a subsample from the 5% sample of the U.S. 2000 census. We consider the annual cost of property insurance for heads of household ages 21 through 89, and its relationship to age, sex, and region. A variable “perwt” in the data set represents the probability weight for that observation. First, to create the subsample (specify the correct data path for your downloaded data):
bigDataDir = "C:/MRS/Data"
bigCensusData <- file.path(bigDataDir, "Census5PCT2000.xdf")
propinFile <- "CensusPropertyIns.xdf"
propinDS <- rxDataStep(inData = bigCensusData, outFile = propinFile,
rowSelection = (related == 'Head/Householder') & (age > 20) & (age < 90),
varsToKeep = c("propinsr", "age", "sex", "region", "perwt"),
blocksPerRead = 10, overwrite = TRUE)
rxGetInfo(propinDS)
File name: C:\YourWorkingDir\CensusPropertyIns.xdf
Number of observations: 5175270
Number of variables: 5
Number of blocks: 10
Compression type: zlib
The
blocksPerRead
argument is ignored when run locally using R Client. Learn more...
An Xdf data source representing the new data file is returned. The new data file has over 5 million observations.
Let’s do one more step in data cleaning. The variable region has some long factor level character strings, and it also has a number of levels for which there are no observations. We can see this using rxSummary:
rxSummary(~region, data = propinDS)
Call:
rxSummary(formula = ~region, data = propinDS)
Summary Statistics Results for: ~region
File name: C:\YourWorkingDir\CensusPropertyIns.xdf
Number of valid observations: 5175270
Category Counts for region
Number of categories: 17
Number of valid observations: 5175270
Number of missing observations: 0
region Counts
New England Division 265372
Middle Atlantic Division 734585
Mixed Northeast Divisions (1970 Metro) 0
East North Central Div. 847367
West North Central Div. 366417
Mixed Midwest Divisions (1970 Metro) 0
South Atlantic Division 981614
East South Central Div. 324003
West South Central Div. 553425
Mixed Southern Divisions (1970 Metro) 0
Mountain Division 328940
Pacific Division 773547
Mixed Western Divisions (1970 Metro) 0
Military/Military reservations 0
PUMA boundaries cross state lines-1% sample 0
State not identified 0
Inter-regional county group (1970 Metro samples) 0
We can use the rxFactors function rename and reduce the number of levels:
regionLevels <- list( "New England" = "New England Division",
"Middle Atlantic" = "Middle Atlantic Division",
"East North Central" = "East North Central Div.",
"West North Central" = "West North Central Div.",
"South Atlantic" = "South Atlantic Division",
"East South Central" = "East South Central Div.",
"West South Central" = "West South Central Div.",
"Mountain" ="Mountain Division",
"Pacific" ="Pacific Division")
rxFactors(inData = propinDS, outFile = propinDS,
factorInfo = list(region = list(newLevels = regionLevels,
otherLevel = "Other")),
overwrite = TRUE)
As a first step to analysis, let’s look at a histogram of the property insurance cost:
rxHistogram(~propinsr, data = propinDS, pweights = "perwt")
This data appears to be a good match for the Tweedie family with a variance power parameter between 1 and 2, since it has a “clump” of exact zeros in addition to a distribution of positive values.
We can estimate the parameters using rxGlm, setting the var.power argument to 1.5. As explanatory variables we’ll use sex, an “on-the-fly” factor variable with a level for each age, and region:
propinGlm <- rxGlm(propinsr~sex + F(age) + region,
pweights = "perwt", data = propinDS,
family = rxTweedie(var.power = 1.5), dropFirst = TRUE)
summary(propinGlm)
Call:
rxGlm(formula = propinsr ~ sex + F(age) + region, data = propinDS,
family = rxTweedie(var.power = 1.5), pweights = "perwt",
dropFirst = TRUE)
Generalized Linear Model Results for: propinsr ~ sex + F(age) + region
File name: C:\YourWorkingDir\CensusPropertyIns.xdf
Probability weights: perwt
Dependent variable(s): propinsr
Total independent variables: 82 (Including number dropped: 4)
Number of valid observations: 5175270
Number of missing observations: 0
Family-link: Tweedie-mu^-0.5
Residual deviance: 3292809839.3236 (on 5175192 degrees of freedom)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.231e-01 5.893e-04 208.961 2.22e-16 ***
sex=Male Dropped Dropped Dropped Dropped
sex=Female 9.026e-03 3.164e-05 285.305 2.22e-16 ***
F_age=21 Dropped Dropped Dropped Dropped
F_age=22 -9.208e-03 7.523e-04 -12.240 2.22e-16 ***
F_age=23 -1.980e-02 6.966e-04 -28.430 2.22e-16 ***
F_age=24 -2.856e-02 6.648e-04 -42.955 2.22e-16 ***
F_age=25 -3.652e-02 6.432e-04 -56.776 2.22e-16 ***
F_age=26 -4.371e-02 6.289e-04 -69.500 2.22e-16 ***
F_age=27 -4.894e-02 6.182e-04 -79.162 2.22e-16 ***
F_age=28 -5.398e-02 6.099e-04 -88.506 2.22e-16 ***
F_age=29 -5.787e-02 6.043e-04 -95.749 2.22e-16 ***
F_age=30 -6.064e-02 6.020e-04 -100.716 2.22e-16 ***
F_age=31 -6.336e-02 6.004e-04 -105.522 2.22e-16 ***
F_age=32 -6.526e-02 5.991e-04 -108.933 2.22e-16 ***
F_age=33 -6.721e-02 5.975e-04 -112.489 2.22e-16 ***
F_age=34 -6.854e-02 5.962e-04 -114.948 2.22e-16 ***
F_age=35 -6.942e-02 5.949e-04 -116.688 2.22e-16 ***
F_age=36 -7.090e-02 5.941e-04 -119.342 2.22e-16 ***
F_age=37 -7.184e-02 5.936e-04 -121.023 2.22e-16 ***
F_age=38 -7.265e-02 5.931e-04 -122.498 2.22e-16 ***
F_age=39 -7.354e-02 5.926e-04 -124.090 2.22e-16 ***
F_age=40 -7.401e-02 5.923e-04 -124.954 2.22e-16 ***
F_age=41 -7.462e-02 5.923e-04 -125.994 2.22e-16 ***
F_age=42 -7.508e-02 5.920e-04 -126.819 2.22e-16 ***
F_age=43 -7.568e-02 5.920e-04 -127.846 2.22e-16 ***
F_age=44 -7.597e-02 5.919e-04 -128.344 2.22e-16 ***
F_age=45 -7.642e-02 5.918e-04 -129.139 2.22e-16 ***
F_age=46 -7.693e-02 5.919e-04 -129.973 2.22e-16 ***
F_age=47 -7.727e-02 5.918e-04 -130.564 2.22e-16 ***
F_age=48 -7.749e-02 5.919e-04 -130.927 2.22e-16 ***
F_age=49 -7.783e-02 5.919e-04 -131.488 2.22e-16 ***
F_age=50 -7.809e-02 5.919e-04 -131.941 2.22e-16 ***
F_age=51 -7.853e-02 5.919e-04 -132.678 2.22e-16 ***
F_age=52 -7.888e-02 5.916e-04 -133.326 2.22e-16 ***
F_age=53 -7.919e-02 5.916e-04 -133.859 2.22e-16 ***
F_age=54 -7.909e-02 5.931e-04 -133.348 2.22e-16 ***
F_age=55 -7.938e-02 5.929e-04 -133.873 2.22e-16 ***
F_age=56 -7.930e-02 5.929e-04 -133.751 2.22e-16 ***
F_age=57 -7.959e-02 5.928e-04 -134.276 2.22e-16 ***
F_age=58 -7.935e-02 5.937e-04 -133.644 2.22e-16 ***
F_age=59 -7.923e-02 5.942e-04 -133.336 2.22e-16 ***
F_age=60 -7.894e-02 5.946e-04 -132.753 2.22e-16 ***
F_age=61 -7.917e-02 5.947e-04 -133.122 2.22e-16 ***
F_age=62 -7.912e-02 5.949e-04 -133.003 2.22e-16 ***
F_age=63 -7.904e-02 5.954e-04 -132.746 2.22e-16 ***
F_age=64 -7.886e-02 5.956e-04 -132.405 2.22e-16 ***
F_age=65 -7.878e-02 5.952e-04 -132.359 2.22e-16 ***
F_age=66 -7.871e-02 5.961e-04 -132.031 2.22e-16 ***
F_age=67 -7.864e-02 5.963e-04 -131.869 2.22e-16 ***
F_age=68 -7.861e-02 5.966e-04 -131.766 2.22e-16 ***
F_age=69 -7.845e-02 5.967e-04 -131.490 2.22e-16 ***
F_age=70 -7.861e-02 5.965e-04 -131.790 2.22e-16 ***
F_age=71 -7.856e-02 5.970e-04 -131.600 2.22e-16 ***
F_age=72 -7.850e-02 5.971e-04 -131.460 2.22e-16 ***
F_age=73 -7.813e-02 5.977e-04 -130.714 2.22e-16 ***
F_age=74 -7.818e-02 5.981e-04 -130.722 2.22e-16 ***
F_age=75 -7.800e-02 5.986e-04 -130.302 2.22e-16 ***
F_age=76 -7.781e-02 5.993e-04 -129.825 2.22e-16 ***
F_age=77 -7.763e-02 6.002e-04 -129.342 2.22e-16 ***
F_age=78 -7.735e-02 6.009e-04 -128.728 2.22e-16 ***
F_age=79 -7.724e-02 6.024e-04 -128.221 2.22e-16 ***
F_age=80 -7.646e-02 6.045e-04 -126.495 2.22e-16 ***
F_age=81 -7.651e-02 6.060e-04 -126.244 2.22e-16 ***
F_age=82 -7.643e-02 6.081e-04 -125.693 2.22e-16 ***
F_age=83 -7.600e-02 6.109e-04 -124.411 2.22e-16 ***
F_age=84 -7.546e-02 6.145e-04 -122.798 2.22e-16 ***
F_age=85 -7.529e-02 6.183e-04 -121.775 2.22e-16 ***
F_age=86 -7.441e-02 6.259e-04 -118.882 2.22e-16 ***
F_age=87 -7.422e-02 6.324e-04 -117.363 2.22e-16 ***
F_age=88 -7.339e-02 6.463e-04 -113.553 2.22e-16 ***
F_age=89 -7.310e-02 6.569e-04 -111.284 2.22e-16 ***
region=New England Dropped Dropped Dropped Dropped
region=Middle Atlantic 1.710e-03 6.893e-05 24.807 2.22e-16 ***
region=East North Central 3.552e-03 6.867e-05 51.723 2.22e-16 ***
region=West North Central 4.200e-04 7.697e-05 5.457 4.83e-08 ***
region=South Atlantic -1.227e-03 6.521e-05 -18.821 2.22e-16 ***
region=East South Central -7.894e-04 7.793e-05 -10.130 2.22e-16 ***
region=West South Central -5.857e-03 6.732e-05 -87.011 2.22e-16 ***
region=Mountain 1.821e-03 8.060e-05 22.596 2.22e-16 ***
region=Pacific -5.990e-04 6.732e-05 -8.897 2.22e-16 ***
region=Other Dropped Dropped Dropped Dropped
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Tweedie family taken to be 546.4888)
Condition number of final variance-covariance matrix: 5980.277
Number of iterations: 8
A good way to begin examining the results of the estimated model is to look at predicted values for given explanatory characteristics. For example, let’s create a prediction data set for the South Atlantic region for all ages and sexes:
# Get the region factor levels
varInfo <- rxGetVarInfo(propinDS)
regionLabels <- varInfo$region$levels
# Create a prediction data set for region 5, all ages, both sexes
region <- factor(rep(5, times=138), levels = 1:10, labels = regionLabels)
age <- c(21:89, 21:89)
sex <- factor(c(rep(1, times=69), rep(2, times=69)),
levels = 1:2,
labels = c("Male", "Female"))
predData <- data.frame(age, sex, region)
Now we’ll use that as a basis for a similar prediction data set for the Middle Atlantic region:
# Create a prediction data set for region 2, all ages, both sexes
predData2 <- predData
predData2$region <-factor(rep(2, times=138), levels = 1:10,
labels = varInfo$region$levels)
Next we combine the two data sets, and compute the predicted values for annual property insurance cost using our estimated rxGlm model:
predData$predicted <- outData$propinsr_Pred
rxLinePlot( predicted ~age|region+sex, data = predData,
title = "Predicted Annual Property Insurance Costs",
xTitle = "Age of Head of Household",
yTitle = "Predicted Costs")
Stepwise Generalized Linear Models
Stepwise generalized linear models help you determine which variables are most important to include in the model. You provide a minimal, or lower, model formula and a maximal, or upper, model formula. Using forward selection, backward elimination, or bidirectional search, the algorithm determines the model formula that provides the best fit based on an AIC selection criterion or a significance level criterion.
As an example, consider again the Gamma family model from earlier in this article:
claimsXdf <- file.path(rxGetOption("sampleDataDir"),"claims.xdf")
claimsGlm <- rxGlm(cost ~ age + car.age + type, family = Gamma,
dropFirst = TRUE, data = claimsXdf)
summary(claimsGlm)
We can recast this model as a stepwise model by specifying a variableSelection argument using the rxStepControl function to provide our stepwise arguments:
claimsGlmStep <- rxGlm(cost ~ age, family = Gamma, dropFirst=TRUE,
data=claimsXdf, variableSelection =
rxStepControl(scope = ~ age + car.age + type ))
summary(claimsGlmStep)
Call:
rxGlm(formula = cost ~ age, data = claimsXdf, family = Gamma,
variableSelection = rxStepControl(scope = ~age + car.age +
type), dropFirst = TRUE)
Generalized Linear Model Results for: cost ~ car.age + type
File name:
C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\claims.xdf
Dependent variable(s): cost
Total independent variables: 9 (Including number dropped: 2)
Number of valid observations: 123
Number of missing observations: 5
Family-link: Gamma-inverse
Residual deviance: 18.0433 (on 116 degrees of freedom)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0040354 0.0004661 8.657 3.24e-14 ***
car.age=0-3 Dropped Dropped Dropped Dropped
car.age=4-7 0.0003568 0.0004037 0.884 0.37868
car.age=8-9 0.0011825 0.0004688 2.522 0.01302 *
car.age=10+ 0.0035478 0.0006853 5.177 9.57e-07 ***
type=A Dropped Dropped Dropped Dropped
type=B -0.0004512 0.0005519 -0.818 0.41528
type=C -0.0004135 0.0005558 -0.744 0.45837
type=D -0.0016307 0.0004923 -3.313 0.00123 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.2249467)
Condition number of final variance-covariance matrix: 9.1775
Number of iterations: 5
We see that in the stepwise model fit, age no longer appears in the final model.
Plotting Model Coefficients
The ability to save model coefficients using the argument keepStepCoefs = TRUE within the rxStepControl call, and to plot them with the function rxStepPlot was described in great detail for stepwise rxLinMod in Fitting Linear Models using RevoScaleR. This functionality is also available for stepwise rxGLM objects.