Cluster classification in RevoScaleR
Important
This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?
Clustering is the general name for a large family of classification techniques that assign each observation to one of two or more clusters on the basis of some distance metric.
K-means Clustering
K-means clustering is a classification technique that groups observations of numeric data using one of several iterative relocation algorithms. Starting from some initial classification, which may be random, points are moved from one cluster to another so as to minimize the within-cluster sums of squares. In RevoScaleR, the algorithm used is that of Lloyd.
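The relocation idea can be sketched in a few lines of base R. This is a minimal illustration of a Lloyd-style iteration on a small in-memory matrix, not the RevoScaleR implementation; the function name lloyd_kmeans and the simulated data are hypothetical and exist only for this example.

```r
# Minimal Lloyd-style k-means in base R (illustrative sketch only;
# lloyd_kmeans is a hypothetical helper, not part of RevoScaleR).
lloyd_kmeans <- function(x, centers, max_iter = 100) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  k <- nrow(centers)
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from each point to each center.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    # Stop as soon as no point changes cluster.
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
    # Update step: move each center to the mean of its assigned points.
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  # Within-cluster sum of squares for each cluster.
  withinss <- sapply(seq_len(k), function(j)
    sum(sweep(x[cluster == j, , drop = FALSE], 2, centers[j, ])^2))
  list(centers = centers, cluster = cluster, withinss = withinss,
       iterations = iter)
}

# Simulated data: two well-separated groups of 50 points each.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))
fit <- lloyd_kmeans(x, centers = x[c(1, 51), ])
fit$centers
table(fit$cluster)
```

Each pass alternates an assignment step and an update step, and neither step can increase the total within-cluster sum of squares, which is why the iteration settles down. The standard R kmeans function offers the same scheme through algorithm = "Lloyd" and can be used to cross-check the sketch.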
To perform k-means clustering with RevoScaleR, use the rxKmeans function.
Clustering the Airline Data
As a first example of k-means clustering, we will cluster the arrival delay and scheduled departure time in the 7% subsample of the airline data. To start, we extract the variables of interest into a new working data set, to which we will later write additional information:
# K-means Clustering
# Clustering the Airline Data
bigDataDir <- "C:/MRS/Data"
sampleAirData <- file.path(bigDataDir, "AirOnTime7Pct.xdf")
rxDataStep(inData = sampleAirData, outFile = "AirlineDataClusterVars.xdf",
    varsToKeep = c("DayOfWeek", "ArrDelay", "CRSDepTime", "DepDelay"))
We specify the variables to cluster as a formula, and specify the number of clusters we’d like. Initial centers for these clusters are then chosen at random.
kclusts1 <- rxKmeans(formula = ~ArrDelay + CRSDepTime,
    data = "AirlineDataClusterVars.xdf",
    seed = 10,
    outFile = "AirlineDataClusterVars.xdf", numClusters = 5)
kclusts1
This produces the following output (because the initial centers are chosen at random, your output will probably look different):
Call:
rxKmeans(formula = ~ArrDelay + CRSDepTime, data = "AirlineDataClusterVars.xdf",
outFile = "AirlineDataClusterVars.xdf", numClusters = 5)
Data: "AirlineDataClusterVars.xdf"
Number of valid observations: 10186272
Number of missing observations: 213483
Clustering algorithm:
K-means clustering with 5 clusters of sizes 922985, 38192, 4772791, 261779, 4190525
Cluster means:
ArrDelay CRSDepTime
1 45.258179 14.86596
2 275.363820 14.81432
3 -10.284426 13.08375
4 118.365205 15.52079
5 7.803893 13.53811
Within cluster sum of squares by cluster:
1 2 3 4 5
223220709 501736748 354763376 233533349 312403604
Available components:
[1] "centers" "size" "withinss" "valid.obs"
[5] "missing.obs" "numIterations" "tot.withinss" "totss"
[9] "betweenss" "cluster" "params" "formula"
[13] "call"
The value returned by rxKmeans is a list similar to the list returned by the standard R kmeans function. The printed output shows a subset of this information, including the number of valid and missing observations, the cluster sizes, the cluster centers, and the within-cluster sums of squares.
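For comparison, the components of the standard R kmeans return value can be inspected on a small in-memory example. The sketch below uses simulated data whose column names merely echo the airline variables (the values are made up), and it calls base R kmeans rather than rxKmeans, so the component names differ in places from the rxKmeans list shown above.

```r
# Simulated stand-in data; the values are invented for illustration only.
set.seed(10)
x <- cbind(ArrDelay = rnorm(200, mean = 10, sd = 30),
           CRSDepTime = runif(200, min = 0, max = 24))

# Base R k-means (not rxKmeans) with 5 clusters.
fit <- kmeans(x, centers = 5, iter.max = 50)

names(fit)     # components of the returned list
fit$size       # number of observations in each cluster
fit$centers    # cluster means, one row per cluster
fit$withinss   # within-cluster sum of squares, one value per cluster

# The total sum of squares splits into within- and between-cluster parts.
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)
```

The same decomposition applies to the totss, tot.withinss, and betweenss components listed in the rxKmeans output, so the ratio betweenss / totss gives a rough sense of how much of the variation the clusters account for.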
The cluster membership component is returned if the input is a data frame. If the input is an .xdf file, cluster membership is returned only if outFile is specified, and then it is returned not as part of the return object but as a column in the specified file. In our example, we specified an outFile, and we see the cluster membership variable, .rxCluster, when we look at the file with rxGetInfo:
rxGetInfo("AirlineDataClusterVars.xdf", getVarInfo=TRUE)
File name: AirlineDataClusterVars.xdf
Number of observations: 10399755
Number of variables: 5
Number of blocks: 19
Compression type: zlib
Variable information:
Var 1: DayOfWeek
7 factor levels: Mon Tues Wed Thur Fri Sat Sun
Var 2: ArrDelay, Type: integer, Low/High: (-1233, 2453)
Var 3: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000)
Var 4: DepDelay, Type: integer, Low/High: (-1199, 2467)
Var 5: .rxCluster, Type: integer, Low/High: (1, 5)
Using the Cluster Membership Information
A common follow-up to clustering is to use the cluster membership information to see whether a given model varies appreciably from cluster to cluster. Since we can use the rowSelection argument to extract a single cluster on the fly, there is no need to sort the data first. As an example, we fit our original linear model of ArrDelay by DayOfWeek for two of the clusters:
# Using the Cluster Membership Information
clust1Lm <- rxLinMod(ArrDelay ~ DayOfWeek, "AirlineDataClusterVars.xdf",
    rowSelection = .rxCluster == 1)
clust5Lm <- rxLinMod(ArrDelay ~ DayOfWeek, "AirlineDataClusterVars.xdf",
    rowSelection = .rxCluster == 5)
summary(clust1Lm)
summary(clust5Lm)
Looking at the summary for clust1Lm shows the following:
Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = "AirlineDataClusterVars.xdf",
rowSelection = .rxCluster == 1)
Linear Regression Results for: ArrDelay ~ DayOfWeek
File name: AirlineDataClusterVars.xdf
Dependent variable(s): ArrDelay
Total independent variables: 8 (Including number dropped: 1)
Number of valid observations: 922985
Number of missing observations: 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.21591 0.04237 1067.199 2.22e-16 ***
DayOfWeek=Mon 0.23053 0.05893 3.912 9.16e-05 ***
DayOfWeek=Tues -0.06496 0.05968 -1.089 0.2764
DayOfWeek=Wed 0.10139 0.05869 1.727 0.0841 .
DayOfWeek=Thur 0.06098 0.05708 1.068 0.2854
DayOfWeek=Fri 0.23222 0.05660 4.103 4.08e-05 ***
DayOfWeek=Sat -0.43444 0.06364 -6.827 8.68e-12 ***
DayOfWeek=Sun Dropped Dropped Dropped Dropped
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.89 on 922978 degrees of freedom
Multiple R-squared: 0.0001705
Adjusted R-squared: 0.000164
F-statistic: 26.24 on 6 and 922978 DF, p-value: < 2.2e-16
Condition number: 12.8655
Similarly, the summary for clust5Lm shows the following:
Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = "AirlineDataClusterVars.xdf",
rowSelection = .rxCluster == 5)
Linear Regression Results for: ArrDelay ~ DayOfWeek
File name: AirlineDataClusterVars.xdf
Dependent variable(s): ArrDelay
Total independent variables: 8 (Including number dropped: 1)
Number of valid observations: 4190525
Number of missing observations: 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.808093 0.009593 813.960 2.22e-16 ***
DayOfWeek=Mon -0.131001 0.013320 -9.835 2.22e-16 ***
DayOfWeek=Tues -0.228087 0.013374 -17.055 2.22e-16 ***
DayOfWeek=Wed -0.035954 0.013292 -2.705 0.00683 **
DayOfWeek=Thur 0.231958 0.013170 17.613 2.22e-16 ***
DayOfWeek=Fri 0.313961 0.013171 23.838 2.22e-16 ***
DayOfWeek=Sat -0.257716 0.014036 -18.361 2.22e-16 ***
DayOfWeek=Sun Dropped Dropped Dropped Dropped
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.238 on 4190518 degrees of freedom
Multiple R-squared: 0.0007911
Adjusted R-squared: 0.0007897
F-statistic: 553 on 6 and 4190518 DF, p-value: < 2.2e-16
Condition number: 12.0006
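The same fit-per-cluster pattern can be tried on a small data frame with base R alone. The following is a sketch under stated assumptions: the data are simulated, the column name cluster is hypothetical, and lm stands in for rxLinMod. Note that lm uses the first factor level (Mon) as the baseline by default, whereas the rxLinMod output above drops the last level (Sun), so the coefficients are parameterized differently.

```r
# Simulated data: two groups of arrival delays with different means.
set.seed(42)
n <- 300
days <- c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun")
df <- data.frame(
  DayOfWeek = factor(sample(days, n, replace = TRUE), levels = days),
  ArrDelay = c(rnorm(n / 2, mean = 45, sd = 15),
               rnorm(n / 2, mean = 8, sd = 7))
)

# Cluster on ArrDelay and store the membership as a column
# (the name "cluster" is an assumption made for this example).
df$cluster <- kmeans(df$ArrDelay, centers = 2)$cluster

# Fit the same linear model separately within each cluster.
fits <- lapply(split(df, df$cluster),
               function(d) lm(ArrDelay ~ DayOfWeek, data = d))

# Compare the baseline (Monday) mean delay across clusters.
sapply(fits, function(m) coef(m)[["(Intercept)"]])
```

Splitting the data frame is convenient in memory; with an .xdf file, the rowSelection approach shown earlier avoids materializing each subset.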