rxKmeans: K-Means Clustering
Description
Perform k-means clustering on small or large data.
Usage
rxKmeans(formula, data,
    outFile = NULL, outColName = ".rxCluster",
    writeModelVars = FALSE, extraVarsToWrite = NULL,
    overwrite = FALSE, numClusters = NULL, centers = NULL,
    algorithm = "Lloyd", numStartRows = 0, maxIterations = 1000,
    numStarts = 1, rowSelection = NULL,
    transforms = NULL, transformObjects = NULL,
    transformFunc = NULL, transformVars = NULL,
    transformPackages = NULL, transformEnvir = NULL,
    blocksPerRead = rxGetOption("blocksPerRead"),
    reportProgress = rxGetOption("reportProgress"), verbose = 0,
    computeContext = rxGetOption("computeContext"),
    xdfCompressionLevel = rxGetOption("xdfCompressionLevel"), ...)
## S3 method for class 'rxKmeans':
print(x, header = TRUE, ...)
Arguments
formula
formula as described in rxFormula.
data
a data source object, a character string specifying a .xdf file, or a data frame object.
outFile
either an RxXdfData data source object or a character string specifying the .xdf file for storing the resulting cluster indexes. If NULL, no cluster indexes are stored to disk. Note that if the input data is a data frame, the cluster indexes are returned automatically. Note also that if rowSelection is specified and not NULL, outFile cannot be the same as data, since the resulting set of cluster indexes will generally not have the same number of rows as the original data source.
outColName
character string to be used as a column name for the resulting cluster indexes if outFile is not NULL. Note that make.names is used on outColName to ensure that the column name is valid. If outFile is an RxOdbcData source, dots are first converted to underscores. Thus, the default outColName becomes "X_rxCluster".
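The naming rule above can be checked in base R. This is a minimal sketch: the gsub step is an assumption standing in for the dot-to-underscore conversion described for RxOdbcData outputs, not the library's actual code.

```r
# Default column name as given in the Usage section.
outColName <- ".rxCluster"

# make.names() leaves an already valid name unchanged.
xdfName <- make.names(outColName)

# Assumed stand-in for the ODBC case: replace dots with underscores first,
# then let make.names() repair the now-invalid leading underscore.
odbcName <- make.names(gsub(".", "_", outColName, fixed = TRUE))

print(xdfName)
print(odbcName)
```

The second name comes out as "X_rxCluster" because a syntactic R name cannot start with an underscore, so make.names prepends "X".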
writeModelVars
logical value. If TRUE, and the output file is different from the input file, variables in the model will be written to the output file in addition to the cluster numbers. If variables from the input data set are transformed in the model, the transformed variables will also be written out.
extraVarsToWrite
NULL or character vector of additional variable names from the input data to include in the outFile. If writeModelVars is TRUE, model variables will be included as well.
overwrite
logical value. If TRUE, an existing outFile with an existing column named outColName will be overwritten.
numClusters
number of clusters k to create. If NULL, then the centers argument must be specified.
centers
a k x p numeric matrix containing a set of initial (distinct) cluster centers. If NULL, then the numClusters argument must be specified.
algorithm
character string defining the algorithm to use in defining the clusters. Currently, the only supported algorithm is "Lloyd". This argument is case insensitive.
numStartRows
integer specifying the size of the sample used to choose initial centers. If 0 (the default), the size is chosen as the smaller of the number of observations and 10 times the number of clusters.
maxIterations
maximum number of iterations allowed.
numStarts
if centers is NULL, k rows are randomly selected from the data source for use as initial starting points. The numStarts argument defines the number of these random sets that are to be chosen and evaluated, and the best result is returned. If numStarts is 0, the first k rows in the data set are used. Random selection of rows is only supported for data frames and for .xdf data sources using the native file system. If the .xdf file is compressed, the random sample is taken from a maximum of the first 5000 rows of data.
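The multiple-start idea can be sketched with stats::kmeans from base R rather than rxKmeans itself. The value nStarts is a hypothetical stand-in for numStarts, and treating "best" as the smallest total within-cluster sum of squares is an assumption; the text above does not state the criterion.

```r
# Illustration with stats::kmeans (base R), not rxKmeans itself.
set.seed(42)
pts <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(200, mean = 1, sd = 0.5), ncol = 2))
k <- 2
nStarts <- 5  # hypothetical value playing the role of numStarts

# Run one fit per random set of k starting rows.
fits <- lapply(seq_len(nStarts), function(i) {
  start <- pts[sample.int(nrow(pts), k), , drop = FALSE]
  kmeans(pts, centers = start, algorithm = "Lloyd", iter.max = 100)
})

# Assumed criterion: keep the fit with the smallest total within-cluster
# sum of squares.
totals <- vapply(fits, function(f) f$tot.withinss, numeric(1))
best <- fits[[which.min(totals)]]

# Counterpart of numStarts = 0: the first k rows are the initial centers.
firstRows <- kmeans(pts, centers = pts[1:k, , drop = FALSE],
                    algorithm = "Lloyd", iter.max = 100)
```

Starting from the first k rows is deterministic but depends on row order, which is why several random starts are usually the safer choice when the data are sorted.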
rowSelection
name of a logical variable in the data set (in quotes) or a logical expression using variables in the data set to specify row selection. For example, rowSelection = "old" will use only observations in which the value of the variable old is TRUE. rowSelection = (age > 20) & (age < 65) & (log(income) > 10) will use only observations in which the value of the age variable is between 20 and 65 and the value of the log of the income variable is greater than 10. The row selection is performed after processing any data transformations (see the arguments transforms or transformFunc). As with all expressions, rowSelection can be defined outside of the function call using the expression function.
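To see which rows a selection like the one above keeps, the expression can be built with expression() and evaluated against a data frame in base R. The data frame people is hypothetical, and the eval call only illustrates the semantics; it is not a description of how RevoScaleR processes rowSelection internally.

```r
# Hypothetical data with the variables named in the text.
people <- data.frame(
  age    = c(18, 30, 45, 70, 50),
  income = c(50000, 30000, 15000, 80000, 60000),
  old    = c(FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Build the selection outside of any function call with expression().
sel <- expression((age > 20) & (age < 65) & (log(income) > 10))

# Evaluate it against the data to see which rows it keeps.
keep <- eval(sel[[1]], people)
print(people[keep, ])
```

Only the second and fifth rows pass: the others fail either the age bounds or the income condition (log(income) > 10 requires an income above roughly 22,026).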
transforms
an expression of the form list(name = expression, ...) representing the first round of variable transformations. As with all expressions, transforms (or rowSelection) can be defined outside of the function call using the expression function.
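The list(name = expression, ...) form can be tried out in base R before it is handed to a modeling function. The names people, logIncome, and ageDecade are hypothetical, and evaluating the list with eval is an illustration of what the expression computes, not the library's mechanism.

```r
# Hypothetical input data.
people <- data.frame(age = c(25, 47), income = c(1000, 20000))

# Define the transformations outside of any function call.
xforms <- expression(list(logIncome = log(income), ageDecade = age %/% 10))

# Evaluating the list against the data yields the new variables by name.
newVars <- eval(xforms[[1]], people)
transformed <- cbind(people, as.data.frame(newVars))
print(transformed)
```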
transformObjects
a named list containing objects that can be referenced by transforms, transformFunc, and rowSelection.
transformFunc
variable transformation function. See rxTransform for details.
transformVars
character vector of input data set variables needed for the transformation function. See rxTransform for details.
transformPackages
character vector defining additional R packages (outside of those specified in rxGetOption("transformPackages")) to be made available and preloaded for use in variable transformation functions, e.g., those explicitly defined in RevoScaleR functions via their transforms and transformFunc arguments or those defined implicitly via their formula or rowSelection arguments. The transformPackages argument may also be NULL, indicating that no packages outside rxGetOption("transformPackages") will be preloaded.
transformEnvir
user-defined environment to serve as a parent to all environments developed internally and used for variable data transformation. If transformEnvir = NULL, a new "hash" environment with parent baseenv() is used instead.
blocksPerRead
number of blocks to read for each chunk of data read from the data source.
reportProgress
integer value with options:
0: no progress is reported.
1: the number of processed rows is printed and updated.
2: rows processed and timings are reported.
3: rows processed and all timings are reported.
verbose
integer value. If 0, no additional output is printed. If 1, additional summary information is printed.
computeContext
a valid RxComputeContext. The RxSpark and RxHadoopMR compute contexts distribute the computation among the nodes specified by the compute context; for other compute contexts, the computation is distributed if possible on the local computer.
xdfCompressionLevel
integer in the range of -1 to 9 indicating the compression level for the output data if written to an .xdf file. The higher the value, the greater the amount of compression, resulting in smaller files but a longer time to create them. If xdfCompressionLevel is set to 0, there will be no compression and files will be compatible with the 6.0 release of Revolution R Enterprise. If set to -1, a default level of compression will be used.
...
additional arguments to be passed directly to the Revolution Compute Engine.
x
object of class rxKmeans.
header
logical value. If TRUE, header information is printed.
Details
Performs scalable k-means clustering using the classical Lloyd algorithm.
For reproducibility when using random starting values, you can pass a random seed by specifying seed=value as part of your call. See the Examples.
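For readers unfamiliar with the Lloyd iteration, the following base R sketch shows its two alternating steps on in-memory data. It is an illustration only and makes no claim about how rxKmeans is implemented or scaled; the function name lloyd is hypothetical, and set.seed stands in here for the seed argument mentioned above.

```r
# Minimal Lloyd iteration: assign each point to its nearest center, then
# move each center to the mean of its assigned points, until nothing changes.
lloyd <- function(x, centers, maxIterations = 1000) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  k <- nrow(centers)
  cluster <- integer(nrow(x))
  for (iter in seq_len(maxIterations)) {
    # Squared Euclidean distance from every point to every center (n x k).
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    newCluster <- max.col(-d2, ties.method = "first")
    if (all(newCluster == cluster)) break
    cluster <- newCluster
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      # Keep the old center if a cluster ends up empty.
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers, numIterations = iter)
}

set.seed(18)  # fixes the random starting rows so the run can be repeated
pts <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(200, mean = 1, sd = 0.5), ncol = 2))
fit <- lloyd(pts, centers = pts[sample.int(nrow(pts), 2), ])
print(fit$centers)
```

Each pass can only lower (or leave unchanged) the total within-cluster sum of squares, which is why the loop stops once the assignments no longer change.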
Value
An object of class "rxKmeans" which is a list with components:
cluster
A vector of integers indicating the cluster to which each point is allocated. This information is always returned if the data source is a data frame. If the data source is not a data frame and outFile is specified (i.e., not NULL), the cluster indexes are written/appended to the specified file with a column name as defined by outColName.
centers
matrix of cluster centers.
withinss
within-cluster sum of squares (relative to the center) for each cluster.
totss
total sum of squares.
tot.withinss
sum of the withinss vector.
betweenss
between-cluster sum of squares.
size
number of points in each cluster.
valid.obs
number of valid observations.
missing.obs
number of missing observations.
numIterations
number of iterations performed.
params
parameters sent to Microsoft R Services Compute Engine.
formula
formula as described in rxFormula.
call
the matched call.
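Several of these components share their names with those returned by stats::kmeans. Assuming they carry the same meaning (an assumption, not something this page confirms), the sum-of-squares quantities relate as in the base R sketch below, which uses stats::kmeans rather than rxKmeans.

```r
set.seed(1)
pts <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(200, mean = 1, sd = 0.5), ncol = 2))
fit <- kmeans(pts, centers = 2, algorithm = "Lloyd", iter.max = 100)

# Total sum of squares: squared distances to the overall mean.
totss <- sum(scale(pts, scale = FALSE)^2)

# Within-cluster sum of squares: squared distances to each cluster's center.
withinss <- sapply(1:2, function(j) {
  members <- pts[fit$cluster == j, , drop = FALSE]
  sum(sweep(members, 2, fit$centers[j, ])^2)
})

# The between-cluster part is what remains of the total.
betweenss <- totss - sum(withinss)

print(c(totss = totss, tot.withinss = sum(withinss), betweenss = betweenss))
```

The ratio betweenss / totss is a common quick check of how much of the overall spread the clustering explains.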
Author(s)
Microsoft Corporation Microsoft Technical Support
References
Lloyd, S. P. (1957, 1982) Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory 28, 128-137.
See Also
kmeans.
Examples
# Create data
N <- 1000
sd1 <- 0.3
mean1 <- 0
sd2 <- 0.5
mean2 <- 1
set.seed(10)
data <- rbind(matrix(rnorm(N, sd = sd1, mean = mean1), ncol = 2),
              matrix(rnorm(N, mean = mean2, sd = sd2), ncol = 2))
colnames(data) <- c("x", "y")
DF <- data.frame(data)
XDF <- paste(tempfile(), "xdf", sep=".")
if (file.exists(XDF)) file.remove(XDF)
rxDataStep(inData = DF, outFile = XDF)
centers <- DF[sample.int(NROW(DF), 2, replace = FALSE),] # grab 2 distinct random rows for starting
# Example using an XDF file as a data source
rxKmeans(~ x + y, data = XDF, centers = centers)
# Example using a local data frame as a data source
z <- rxKmeans(~ x + y, data = DF, centers = centers)
# Show a plot of the results
# By design, the data in 2-space populates two groups of points centered about
# points (0,0) and (1,1). The spread about the mean is based on a random set of
# points drawn from a Gaussian distribution with standard deviations 0.3 and 0.5.
# As a visual, the resulting plot shows two circles drawn at the centers with radii
# equal to the corresponding standard deviations.
plot(DF, col = z$cluster, asp = 1,
     main = paste("Lloyd k-means Clustering: ", z$numIterations, "iterations"))
symbols(mean1, mean1, circles = sd1, inches = FALSE, add = TRUE, fg = "black", lwd = 2)
symbols(mean2, mean2, circles = sd2, inches = FALSE, add = TRUE, fg = "red", lwd = 2)
points(z$centers, col = 2:1, bg = 1:2, pch = 21, cex = 2) # big filled dots for centers
# Example using randomly selected rows from data source as initial centers
# but with seed set for reproducibility
## Not run:
z <- rxKmeans(~ x + y, data = DF, numClusters = 2, seed=18)
## End(Not run)
# Example using first rows from SPSS data source as initial centers
spssFile <- file.path(rxGetOption("sampleDataDir"),"claims.sav")
spssDS <- RxSpssData(spssFile, colClasses = c(cost = "integer"))
resultSpss <- rxKmeans(~cost, data = spssDS, numClusters = 2, numStarts = 0)