rxCrossTabs: Cross Tabulation
Description
Use rxCrossTabs to create contingency tables from cross-classifying factors using a formula interface. It performs computations equivalent to those of the rxCube function, but returns its results in a different way.
Usage
rxCrossTabs(formula, data, pweights = NULL, fweights = NULL, means = FALSE,
    marginals = FALSE, cube = FALSE, rowSelection = NULL,
    transforms = NULL, transformObjects = NULL,
    transformFunc = NULL, transformVars = NULL,
    transformPackages = NULL, transformEnvir = NULL,
    useSparseCube = rxGetOption("useSparseCube"),
    removeZeroCounts = useSparseCube, returnXtabs = FALSE, na.rm = FALSE,
    blocksPerRead = rxGetOption("blocksPerRead"),
    reportProgress = rxGetOption("reportProgress"), verbose = 0,
    computeContext = rxGetOption("computeContext"), ...)
## S3 method for class `rxCrossTabs':
print (x, output, header = TRUE, marginals = FALSE,
na.rm = FALSE, ...)
## S3 method for class `rxCrossTabs':
summary (object, output, type = "%", na.rm = FALSE, ...)
## S3 method for class `rxCrossTabs':
as.list (x, output, marginals = FALSE, na.rm = FALSE, ...)
## S3 method for class `rxCrossTabs':
mean (x, marginals = TRUE, na.rm = FALSE, ...)
Arguments
formula
formula as described in rxFormula with the categorical cross-classifying variables (separated by :) on the right-hand side.
data
either a data source object, a character string specifying a .xdf file, or a data frame object containing the cross-classifying variables.
pweights
character string specifying the variable to use as probability weights for the observations.
fweights
character string specifying the variable to use as frequency weights for the observations.
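A frequency weight of k is conventionally read as "this row stands for k identical observations". The following base R sketch (it does not use RevoScaleR, and the data frame and its column names are hypothetical) illustrates that reading by comparing weighted counts with counts from explicitly replicated rows.

```r
# Hypothetical data: 'fwts' holds the frequency weight of each row.
obs <- data.frame(sex  = factor(c("Male", "Male", "Female", "Male")),
                  fwts = 1:4)

# Weighted counts: sum the frequency weights within each level of 'sex'.
weightedCounts <- xtabs(fwts ~ sex, data = obs)

# Equivalent unweighted data: repeat each row 'fwts' times, then count rows.
expanded <- obs[rep(seq_len(nrow(obs)), obs$fwts), ]
replicatedCounts <- table(expanded$sex)

# Both approaches give the same counts per level.
all(as.vector(weightedCounts) == as.vector(replicatedCounts))
```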
means
logical value. If TRUE, the mean values of the contingency table are also stored in the output object along with the sums and counts. By default, if the mean values are stored, the print and summary methods display them. However, the output argument in those methods can be used to override this behavior by setting output equal to "sums" or "counts".
marginals
logical value. If TRUE, a list of marginal table values is stored as an attribute named "marginals" for each of the contingency tables. Each marginals list contains entries for the row, column and grand totals or means, depending on the type of data table. To access them directly, use the rxMarginals function.
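As a rough illustration of what row, column and grand totals mean for a table of sums, here is a base R sketch. It uses xtabs rather than rxCrossTabs, so it shows the arithmetic only and not the structure of the "marginals" attribute itself.

```r
# Two-way table of sums built with base R (not RevoScaleR).
admissions <- as.data.frame(UCBAdmissions)
sumsTab <- xtabs(Freq ~ Gender + Admit, data = admissions)

rowTotals  <- rowSums(sumsTab)   # one total per Gender level
colTotals  <- colSums(sumsTab)   # one total per Admit level
grandTotal <- sum(sumsTab)       # total over all cells

# addmargins() appends the same totals to the table itself.
addmargins(sumsTab)
```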
cube
logical value. If TRUE, the C++ cube functionality is called.
rowSelection
name of a logical variable in the data set (in quotes) or a logical expression using variables in the data set to specify row selection. For example, rowSelection = "old" will use only observations in which the value of the variable old is TRUE. rowSelection = (age > 20) & (age < 65) & (log(income) > 10) will use only observations in which the value of the age variable is between 20 and 65 and the value of the log of the income variable is greater than 10. The row selection is performed after processing any data transformations (see the arguments transforms or transformFunc). As with all expressions, rowSelection can be defined outside of the function call using the expression function.
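The sketch below shows one way a row selection could be stored with expression() before the call. The data frame and its columns are hypothetical. The base R part only checks which rows the condition keeps; the rxCrossTabs call is guarded because it requires RevoScaleR, and passing the stored expression directly is assumed to work in the same way the Examples section does for transforms.

```r
# Hypothetical data frame, used only for illustration.
people <- data.frame(
  sex    = factor(c("Male", "Female", "Female", "Male")),
  age    = c(18, 30, 45, 70),
  income = c(15000, 40000, 90000, 30000)
)

# expression() stores the condition unevaluated, so it can be built
# before the call and passed in later.
keepRows <- expression((age > 20) & (age < 65) & (log(income) > 10))

# Base R check of which rows the condition selects.
selected <- eval(keepRows[[1]], people)
people[selected, ]

# The rxCrossTabs call needs RevoScaleR, so it is skipped when the
# package is not installed.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  RevoScaleR::rxCrossTabs(~ sex, data = people, rowSelection = keepRows)
}
```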
transforms
an expression of the form list(name = expression, ...) representing the first round of variable transformations. As with all expressions, transforms (or rowSelection) can be defined outside of the function call using the expression function.
transformObjects
a named list containing objects that can be referenced by transforms, transformFunc, and rowSelection.
transformFunc
variable transformation function. The variables used in the transformation function must be specified in transformVars if they are not variables used in the model. See rxTransform for details.
transformVars
character vector of input data set variables needed for the transformation function. See rxTransform for details.
transformPackages
character vector defining additional R packages (outside of those specified in rxGetOption("transformPackages")) to be made available and preloaded for use in variable transformation functions, e.g., those explicitly defined in RevoScaleR functions via their transforms and transformFunc arguments or those defined implicitly via their formula or rowSelection arguments. The transformPackages argument may also be NULL, indicating that no packages outside rxGetOption("transformPackages") will be preloaded.
transformEnvir
user-defined environment to serve as a parent to all environments developed internally and used for variable data transformation. If transformEnvir = NULL, a new "hash" environment with parent baseenv() is used instead.
useSparseCube
logical value. If TRUE, sparse cube is used. For large crosstab computations, R may run out of memory due to the resulting expanded contingency tables even if the internal C++ computation succeeds. In such cases, try rxCube instead.
removeZeroCounts
logical flag. If TRUE, rows with no observations will be removed from the contingency tables. By default, it has the same value as useSparseCube. Note that this affects only those zero counts in the final contingency table for which there are no observations in the input data; if the input data contains a row with frequency zero, it will still be reported in the final contingency table. Set this to TRUE if the total number of combinations of factor values on the right-hand side of the formula is large enough that R might run out of memory when handling the resulting contingency table.
returnXtabs
logical flag. If TRUE, an object of class xtabs is returned. Note that the only difference between the structures of an equivalent xtabs call output and the output of rxCrossTabs(..., returnXtabs = TRUE) is that they will contain different "call" attributes. Note also that xtabs expects the cross-classifying variables in the formula to be separated by plus (+) symbols whereas rxCrossTabs expects them to be separated by colon (:) symbols.
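The difference in formula syntax can be sketched as follows. The xtabs call is plain base R; the rxCrossTabs call is guarded because it requires RevoScaleR, and the comparison at the end simply mirrors the one used in the Examples section.

```r
# Base R call: cross-classifying variables joined with '+'.
admissions <- as.data.frame(UCBAdmissions)
baseTab <- xtabs(Freq ~ Gender + Admit, data = admissions)
class(baseTab)

# The matching rxCrossTabs call joins the same variables with ':'.
# It needs RevoScaleR, so it is skipped when the package is missing.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  rxTab <- RevoScaleR::rxCrossTabs(Freq ~ Gender : Admit, data = admissions,
                                   returnXtabs = TRUE)
  # Per the description above, only the "call" attribute should differ.
  print(all.equal(rxTab, baseTab, check.attributes = FALSE))
}
```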
na.rm
logical value. If TRUE, NA values are removed when calculating the marginal means of the contingency tables.
blocksPerRead
number of blocks to read for each chunk of data read from the data source.
reportProgress
integer value. Options are:
0: no progress is reported.
1: the number of processed rows is printed and updated.
2: rows processed and timings are reported.
3: rows processed and all timings are reported.
verbose
integer value. If 0, no additional output is printed. If 1, additional summary information is printed.
computeContext
a valid RxComputeContext. The RxSpark and RxHadoopMR compute contexts distribute the computation among the nodes specified by the compute context; for other compute contexts, the computation is distributed if possible on the local computer.
...
for rxCrossTabs, additional arguments to be passed directly to the base computational function.
x, object
objects of class rxCrossTabs.
output
character string used to specify the type of output to display. Choices are "sums", "counts" and "means".
header
logical value. If TRUE, header information is printed.
type
character string used to specify the summary to create. Choices are "%" or "percentages", which summarizes the cross-tabulation results with percentages, and "chisquare", which performs a chi-squared test for independence of factors.
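For orientation, the chi-squared test for independence of factors is also what base R reports from the summary method of an xtabs table. The sketch below uses that base R method only; it is an analogy for type = "chisquare", not the RevoScaleR implementation.

```r
# Base R two-way table (not RevoScaleR).
admissions <- as.data.frame(UCBAdmissions)
tab <- xtabs(Freq ~ Gender + Admit, data = admissions)

# summary() on a table runs a chi-squared test for independence.
chi <- summary(tab)
chi$statistic   # chi-squared statistic
chi$parameter   # degrees of freedom
chi$p.value     # p-value of the test
```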
Details
The output is returned in a list, and the print and summary methods can be used to display and summarize the contingency table(s) contained in each element of the output list. The print method produces output similar to that of the xtabs function. The summary method produces a summary table for each output contingency table and displays the column, row, and total table percentages as well as the counts.
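The three kinds of percentages mentioned above can be reproduced in base R with prop.table. This is only a sketch of the arithmetic on an xtabs table; the layout of the rxCrossTabs summary output is not shown here.

```r
# Base R two-way table (not RevoScaleR).
admissions <- as.data.frame(UCBAdmissions)
tab <- xtabs(Freq ~ Gender + Admit, data = admissions)

rowPct   <- 100 * prop.table(tab, margin = 1)  # each row sums to 100
colPct   <- 100 * prop.table(tab, margin = 2)  # each column sums to 100
totalPct <- 100 * prop.table(tab)              # whole table sums to 100

round(rowPct, 1)
```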
Value
an object of class rxCrossTabs that contains a list of elements described as follows:
sums
list of contingency tables whose values are cross-tabulation sums. This object is NULL if there are no dependent variables specified in the formula. The names of the list objects are built using the dependent variables specified in the formula (if they exist) along with the independent variable factor levels corresponding to each contingency table. For example, z <- rxCrossTabs(ncontrols ~ agegp : alcgp : tobgp, esoph); names(z$sums) will return the character vector with elements "ncontrols, tobgp = 0-9g/day", "ncontrols, tobgp = 10-19", "ncontrols, tobgp = 20-29", "ncontrols, tobgp = 30+". Typically, you should rely on the print or summary methods to display the cross-tabulation results, but you can also directly access an individual contingency table using its name in R's standard list data access methods. For example, to access the "ncontrols, tobgp = 10-19" table containing cross-tabulation summations, use z$sums[["ncontrols, tobgp = 10-19"]] or equivalently z$sums[[2]]. To print the entire list of cross-tabulation summations, use print(z, output="sums").
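To make the naming scheme concrete, the following base R sketch builds a comparable named list by slicing a three-way xtabs table along tobgp. The object names tab3 and sumsList are hypothetical, and the list only imitates the shape of z$sums; it is not produced by rxCrossTabs.

```r
# Three-way table of sums in base R (note the '+' separators for xtabs).
tab3 <- xtabs(ncontrols ~ agegp + alcgp + tobgp, data = esoph)
tobLevels <- dimnames(tab3)$tobgp

# One agegp-by-alcgp table per tobgp level, named as described above.
sumsList <- lapply(tobLevels, function(lv) tab3[, , lv])
names(sumsList) <- paste0("ncontrols, tobgp = ", tobLevels)

names(sumsList)

# Access by name or by position; both return the same table.
identical(sumsList[["ncontrols, tobgp = 10-19"]], sumsList[[2]])
```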
counts
list of contingency tables whose values are cross-tabulation counts. The names of the list objects are equivalent to those of the 'sums' output list.
means
list of contingency tables containing cross-tabulation mean values. This object is NULL if there are no dependent variables specified in the formula. The 'means' list is returned only if the user has specified means=TRUE in the call to rxCrossTabs. If means=FALSE in the call, mean values still may be calculated and returned using the print and summary methods with an output="means" argument. In this case, the mean values are calculated dynamically. If you wish to have quick access to the means, use means=TRUE in the call to rxCrossTabs. The names of the list objects are equivalent to those of the 'sums' output list.
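A cell mean can be pictured as the cell sum divided by the cell count. The base R sketch below works under that assumption (the source does not spell out the formula) and uses xtabs rather than rxCrossTabs.

```r
# Base R tables of sums and counts for the same cross-classification.
admissions <- as.data.frame(UCBAdmissions)
sumsTab   <- xtabs(Freq ~ Gender + Admit, data = admissions)  # cell sums
countsTab <- xtabs(~ Gender + Admit, data = admissions)       # rows per cell

# Assumed relationship: mean = sum / count, computed cell by cell.
meansTab <- sumsTab / countsTab
meansTab
```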
call
original call to the underlying rxCrossTabs.formula method.
chisquare
list of chi-square tests, one for each cross-tabulation table. Each entry contains the results of a chi-squared test for independence of factors as used in the summary method for the xtabs function. The names of the list objects are equivalent to those of the 'sums' output list.
formula
formula used in the rxCrossTabs call.
depvars
character vector of dependent variable names as extracted from the formula.
Author(s)
Microsoft Corporation Microsoft Technical Support
See Also
xtabs, rxMarginals, rxCube, as.xtabs, rxChiSquaredTest, rxFisherTest, rxKendallCor, rxPairwiseCrossTab, rxRiskRatio, rxOddsRatio, rxTransform.
Examples
# Basic data.frame source example
admissions <- as.data.frame(UCBAdmissions)
admissCTabs <- rxCrossTabs(Freq ~ Gender : Admit, data = admissions)
# print different outputs and summarize different types
print(admissCTabs) # same as print(admissCTabs, output = "sums")
print(admissCTabs, output = "counts")
print(admissCTabs, output = "means")
summary(admissCTabs) # same as summary(admissCTabs, type = "%")
summary(admissCTabs, output="means", type = "%")
summary(admissCTabs, type = "chisquare")
# Example using multiple dependent variables in formula
rxCrossTabs(ncontrols ~ agegp : alcgp : tobgp, data = esoph)
rxCrossTabs(ncases ~ agegp : alcgp : tobgp, data = esoph)
esophCTabs <-
    rxCrossTabs(cbind(ncases, ncontrols) ~ agegp : alcgp : tobgp, esoph)
esophCTabs
# Obtaining the mean values
esophMeans <- mean(esophCTabs, marginals = FALSE)
esophMeans
esophMeans <- mean(esophCTabs, marginals = TRUE)
esophMeans
# XDF example: small subset of census data
censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
censusCTabs <- rxCrossTabs(wkswork1 ~ sex : F(age), data = censusWorkers,
    pweights = "perwt", blocksPerRead = 3)
censusCTabs
barplot(censusCTabs$sums$wkswork1/1e6, xlab = "Age (years)",
    ylab = "Population (millions)", beside = TRUE,
    legend.text = c("Male", "Female"))
# perform a census crosstab, limiting the analysis to ages
# on the interval [20, 65]. Verify the age range from the output.
censusXtabAge.20.65 <- rxCrossTabs(wkswork1 ~ sex : F(age), data = censusWorkers,
    rowSelection = age >= 20 & age <= 65)
ageRange <- range(as.numeric(colnames(censusXtabAge.20.65$sums$wkswork1)))
(ageRange[1] >= 20 & ageRange[2] <=65)
# Create a data frame
myDF <- data.frame(sex = c("Male", "Male", "Female", "Male"),
age = c(20, 20, 12, 15), score = 1.1:4.1, sport=c(1:3,2))
# Use the 'transforms' argument to dynamically transform the
# variables of the data source. Here, we form a named list of
# transformation expressions. To avoid evaluation when assigning
# to a local variable, we wrap the transformation list with expression().
transforms <- expression(list(
    ageHalved = age/2,
    sport = factor(sport, labels=c("tennis", "golf", "football"))))
rxCrossTabs(score ~ sport:sex, data = myDF, transforms = transforms)
rxCrossTabs(~ sport : F(ageHalved, low = 7, high = 10), data = myDF,
    transforms = transforms)
# Arithmetic formula expression only (no transformFunc specification).
rxCrossTabs(log(score) ~ F(age) : sex, data = myDF)
# No transformFunc or formula arithmetic expressions.
rxCrossTabs(score ~ F(age) : sex, data = myDF)
# Transform a categorical variable to a continuous one and use it
# as a response variable in the formula for cross-tabulation.
# The transformation is equivalent to doing the following, which
# is reflected in the cross-tabulation results.
#
# > as.numeric(as.factor(c(20,20,12,15))) - 1
# [1] 2 2 0 1
#
# Note that the effect of N() is to return the factor codes
myDF <- data.frame(sex = c("Male", "Male", "Female", "Male"),
age = factor(c(20, 20, 12, 15)), score = factor(1.1:4.1))
rxCrossTabs(N(age) ~ sex : score, data = myDF)
# To transform a categorical variable (like age) that has numeric levels
# (as opposed to codes), use the following construction:
myDF <- data.frame(sex = c("Male", "Male", "Female", "Male"),
age = factor(c(20, 20, 12, 15)), score = factor(1.1:4.1))
rxCrossTabs(as.numeric(levels(age))[age] ~ sex : score, data = myDF)
# this should break because 'age' is a categorical variable
## Not run:
try(rxCrossTabs(age ~ sex + score, data = myDF))
## End(Not run)
# frequency weighting
fwts <- 1:4
sex <- c("Male", "Male", "Female", "Male")
age <- c(20, 20, 12, 15)
score <- 1.1:4.1
myDF1 <- data.frame(sex = sex, age = age, score = factor(score), fwts = fwts)
myDF2 <- data.frame(sex = rep(sex, fwts), age = rep(age, fwts),
    score = factor(rep(score, fwts)))
mySums1 <- rxCrossTabs(age ~ sex : score, data = myDF1,
    fweights = "fwts")$sums$age[c("Male", "Female"),]
mySums2 <- rxCrossTabs(age ~ sex : score,
    data = myDF2)$sums$age[c("Male", "Female"),]
all.equal(mySums1, mySums2)
# Compare xtabs and rxCrossTabs(..., returnXtabs = TRUE)
# results for 3-way interaction, one dependent variable
set.seed(100)
divs <- letters[1:5]
glads <- c("spartacus", "crixus")
romeDF <- data.frame(division = rep(divs, 5L),
    score = runif(25, min = 0, max = 10),
    rank = runif(25, min = 1, max = 100),
    gladiator = c(rep(glads[1L], 12L), rep(glads[2L], 13L)),
    arena = sample(c("colosseum", "ludus", "market"), 25L, replace = TRUE))
z1 <- rxCrossTabs(score ~ division : gladiator : arena, data = romeDF, returnXtabs = TRUE)
z2 <- xtabs(score ~ division + gladiator + arena, romeDF)
all.equal(z1, z2, check.attributes = FALSE) # all the same except "call" attribute
# Compare xtabs and rxCrossTabs(..., returnXtabs = TRUE)
# results for 3-way interaction, multiple dependent variables
z1 <- rxCrossTabs(cbind(score, rank) ~ division : gladiator : arena, data = romeDF, returnXtabs = TRUE, means = TRUE)
z2 <- xtabs(cbind(score, rank) ~ division + gladiator + arena, romeDF)
all.equal(z1, z2, check.attributes = FALSE) # all the same except "call" attribute
# Compare xtabs and rxCrossTabs(..., returnXtabs = TRUE)
# results for 3-way interaction, no dependent variable
z1 <- rxCrossTabs( ~ division : gladiator : arena, data = romeDF, returnXtabs = TRUE, means = TRUE)
z2 <- xtabs(~ division + gladiator + arena, romeDF)
all.equal(z1, z2, check.attributes = FALSE) # all the same except "call" attribute
# removeZeroCounts
admissions <- as.data.frame(UCBAdmissions)
admissions[admissions$Dept == "F", "Freq"] <- 0
# removeZeroCounts does not make a difference for the zero values observed from input data
crossTab1 <- rxCrossTabs(Freq ~ Dept : Gender, data = admissions, removeZeroCounts = TRUE)
crossTab2 <- rxCrossTabs(Freq ~ Dept : Gender, data = admissions)
all.equal(as.data.frame(crossTab1$sums$Freq), as.data.frame(crossTab2$sums$Freq))
# removeZeroCounts removes the missing values that are not observed from input data
admissions_NoZero <- admissions[admissions$Dept != "F",]
crossTab1 <- rxCrossTabs(Freq ~ Dept : Gender, data = admissions, removeZeroCounts = TRUE, rowSelection = (Freq != 0))
crossTab2 <- rxCrossTabs(Freq ~ Dept : Gender, data = admissions_NoZero, removeZeroCounts = TRUE)
all.equal(as.data.frame(crossTab1$sums$Freq), as.data.frame(crossTab2$sums$Freq))