rxCrossTabs: Cross Tabulation

Description

Use rxCrossTabs to create contingency tables from cross-classifying factors using a formula interface. It performs computations equivalent to those of the rxCube function, but returns its results in a different way.

Usage

  rxCrossTabs(formula, data, pweights = NULL, fweights = NULL, means = FALSE,
              marginals = FALSE, cube = FALSE, rowSelection = NULL,
              transforms = NULL, transformObjects = NULL,
              transformFunc = NULL, transformVars = NULL, 
              transformPackages = NULL, transformEnvir = NULL, 
              useSparseCube = rxGetOption("useSparseCube"),            
              removeZeroCounts = useSparseCube, returnXtabs = FALSE, na.rm = FALSE,
              blocksPerRead = rxGetOption("blocksPerRead"),
              reportProgress = rxGetOption("reportProgress"), verbose = 0,
              computeContext = rxGetOption("computeContext"), ...)

 ## S3 method for class `rxCrossTabs':
print  (x, output, header = TRUE, marginals = FALSE,
        na.rm = FALSE, ...)

 ## S3 method for class `rxCrossTabs':
summary  (object, output, type = "%", na.rm = FALSE, ...)

 ## S3 method for class `rxCrossTabs':
as.list  (x, output, marginals = FALSE, na.rm = FALSE, ...)

 ## S3 method for class `rxCrossTabs':
mean  (x, marginals = TRUE, na.rm = FALSE, ...)

Arguments

formula

formula as described in rxFormula, with the categorical cross-classifying variables (separated by :) on the right-hand side.

data

either a data source object, a character string specifying a .xdf file, or a data frame object containing the cross-classifying variables.

pweights

character string specifying the variable to use as probability weights for the observations.

fweights

character string specifying the variable to use as frequency weights for the observations.

means

logical value. If TRUE, the mean values of the contingency table are also stored in the output object along with the sums and counts. By default, if the mean values are stored, the print and summary methods display them. However, the output argument in those methods can be used to override this behavior by setting output equal to "sums" or "counts".

marginals

logical value. If TRUE, a list of marginal table values is stored as an attribute named "marginals" for each of the contingency tables. Each marginals list contains entries for the row, column and grand totals or means, depending on the type of data table. To access them directly, use the rxMarginals function.

cube

logical value. If TRUE, the C++ cube functionality is called.

rowSelection

name of a logical variable in the data set (in quotes) or a logical expression using variables in the data set to specify row selection. For example, rowSelection = "old" will use only observations in which the value of the variable old is TRUE. rowSelection = (age > 20) & (age < 65) & (log(income) > 10) will use only observations in which the value of the age variable is between 20 and 65 and the value of the log of the income variable is greater than 10. The row selection is performed after processing any data transformations (see the arguments transforms or transformFunc). As with all expressions, rowSelection can be defined outside of the function call using the expression function.

transforms

an expression of the form list(name = expression, ...) representing the first round of variable transformations. As with all expressions, transforms (or rowSelection) can be defined outside of the function call using the expression function.

transformObjects

a named list containing objects that can be referenced by transforms, transformFunc, and rowSelection.

transformFunc

variable transformation function. The variables used in the transformation function must be specified in transformVars if they are not variables used in the model. See rxTransform for details.

transformVars

character vector of input data set variables needed for the transformation function. See rxTransform for details.

transformPackages

character vector defining additional R packages (outside of those specified in rxGetOption("transformPackages")) to be made available and preloaded for use in variable transformation functions, e.g., those explicitly defined in RevoScaleR functions via their transforms and transformFunc arguments or those defined implicitly via their formula or rowSelection arguments. The transformPackages argument may also be NULL, indicating that no packages outside rxGetOption("transformPackages") will be preloaded.

transformEnvir

user-defined environment to serve as a parent to all environments developed internally and used for variable data transformation. If transformEnvir = NULL, a new "hash" environment with parent baseenv() is used instead.

useSparseCube

logical value. If TRUE, sparse cube is used. For large crosstab computations, R may run out of memory because of the resulting expanded contingency tables, even if the internal C++ computation succeeds. In such cases, try rxCube instead.

removeZeroCounts

logical flag. If TRUE, rows with no observations will be removed from the contingency tables. By default, it has the same value as useSparseCube. Please note this affects only those zeroed counts in the final contingency table for which there are no observations in the input data. However, if the input data contains a row with frequency zero it will be reported in the final contingency table. This should be set to TRUE if the total number of combinations of factor values on the right-hand side of the formula is significant and as a result R might run out of memory when handling the resulting large contingency table.

returnXtabs

logical flag. If TRUE, an object of class xtabs is returned. Note that the only difference between the structure of the output of an equivalent xtabs call and the output of rxCrossTabs(..., returnXtabs = TRUE) is that they contain different "call" attributes. Note also that xtabs expects the cross-classifying variables in the formula to be separated by plus (+) symbols, whereas rxCrossTabs expects them to be separated by colon (:) symbols.

na.rm

logical value. If TRUE, NA values are removed when calculating the marginal means of the contingency tables.

blocksPerRead

number of blocks to read for each chunk of data read from the data source.

reportProgress

integer value. Options are:

  • 0: no progress is reported.
  • 1: the number of processed rows is printed and updated.
  • 2: rows processed and timings are reported.
  • 3: rows processed and all timings are reported.

verbose

integer value. If 0, no additional output is printed. If 1, additional summary information is printed.

computeContext

a valid RxComputeContext. The RxSpark and RxHadoopMR compute contexts distribute the computation among the nodes specified by the compute context; for other compute contexts, the computation is distributed if possible on the local computer.

...

for rxCrossTabs, additional arguments to be passed directly to the base computational function.

x, object

objects of class rxCrossTabs.

output

character string used to specify the type of output to display. Choices are "sums", "counts" and "means".

header

logical value. If TRUE, header information is printed.

type

character string used to specify the summary to create. Choices are "%" or "percentages", to summarize the cross-tabulation results with percentages, and "chisquare", to perform a chi-squared test for independence of factors.
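
The row, column, and grand totals described under the marginals argument can be pictured with base R alone. The following is a minimal sketch that uses xtabs, margin.table, and addmargins as stand-ins; it does not call rxCrossTabs and is not meant to reproduce the exact structure of the "marginals" attribute, only the quantities it is described as holding.

```r
# Base R analogue only: rxCrossTabs is documented as storing these
# totals in a "marginals" attribute rather than in separate objects.
admissions <- as.data.frame(UCBAdmissions)
tab <- xtabs(Freq ~ Gender + Admit, data = admissions)

rowTotals  <- margin.table(tab, 1)  # one total per Gender level
colTotals  <- margin.table(tab, 2)  # one total per Admit level
grandTotal <- sum(tab)              # total over all cells

# addmargins() appends the same totals as a "Sum" row and column
withMargins <- addmargins(tab)
print(withMargins)
```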

Details

The output is returned in a list and the print and summary methods can be used to display and summarize the contingency table(s) contained in each element of the output list. The print method produces an output similar to that of the xtabs function. The summary method produces a summary table for each output contingency table and displays the column, row, and total table percentages as well as the counts.
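
As a rough illustration of the quantities the summary method reports, the sketch below computes row, column, and total percentages for a two-way table with base R's prop.table, and then runs base R's chi-squared test of independence through summary on the xtabs object. It is a base R analogue under the assumption that the percentages are taken over the cell values; it does not show the layout that summary produces for rxCrossTabs objects.

```r
# Base R analogue of the percentage and chi-squared summaries.
admissions <- as.data.frame(UCBAdmissions)
tab <- xtabs(Freq ~ Gender + Admit, data = admissions)

rowPct   <- 100 * prop.table(tab, 1)  # each row sums to 100
colPct   <- 100 * prop.table(tab, 2)  # each column sums to 100
totalPct <- 100 * prop.table(tab)     # the whole table sums to 100

print(round(rowPct, 1))
print(round(colPct, 1))
print(round(totalPct, 1))

# summary() on an xtabs object performs a chi-squared test
# for independence of the factors
chisq <- summary(tab)
print(chisq)
```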

Value

an object of class rxCrossTabs that contains a list of elements described as follows:

sums

list of contingency tables whose values are cross-tabulation sums. This object is NULL if there are no dependent variables specified in the formula. The names of the list objects are built using the dependent variables specified in the formula (if they exist) along with the independent variable factor levels corresponding to each contingency table. For example, z <- rxCrossTabs(ncontrols ~ agegp : alcgp : tobgp, esoph); names(z$sums) will return the character vector with elements "ncontrols, tobgp = 0-9g/day", "ncontrols, tobgp = 10-19", "ncontrols, tobgp = 20-29", "ncontrols, tobgp = 30+". Typically, you should rely on the print or summary methods to display the cross-tabulation results, but you can also access an individual contingency table directly by name using R's standard list access methods. For example, to access the "ncontrols, tobgp = 10-19" table of cross-tabulation sums, use z$sums[["ncontrols, tobgp = 10-19"]] or, equivalently, z$sums[[2]]. To print the entire list of cross-tabulation sums, use print(z, output = "sums").

counts

list of contingency tables whose values are cross-tabulation counts. The names of the list objects are equivalent to those of the 'sums' output list.

means

list of contingency tables containing cross tabulation mean values. This object is NULL if there are no dependent variables specified in the formula. The 'means' list is returned only if the user has specified means=TRUE in the call to rxCrossTabs. If means=FALSE in the call, mean values still may be calculated and returned using the print and summary methods with an output="means" argument. In this case, the mean values are calculated dynamically. If you wish to have quick access to the means, use means=TRUE in the call to rxCrossTabs. The names of the list objects are equivalent to those of the 'sums' output list.

call

original call to the underlying rxCrossTabs.formula method.

chisquare

list of chi-square tests, one for each cross-tabulation table. Each entry contains the results of a chi-squared test for independence of factors as used in the summary method for the xtabs function. The names of the list objects are equivalent to those of the 'sums' output list.

formula

formula used in the rxCrossTabs call.

depvars

character vector of dependent variable names as extracted from the formula.
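
The list layout and naming scheme described for the sums, counts, and means elements can be mimicked in base R. The sketch below is an illustration only: it builds a three-way xtabs table from the esoph data, splits it into one two-way table per tobgp level, and names the pieces in the style shown above. The object names (tab3, sumsList, and so on) are hypothetical, and the final step assumes that a cell mean is simply the cell sum divided by the cell count, which the source does not state explicitly.

```r
# Base R sketch of the documented list layout; not the rxCrossTabs internals.
tab3 <- xtabs(ncontrols ~ agegp + alcgp + tobgp, data = esoph)
tobLevels <- dimnames(tab3)$tobgp

# One agegp x alcgp table per level of the last factor
sumsList <- lapply(seq_along(tobLevels), function(k) tab3[, , k])
names(sumsList) <- paste0("ncontrols, tobgp = ", tobLevels)

print(names(sumsList))

# Access a single table by name or, equivalently, by position
byName  <- sumsList[["ncontrols, tobgp = 10-19"]]
byIndex <- sumsList[[2]]
print(identical(byName, byIndex))

# Assumed relationship: mean = sum / count in each cell.
# Cells with no observations give 0 / 0, which is NaN.
counts3 <- xtabs(~ agegp + alcgp + tobgp, data = esoph)
means3  <- tab3 / counts3
```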

Author(s)

Microsoft Corporation Microsoft Technical Support

See Also

xtabs, rxMarginals, rxCube, as.xtabs, rxChiSquaredTest, rxFisherTest, rxKendallCor, rxPairwiseCrossTab, rxRiskRatio, rxOddsRatio, rxTransform.

Examples


 # Basic data.frame source example
 admissions <- as.data.frame(UCBAdmissions)
 admissCTabs <- rxCrossTabs(Freq ~ Gender : Admit, data = admissions)

 # print different outputs and summarize different types
 print(admissCTabs) # same as print(admissCTabs, output = "sums")
 print(admissCTabs, output = "counts")
 print(admissCTabs, output = "means")
 summary(admissCTabs) # same as summary(admissCTabs, type = "%")
 summary(admissCTabs, output="means", type = "%")
 summary(admissCTabs, type = "chisquare")

 # Example using multiple dependent variables in formula 
 rxCrossTabs(ncontrols ~ agegp : alcgp : tobgp, data = esoph)
 rxCrossTabs(ncases ~ agegp : alcgp : tobgp, data = esoph)
 esophCTabs <-
     rxCrossTabs(cbind(ncases, ncontrols) ~ agegp : alcgp : tobgp, esoph)
 esophCTabs

 # Obtaining the mean values
 esophMeans <- mean(esophCTabs, marginals = FALSE)
 esophMeans
 esophMeans <- mean(esophCTabs, marginals = TRUE)
 esophMeans

 # XDF example: small subset of census data
 censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
 censusCTabs <- rxCrossTabs(wkswork1 ~ sex : F(age), data = censusWorkers,
   pweights = "perwt", blocksPerRead = 3)
 censusCTabs
 barplot(censusCTabs$sums$wkswork1/1e6, xlab = "Age (years)",
   ylab = "Population (millions)", beside = TRUE,
   legend.text = c("Male", "Female"))

 # perform a census crosstab, limiting the analysis to ages
 # on the interval [20, 65]. Verify the age range from the output.
 censusXtabAge.20.65 <- rxCrossTabs(wkswork1 ~ sex : F(age), data = censusWorkers,
     rowSelection = age >= 20 & age <= 65)
 ageRange <- range(as.numeric(colnames(censusXtabAge.20.65$sums$wkswork1)))
 (ageRange[1] >= 20 & ageRange[2] <=65)

 # Create a data frame 
 myDF <- data.frame(sex = c("Male", "Male", "Female", "Male"),
   age = c(20, 20, 12, 15), score = 1.1:4.1, sport=c(1:3,2))

 # Use the 'transforms' argument to dynamically transform the
 # variables of the data source. Here, we form a named list of 
 # transformation expressions. To avoid evaluation when assigning
 # to a local variable, we wrap the transformation list with expression().
 transforms <- expression(list(
   ageHalved = age/2,
   sport = factor(sport, labels=c("tennis", "golf", "football"))))
 rxCrossTabs(score ~ sport:sex, data = myDF, transforms = transforms)
 rxCrossTabs(~ sport : F(ageHalved, low = 7, high = 10), data = myDF, 
   transforms = transforms)

 # Arithmetic formula expression only (no transformFunc specification).
 rxCrossTabs(log(score) ~ F(age) : sex, data = myDF)

 # No transformFunc or formula arithmetic expressions.
 rxCrossTabs(score ~ F(age) : sex, data = myDF)

 # Transform a categorical variable to a continuous one and use it
 # as a response variable in the formula for cross-tabulation.
 # The transformation is equivalent to doing the following, which
 # is reflected in the cross-tabulation results.
 #
 #   > as.numeric(as.factor(c(20,20,12,15))) - 1
 #   [1] 2 2 0 1
 # 
 # Note that the effect of N() is to return the factor codes
 myDF <- data.frame(sex = c("Male", "Male", "Female", "Male"),
   age = factor(c(20, 20, 12, 15)), score = factor(1.1:4.1))
 rxCrossTabs(N(age) ~ sex : score, data = myDF)

 # To transform a categorical variable (like age) that has numeric levels
 # (as opposed to codes), use the following construction:
 myDF <- data.frame(sex = c("Male", "Male", "Female", "Male"),
                    age = factor(c(20, 20, 12, 15)), score = factor(1.1:4.1))
 rxCrossTabs(as.numeric(levels(age))[age] ~ sex : score, data = myDF)

 # this should break because 'age' is a categorical variable
 ## Not run:

try(rxCrossTabs(age ~ sex + score, data = myDF))
## End(Not run) 


 # frequency weighting
 fwts <- 1:4
 sex <- c("Male", "Male", "Female", "Male")
 age <- c(20, 20, 12, 15)
 score <- 1.1:4.1

 myDF1 <- data.frame(sex = sex, age = age, score = factor(score), fwts = fwts)
 myDF2 <- data.frame(sex = rep(sex, fwts), age = rep(age, fwts),
   score = factor(rep(score, fwts)))

 mySums1 <- rxCrossTabs(age ~ sex : score, data = myDF1,
   fweights = "fwts")$sums$age[c("Male", "Female"),]
 mySums2 <- rxCrossTabs(age ~ sex : score,
   data = myDF2)$sums$age[c("Male", "Female"),]
 all.equal(mySums1, mySums2)

 # Compare xtabs and rxCrossTabs(..., returnXtabs = TRUE)  
 # results for 3-way interaction, one dependent variable
 set.seed(100)
 divs <- letters[1:5]
 glads <- c("spartacus", "crixus")
 romeDF <- data.frame( division = rep(divs, 5L), 
                       score = runif(25, min = 0, max = 10), 
                       rank = runif(25, min = 1, max = 100), 
                       gladiator = c(rep(glads[1L], 12L), rep(glads[2L], 13L)),
                       arena = sample(c("colosseum", "ludus", "market"), 25L, replace = TRUE))

 z1 <- rxCrossTabs(score ~ division : gladiator : arena, data = romeDF, returnXtabs = TRUE)
 z2 <- xtabs(score ~ division + gladiator + arena, romeDF)
 all.equal(z1, z2, check.attributes = FALSE) # all the same except "call" attribute

 # Compare xtabs and rxCrossTabs(..., returnXtabs = TRUE) 
 # results for 3-way interaction, multiple dependent variables
 z1 <- rxCrossTabs(cbind(score, rank) ~ division : gladiator : arena, data = romeDF, returnXtabs = TRUE, means = TRUE)
 z2 <- xtabs(cbind(score, rank) ~ division + gladiator + arena, romeDF)
 all.equal(z1, z2, check.attributes = FALSE) # all the same except "call" attribute

 # Compare xtabs and rxCrossTabs(..., returnXtabs = TRUE) 
 # results for 3-way interaction, no dependent variable
 z1 <- rxCrossTabs( ~ division : gladiator : arena, data = romeDF, returnXtabs = TRUE, means = TRUE)
 z2 <- xtabs(~ division + gladiator + arena, romeDF)
 all.equal(z1, z2, check.attributes = FALSE) # all the same except "call" attribute

 # removeZeroCounts
 admissions <- as.data.frame(UCBAdmissions)
 admissions[admissions$Dept == "F", "Freq"] <- 0

 # removeZeroCounts has no effect on zero values that are observed in the input data
 crossTab1 <- rxCrossTabs(Freq ~ Dept : Gender, data = admissions, removeZeroCounts = TRUE)
 crossTab2 <- rxCrossTabs(Freq ~ Dept : Gender, data = admissions)
 all.equal(as.data.frame(crossTab1$sums$Freq), as.data.frame(crossTab2$sums$Freq))

 # removeZeroCounts removes cells for combinations that are not observed in the input data
 admissions_NoZero <- admissions[admissions$Dept != "F",]
 crossTab1 <- rxCrossTabs(Freq ~ Dept : Gender, data = admissions,  removeZeroCounts = TRUE, rowSelection = (Freq != 0))
 crossTab2 <- rxCrossTabs(Freq ~ Dept : Gender, data = admissions_NoZero, removeZeroCounts = TRUE)
 all.equal(as.data.frame(crossTab1$sums$Freq), as.data.frame(crossTab2$sums$Freq))