rxFactors: Factor Variable Recoding
Description
Recodes a factor variable by mapping one set of factor levels and indices to a new set. Can also be used to convert a non-factor variable into a factor.
Usage
rxFactors(inData, factorInfo, sortLevels = FALSE, otherLevel = NULL,
outFile = NULL, varsToKeep = NULL, varsToDrop = NULL,
overwrite = FALSE, maxRowsByCols = NULL,
blocksPerRead = rxGetOption("blocksPerRead"),
reportProgress = rxGetOption("reportProgress"), verbose = 0,
xdfCompressionLevel = rxGetOption("xdfCompressionLevel"), ...)
Arguments
inData
either an RxXdfData object, a character string specifying the .xdf file, or a data frame.
factorInfo
character vector of variable names, a list of named variable information lists, or empty or NULL. If sortLevels is set to TRUE, the levels of the variables named in the character vector will all be sorted; if sortLevels is TRUE and factorInfo is empty or NULL, all factors will be sorted. If a factorInfo list is provided, each variable information list contains one or more of the named elements given below.
Currently available properties for a column information list are:
levels
- optional vector containing values to match in converting non-factor data to factor levels. If levels = NULL, all of the unique values in the data are converted to levels, in the order encountered. However, the user can override this behavior and sort the resulting levels alphabetically by setting sortLevels = TRUE. The user may also specify a subset of the data to convert to levels. In this case, if otherLevel = NULL, all data values not found in the levels subset will be converted to missing (NA) values. For example, if a variable x consists of the integer data 1, 2, 3, 4, 5, then
factorInfo = list(x = list(levels = 2:4, otherLevel = NULL))
will convert x into a factor with data NA, "2", "3", "4", NA and levels "2", "3", "4". Alternatively, the user may wish to place all of those unspecified values into a single category, say "other". In that case, use otherLevel = "other" along with the subset levels specification. Note that the levels vector may be of any type, e.g., 'integer', 'numeric', 'character'. However, behind the scenes, it is always converted to type 'character', as are the data values being converted. The resulting strings are matched with those of the data to populate the categories.
otherLevel
- character string defining the level to assign to all factor values that are not listed in the newLevels field, if newLevels is specified. If otherLevel = NULL, the default, the factor levels that are not listed in newLevels will be left unchanged and in their original order. If specified, the value set here overrides the default argument of the same name in the primary argument list.
sortLevels
- logical scalar. If TRUE, the resulting levels will be sorted alphabetically. If the input variable is not a factor and levels are not specified, this will be ignored and levels will be in the order in which they are encountered.
varName
- character string defining the name of an existing data variable to recode. If this field is left unspecified, then the name of the corresponding list element in factorInfo will be used. For example, all of the following are acceptable and equivalent factorInfo specifications for alphabetically sorting the levels of an existing factor variable named "myFacVar":
factorInfo = list( myFacVar = list( sortLevels = TRUE ) )
factorInfo = list( list( sortLevels = TRUE, varName = "myFacVar" ) )
factorInfo = list( myFacVar = list( sortLevels = TRUE, varName = "myFacVar" ) )
However, if you wish to rename a variable after conversion (keeping the old variable intact), there is only one acceptable format: the variable to be recoded must appear in the varName field while the new variable name for the converted data must appear as the name of the corresponding list element. For example, to sort the levels of an existing factor variable "myFacVar" and store the result in a new variable "myNewFacVar", you would issue:
factorInfo = list( myNewFacVar = list( sortLevels = TRUE, varName = "myFacVar" ) )
newLevels
- a character vector or list, possibly with named elements, used to rename the levels of a factor. While levels provides a means of filtering the data the user wishes to import and convert to factor levels, newLevels is used to alter converted or existing levels by renaming, collapsing, or sorting them. See the Examples section below for typical use cases of newLevels.
description
- character string defining a description for the recoded variable.
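The levels and otherLevel semantics described above can be approximated in base R. The following is only a sketch of the documented behavior, not a call to rxFactors: the helper name subsetToFactor is hypothetical, and placing the otherLevel category last among the levels is an assumption made for illustration.

```r
# Hypothetical base-R helper mimicking the documented 'levels' / 'otherLevel'
# semantics. It does not call rxFactors.
subsetToFactor <- function(x, levels, otherLevel = NULL) {
  chr <- as.character(x)        # data values are matched as strings
  lev <- as.character(levels)   # requested levels are converted to strings too
  if (is.null(otherLevel)) {
    # Values not found in 'levels' become NA
    return(factor(chr, levels = lev))
  }
  # Values not found in 'levels' are pooled into one extra category
  # (assumption: the extra category is appended after the requested levels)
  chr[!is.na(chr) & !(chr %in% lev)] <- otherLevel
  factor(chr, levels = c(lev, otherLevel))
}

x <- 1:5
f1 <- subsetToFactor(x, levels = 2:4)
f2 <- subsetToFactor(x, levels = 2:4, otherLevel = "other")
print(f1)
print(f2)
```

With x = 1:5 and levels = 2:4, the first call yields missing values for 1 and 5, while the second assigns them to "other".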
sortLevels
the default value to use for the sortLevels field in the factorInfo list.
otherLevel
the default value to use for the otherLevel field in the factorInfo list.
outFile
either an RxXdfData object, a character string specifying the .xdf file, or NULL. If outFile = NULL, a data frame is returned. When writing to HDFS, outFile must be an RxXdfData object representing a new composite XDF.
varsToKeep
character vector of variable names to include in the data file. If NULL, the argument is ignored. Cannot be used with varsToDrop.
varsToDrop
character vector of variable names to exclude from the data file. If NULL, the argument is ignored. Cannot be used with varsToKeep.
overwrite
logical value. If TRUE, an existing outFile will be overwritten. Ignored if a data frame is returned.
maxRowsByCols
this argument is used only when inData refers to an .xdf file (a character string defining a path to an existing .xdf file, or an RxXdfData object) and the output is to be returned as a data frame (outFile = NULL). In this case, behind the scenes, the output is written to a temporary .xdf file and rxDataStep is subsequently called to convert the output into a data frame. The maxRowsByCols argument is passed directly to the rxDataStep call, giving the user some control over the conversion. See rxDataStep for more details on the maxRowsByCols argument.
blocksPerRead
number of blocks to read for each chunk of data read from the data source. If inData and outFile are the same file, blocksPerRead must be 1.
reportProgress
integer value with options:
0: no progress is reported.
1: the number of processed rows is printed and updated.
2: rows processed and timings are reported.
3: rows processed and all timings are reported.
verbose
integer value. If 0, no additional output is printed. If 1, additional summary information is printed.
xdfCompressionLevel
integer in the range of -1 to 9. The higher the value, the greater the amount of compression, resulting in smaller files but a longer time to create them. If xdfCompressionLevel is set to 0, there will be no compression and files will be compatible with the 6.0 release of Revolution R Enterprise. If set to -1, a default level of compression will be used.
...
additional arguments to be passed directly to the Microsoft R Services Compute Engine.
Details
Factors are variables that represent categories. An example is a variable named "state" whose values are the levels "Alabama", "Alaska", ..., "Wyoming".
There are two parts to a factor variable:
1. a vector of N (number of observations) integer indexes with values in the range of 1:K, where K is the number of categories.
2. a vector of K strings (characters) that are used when the vector is displayed and in some other situations.
For instance, when state levels are alphabetical, all observations for which state == "Alabama" will have the index 1, state == "Washington" values correspond to index 47, and so on.
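The two parts of a factor can be inspected directly in base R. This short sketch uses the state.name vector from the 'datasets' package, which is already in alphabetical order; the variable names are illustrative only.

```r
# 'state.name' ships with base R's 'datasets' package and is alphabetical.
state <- factor(c("Washington", "Alabama", "Wyoming", "Alabama"),
                levels = state.name)

codes  <- as.integer(state)  # part 1: one integer index per observation
labels <- levels(state)      # part 2: the K level strings

print(codes)           # index of each observation within 'labels'
print(length(labels))  # K, the number of categories
print(labels[codes])   # indexing the labels by the codes recovers the data
```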
Recoding a factor means changing from one set of indices to another. For instance, if the levels for "state" are currently arranged in the order in which they were encountered when importing a .csv file, and it is desired to put them in alphabetical order, then it is necessary to change the index for every observation.
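A small base R sketch makes the index change concrete. It is an analogy for what sorting levels involves, not the rxFactors implementation; the sample values are made up for illustration.

```r
x <- c("Texas", "Ohio", "Alaska", "Ohio")

# Levels in the order encountered: Texas, Ohio, Alaska
asRead <- factor(x, levels = unique(x))

# Same data, levels re-sorted alphabetically: Alaska, Ohio, Texas
sorted <- factor(asRead, levels = sort(levels(asRead)))

print(as.integer(asRead))  # indices against the encounter-order levels
print(as.integer(sorted))  # indices against the alphabetical levels

# The displayed values are unchanged; only the underlying indices differ
print(identical(as.character(asRead), as.character(sorted)))
```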
If numeric data is converted to a factor, a maximum precision of 6 is used. So, for example, the values 7.123456 and 7.12346 would be placed in the same category.
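The effect of a six-digit precision can be illustrated in base R. The exact formatting rule used by rxFactors is not stated here, so formatting with "%.6g" (six significant digits) is an assumption chosen only to reproduce the example above.

```r
dbl <- c(7.123456, 7.12346, 4.123)

# Assumption: six significant digits, as produced by the "%.6g" format
asText <- sprintf("%.6g", dbl)
print(asText)

# The first two values collapse to the same string, hence the same category
fac <- factor(asText, levels = unique(asText))
print(nlevels(fac))
```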
To recode a categorical or factor variable into a continuous variable within a formula, use N(). To recode a continuous variable into a categorical or factor variable within a formula, use F(). See rxFormula.
To rename the levels of a factor variable in an .xdf file (without changing the levels themselves), use rxSetVarInfoXdf.
Value
If outFile is NULL, then a data frame is returned. Otherwise, the results are written to the specified outFile file and an RxXdfData object corresponding to the output file is returned invisibly.
Author(s)
Microsoft Corporation Microsoft Technical Support
See Also
rxFormula, rxSetVarInfoXdf, rxImport, rxDataStep.
Examples
###
# Example 1: Recoding levels in alphabetical order
###
# Use the 'warpbreaks' data frame found in the 'datasets' package
# Note that the 'tension' factor variable has levels that are not
# alphabetically ordered.
rxGetVarInfo( warpbreaks )
# Reorder all factor levels that are not in alphabetical order
recodedDF1 <- rxFactors(inData = warpbreaks, sortLevels = TRUE)
rxGetVarInfo( recodedDF1 )
# Specify that only 'tension' levels should be reordered alphabetically
recodedDF2 <- rxFactors(inData = warpbreaks, sortLevels = TRUE,
factorInfo = c("tension"))
rxGetVarInfo( recodedDF2 )
# Specify that only 'tension' levels should be reordered alphabetically using a list
recodedDF3 <- rxFactors(inData = warpbreaks,
factorInfo = list(tension = list(sortLevels = TRUE)))
rxGetVarInfo( recodedDF3 )
# write data frame to .xdf file and perform similar recoding
# but write the recoded factor to a new variable. Compare the
# original with the recoded factor.
inXDF <- file.path(tempdir(), "warpbreaks.xdf")
outXDF <- file.path(tempdir(), "warpbreaksRecoded.xdf")
rxDataStep(warpbreaks, outFile = inXDF, overwrite = TRUE)
outDS <- rxFactors(inData = inXDF, outFile = outXDF, overwrite = TRUE,
factorInfo = list(recodedTension = list(sortLevels = TRUE,
varName = "tension")))
DF <- rxDataStep(outDS)
rxGetVarInfo( DF )
# clean up
if (file.exists(inXDF)) unlink(inXDF)
if (file.exists(outXDF)) unlink(outXDF)
###
# Example 2: Recoding levels and indexes, saving recoding to a new factor variable
###
# Create an .xdf file with a factor variable named 'sex' with levels 'M' and 'F'
set.seed(100)
sex <- factor(sample(c("M","F"), size = 10, replace = TRUE), levels = c("M", "F"))
DF <- data.frame(sex = sex, score = rnorm(10))
DF[["sex"]]
XDF <- file.path(tempdir(), "sex.xdf")
XDF2 <- file.path(tempdir(), "newSex.xdf")
rxDataStep(DF, outFile = XDF, overwrite = TRUE)
# Assume that we change our minds and now wish to
# rename the levels to "Female" and "Male"
# Let us do the recoding and store the result into a new
# variable named "Gender" keeping the old variable in place.
outDS <- rxFactors(inData = XDF, outFile = XDF2, overwrite = TRUE,
factorInfo = list(Gender = list(newLevels = c(Female = "F", Male = "M"),
varName = "sex")))
newDF <- rxDataStep(outDS)
print(newDF)
# clean up
if (file.exists(XDF)) unlink(XDF)
if (file.exists(XDF2)) unlink(XDF2)
###
# Example 3: Combining subsets of factor levels into single levels
###
# Create a data set that contains a factor variable 'Month'
# Note that the levels are not in alphabetical order.
set.seed(100)
DF <- data.frame(Month = factor(sample(month.name, size = 20, replace = TRUE),
levels = rev(month.name)))
# Recode the months into quarters and store result into new variable named "Quarter"
recodedDF <- rxFactors(inData = DF,
factorInfo = list(Quarter = list(newLevels = list(Q1 = month.name[1:3],
Q2 = month.name[4:6],
Q3 = month.name[7:9],
Q4 = month.name[10:12]),
varName = "Month")))
head(recodedDF)
recodedDF$Quarter
###
# Example 4: Coding and recoding combinations using a single factorInfo list
###
set.seed(100)
size <- 10
months <- factor(sample(month.name, size = size, replace = TRUE), levels = rev(month.name))
states <- factor(sample(state.name, size = size, replace = TRUE), levels = state.name)
animalFarm <- c("cow","horse","pig","goat","chicken", "dog", "cat")
animals <- factor(sample(animalFarm, size = size, replace = TRUE), levels = animalFarm)
values <- sample.int(100, size = size, replace = TRUE)
dblValues <- c(1, 2.1, 3.12, 4.123, 5.1234, 6.12345, 7.123456, 7.12346, 81234.56789, 91234567.8)
DF <- data.frame(Month = months, State = states, Animal = animals, VarInt = values,
VarDbl = dblValues, NotUsed1 = seq(size), NotUsed2 = rev(seq(size)))
factorInfo <- list(
# Convert months to quarters
Quarter = list(newLevels = list(Q1 = month.name[1:3], Q2 = month.name[4:6],
Q3 = month.name[7:9], Q4 = month.name[10:12]),
varName = "Month"),
# Sort animal levels
Animal = list(sortLevels = TRUE),
# Convert integer data to factor and do not sort levels
VarIntFac = list(varName = "VarInt", sortLevels = FALSE),
# Convert double data to factor; it will use a precision up to 6
VarDblFac = list(varName = "VarDbl"),
# In-place arbitrary grouping of state names using indexMap
StateSide = list(newLevels = c(LeftState = "1", RightState = "2"),
indexMap = c(rep(1, 25), rep(2, 25)),
varName = "State")
)
rxFactors(DF, factorInfo)
###
# Example 5: Using 'newLevels' to rename, reorder, or collapse existing factor levels.
# All of these examples make use of the iris data set, which contains levels
# "setosa", "versicolor", and "virginica", in that order.
###
# Renaming factor levels:
#
# "setosa" to "Seto"
# "versicolor" to "Vers"
# "virginica" to "Virg"
newLevels <- list(Seto = "setosa", Vers = "versicolor", Virg = "virginica")
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels)))$Species
# Reordering:
newLevels <- c("versicolor", "setosa", "virginica")
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels)))$Species
# Collapsing: order does matter here, so the resulting order of the levels will
# be "V" then "S". The 'sortLevels' argument is a quick means of alphabetically
# sorting the resultant level names.
newLevels <- list(V = "setosa", S = c("versicolor", "virginica"))
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels)))$Species
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels, sortLevels = TRUE)))$Species
# Subset collapsing with renaming: accomplish with the use of 'otherLevel'
newLevels <- list(S = "setosa")
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels, otherLevel = "otherSpecies")))$Species
# Superset specification: adding new species for a future study
newLevels <- c("setosa", "versicolor", "virginica", "pumila", "narbuti", "camillae")
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels)))$Species