rxCreateColInfo: Function to generate a 'colInfo' list from a data source
Generates a colInfo list from a data source that can be used in rxImport or an RxDataSource constructor.
rxCreateColInfo(data, includeLowHigh = FALSE, factorsOnly = FALSE,
varsToKeep = NULL, sortLevels = FALSE, computeInfo = TRUE,
useFactorIndex = FALSE)
data: An RxDataSource object, a character string containing an .xdf file name, or a data frame. An object returned from rxGetVarInfo is also supported.
includeLowHigh: If TRUE, the low/high values will be included in the colInfo object. Note that these values will override any actual low/high values in the data set if the colInfo object is applied to a different data source.
factorsOnly: If TRUE, only column information for factor variables will be included in the output.
varsToKeep: NULL to include all variables, or a character vector of variables to include.
sortLevels: If TRUE, factor levels will be sorted. If factor levels represent integers, they will be put in numeric order.
computeInfo: If TRUE, a pass through the data will be taken for non-xdf data sources in order to compute factor levels and low/high values.
useFactorIndex: If TRUE, the factorIndex variable type will be used instead of factor.
This function can be used to ensure consistent factor levels when importing a series of text files to xdf. It is also useful for repeated analysis on non-xdf data sources.
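To see why fixed factor levels matter, the following base R sketch may help. It does not use RevoScaleR, and the vectors `year1` and `year2` are made-up stand-ins for the same column read from two different files. It mimics levels assigned in order of first appearance, then rebuilds the second factor with the first factor's levels, which is roughly the effect of applying a shared colInfo on import.

```r
# Hypothetical values of the same column as read from two different files
year1 <- c("5", "4", "7", "4")
year2 <- c("7", "5", "4", "5")

# Levels assigned in order of first appearance
f1 <- factor(year1, levels = unique(year1))
f2 <- factor(year2, levels = unique(year2))

# Same set of values, but the level order differs between the two factors
identical(levels(f1), levels(f2))

# Reusing the first factor's levels makes the underlying integer codes comparable
f2Fixed <- factor(year2, levels = levels(f1))
identical(levels(f1), levels(f2Fixed))
```

A model fitted against `f1` maps coefficients to level positions, so only `f2Fixed`, not `f2`, lines up with it.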
A colInfo list that can be used as input for rxImport and in data sources such as RxTextData and RxSqlServerData.
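As a rough picture of the returned object, a colInfo list is a named list with one entry per variable, where each entry is itself a list of settings. The sketch below builds one by hand in base R. The element names `type`, `levels`, `low`, and `high` follow the colInfo argument as described for RxTextData; the variables and values are invented for illustration and are not actual output from rxCreateColInfo.

```r
# Hand-built stand-in for a colInfo list (values are illustrative only)
exampleColInfo <- list(
  yearsEmploy = list(type = "factor", levels = as.character(0:3)),
  creditScore = list(type = "integer", low = 500L, high = 800L)
)

# Each entry is addressed by variable name
names(exampleColInfo)
exampleColInfo$yearsEmploy$levels
```

Because the result is an ordinary list, it can be inspected or edited with standard list operations before being passed on to a data source.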
Microsoft Corporation Microsoft Technical Support
RxDataSource-class, RxTextData, RxSqlServerData, RxSpssData, RxSasData, RxOdbcData, RxTeradata, RxXdfData, rxImport.
# Get the low/high values and factor levels before using a data source
# for import or analysis
# Create a text data source, specifying that 'yearsEmploy' should be a factor
mort1 <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall2000.csv")
mort1DS <- RxTextData(file = mort1, colClasses = c(yearsEmploy = "factor", default = "logical"))
# By default, rxCreateColInfo will make a pass through the data to compute factor levels
# and low/high values. We'll also request that the levels be sorted
mortColInfo <- rxCreateColInfo(data = mort1DS, includeLowHigh = TRUE, sortLevels = TRUE)
# Re-create the data source, now using the computed colInfo
mort1DS <- RxTextData(file = mort1, colInfo = mortColInfo)
# Import the data
mort1DF <- rxImport(mort1DS)
levels(mort1DF$yearsEmploy)
# Or use the text data source directly in an analysis
# (not needing a pass through the data to compute the factor levels)
logitObj <- rxLogit(default ~ yearsEmploy, data = mort1DS)
##############################################################################################
# Train a model on one imported data set, then score using another
# Train a model on the first year of the data, importing it from text to a data frame
mort1 <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall2000.csv")
mort1DS <- RxTextData(file = mort1, colClasses = c(yearsEmploy = "factor", default = "logical"))
# Since we haven't specified factor levels, they will be created 'first come, first served',
# that is, in the order in which the values are first encountered in the data
mort1DF <- rxImport(mort1DS)
levels(mort1DF$yearsEmploy)
# Estimate a logit model
logitObj <- rxLogit(default ~ yearsEmploy, data = mort1DF)
# Now import the second year of data
mort2 <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall2001.csv")
mort2DS <- RxTextData(file = mort2, colClasses = c(yearsEmploy = "factor", default = "logical"))
mort2DF <- rxImport(mort2DS)
# The levels are in a different order
levels(mort2DF$yearsEmploy)
# If we try to use the model estimated from the first data set to predict on the second,
# we will get an error:
# predOut <- rxPredict(logitObj, data = mort2DF)
# ERROR: order of factor levels in the data are inconsistent with
# the order of the model coefficients
# Instead, we can extract the colInfo from the first data set
mortColInfo <- rxCreateColInfo(data = mort1DF)
# And use it when importing the second
mort2DS <- RxTextData(file = mort2, colInfo = mortColInfo)
mort2DF <- rxImport(mort2DS)
predOut <- rxPredict(logitObj, data = mort2DF)
head(predOut)