RxMissingValues
revoscalepy.RxMissingValues
Description
revoscalepy package uses Pandas dataframe as the abstraction to hold the data and manipulate it. Pandas dataframe in turn uses NumPy ndarray to hold homogeneous data efficiently and provides fast access and manipulation functions on the ndarray. Python and NumPy ndarray in particular only has numpy.NaN value for float types to indicate missing values. For other primitive datatypes, like int, there isn’t a missing value notation. If one uses Python’s None object to represent missing value in a series of data, say int, the whole ndarray’s dtype changes to object. This makes dealing with data with missing values quite inefficient during processing.
This class provides missing values for various NumPy data types which one can use to mark missing values in a sequence of data in ndarray.
It provides missing value for following types:
int8() for numpy.int8 with value of numpy.iinfo(np.int8).min
uint8() for numpy.uint8 with value of numpy.iinfo(np.uint8).max
int16() for numpy.int16 with value of numpy.iinfo(np.int16).min
uint16() for numpy.uint16 with value of numpy.iinfo(np.uint16).max
int32() for numpy.int32 with value of numpy.iinfo(np.int32).min
uint32() for numpy.uint32 with value of numpy.iinfo(np.uint32).max
int64() for numpy.int64 with value of numpy.iinfo(np.int64).min
uint64() for numpy.uint64 with value of numpy.iinfo(np.uint64).max
float16() for numpy.float16 with value of numpy.NaN
float32() for numpy.float32 with value of numpy.NaN
float64() for numpy.float64 with value of numpy.NaN
Example
## Not run:
from revoscalepy import RxXdfData, RxSqlServerData, RxInSqlServer
from revoscalepy import RxOptions, rx_logit, RxMissingValues
from revoscalepy.functions.RxLogit import RxLogitResults
import os
def transform_late (data, context):
import numpy
#
# create a new 'Late' column based on 'ArrDelay' int32 column
#
data['Late'] = data['ArrDelay'] > 15
#
# replace all 'Late' with NaN for 'NA' or missing values in ArrDelay
#
data.loc[data['ArrDelay'] == RxMissingValues.int32(), ('Late')] = numpy.NaN
return data
transformVars = ["ArrDelay"]
sample_data_path = RxOptions.get_option("sampleDataDir")
xdf_file = os.path.join(sample_data_path, "AirlineDemoSmall.xdf")
data = RxXdfData(xdf_file)
model = rx_logit("Late~CRSDepTime + DayOfWeek", data=data, transform_function=transform_late, transform_variables=transformVars)
assert isinstance(model, RxLogitResults)
## End(Not run)