microsoftml.rx_featurize(data: typing.Union[revoscalepy.datasource.RxDataSource.RxDataSource,
pandas.core.frame.DataFrame],
output_data: typing.Union[revoscalepy.datasource.RxDataSource.RxDataSource,
str] = None, overwrite: bool = False,
data_threads: int = None, random_seed: int = None,
max_slots: int = 5000, ml_transforms: list = None,
ml_transform_vars: list = None, row_selection: str = None,
transforms: dict = None, transform_objects: dict = None,
transform_function: str = None,
transform_variables: list = None,
transform_packages: list = None,
transform_environment: dict = None, blocks_per_read: int = None,
report_progress: int = None, verbose: int = 1,
compute_context: revoscalepy.computecontext.RxComputeContext.RxComputeContext = None)
Transforms data from an input data set to an output data set.
data
A revoscalepy data source object, a data frame, or the path to a .xdf file.
output_data
Output text or xdf file name, or an RxDataSource with write capabilities, in which to store transformed data. If None, a data frame is returned. The default value is None.
overwrite
If True, an existing output_data is overwritten; if False, an existing output_data is not overwritten. The default value is False.
data_threads
An integer specifying the desired degree of parallelism in the data pipeline. If None, the number of threads used is determined internally. The default value is None.
random_seed
Specifies the random seed. The default value is None.
max_slots
The maximum number of slots to return for vector-valued columns (<=0 to return all). The default value is 5000.
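The max_slots rule can be illustrated with a small plain-Python sketch. The limit_slots helper is hypothetical and is not part of microsoftml; it also assumes that the first max_slots entries are the ones kept, which the description above does not state.

```python
def limit_slots(values, max_slots=5000):
    """Return at most max_slots entries; a value <= 0 returns everything.

    Assumption: the leading entries are kept. This only mirrors the rule
    described for max_slots, not the library's internal behavior.
    """
    values = list(values)
    if max_slots <= 0:
        return values
    return values[:max_slots]


vector = [0.0, 1.0, 0.0, 0.0]
print(limit_slots(vector, max_slots=2))
print(limit_slots(vector, max_slots=0))
```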
ml_transforms
Specifies a list of MicrosoftML transforms to be performed on the data before training, or None if no transforms are to be performed. See featurize_text, categorical, and categorical_hash for transformations that are supported. These transformations are performed after any specified Python transformations. The default value is None.
ml_transform_vars
Specifies a list of variable names to be used in ml_transforms, or None if none are to be used. The default value is None.
row_selection
NOT SUPPORTED. Specifies the rows (observations) from the data set that are to be used by the model, either with the name of a logical variable from the data set (in quotes) or with a logical expression using variables in the data set. For example:
row_selection = "old" only uses observations in which the value of the variable old is True.
row_selection = (age > 20) & (age < 65) & (log(income) > 10) only uses observations in which the value of the age variable is between 20 and 65 and the value of the log of the income variable is greater than 10.
The row selection is performed after processing any data transformations (see the arguments transforms or transform_function). As with all expressions, row_selection can be defined outside of the function call using the expression function.
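Because row_selection is marked NOT SUPPORTED, one possible workaround is to filter the rows before the data is passed to rx_featurize. The sketch below applies the second condition from the examples above in plain Python. The sample records and the keep_row helper are hypothetical; with a pandas data frame the same condition could be expressed as a boolean mask. Filtering beforehand also differs from the documented order, in which row selection runs after the data transformations.

```python
import math

# Hypothetical sample records; the column names follow the row_selection example.
records = [
    {"age": 18, "income": 30000.0},
    {"age": 35, "income": 50000.0},
    {"age": 50, "income": 15000.0},
    {"age": 70, "income": 90000.0},
]


def keep_row(row):
    """Same condition as (age > 20) & (age < 65) & (log(income) > 10)."""
    return 20 < row["age"] < 65 and math.log(row["income"]) > 10


# Keep only the rows that satisfy the condition.
selected = [row for row in records if keep_row(row)]
print(selected)
```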
transforms
NOT SUPPORTED. An expression that represents the first round of variable transformations. As with all expressions, transforms (or row_selection) can be defined outside of the function call using the expression function. The default value is None.
transform_objects
NOT SUPPORTED. A named list that contains objects that can be referenced by transforms, transform_function, and row_selection. The default value is None.
transform_function
The variable transformation function. The default value is None.
transform_variables
A list of the input data set variable names needed for the transformation function. The default value is None.
transform_packages
NOT SUPPORTED. A list specifying additional Python packages (outside of those specified in RxOptions.get_option("transform_packages")) to be made available and preloaded for use in variable transformation functions, for example those explicitly defined in revoscalepy functions via their transforms and transform_function arguments, or those defined implicitly via their formula or row_selection arguments. The transform_packages argument may also be None, indicating that no packages outside RxOptions.get_option("transform_packages") are preloaded.
transform_environment
NOT SUPPORTED. A user-defined environment to serve as a parent to all environments developed internally and used for variable data transformation. If transform_environment = None, a new "hash" environment with parent revoscalepy.baseenv is used instead. The default value is None.
blocks_per_read
Specifies the number of blocks to read for each chunk of data read from the data source.
report_progress
An integer value that specifies the level of reporting on the row processing progress:
0: no progress is reported.
1: the number of processed rows is printed and updated.
2: rows processed and timings are reported.
3: rows processed and all timings are reported.
The default value is 1.
verbose
An integer value that specifies the amount of output wanted. If 0, no verbose output is printed during calculations. Integer values from 1 to 4 provide increasing amounts of information. The default value is 1.
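The two reporting arguments take small integer ranges (0 to 3 for report_progress, 0 to 4 for verbose), so it can be convenient to validate them in one place before a call. The reporting_kwargs helper below is a hypothetical convenience, not part of microsoftml; it only builds a dictionary of keyword arguments.

```python
def reporting_kwargs(report_progress=1, verbose=1):
    """Build keyword arguments for the two reporting options.

    Hypothetical helper: it checks the ranges described above and returns
    a dict that can be unpacked into a call with **.
    """
    if report_progress not in (0, 1, 2, 3):
        raise ValueError("report_progress must be 0, 1, 2 or 3")
    if verbose not in (0, 1, 2, 3, 4):
        raise ValueError("verbose must be an integer from 0 to 4")
    return {"report_progress": report_progress, "verbose": verbose}


# Request no progress reporting and no verbose output.
quiet = reporting_kwargs(report_progress=0, verbose=0)
print(quiet)
```

The resulting dictionary could then be unpacked into a call, for example rx_featurize(data=..., ml_transforms=[...], **quiet).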
compute_context
Sets the context in which computations are executed, specified with a valid revoscalepy.RxComputeContext. Currently local and revoscalepy.RxInSqlServer compute contexts are supported.
Returns
A data frame or a revoscalepy.RxDataSource object representing the created output data.
See also
rx_predict, revoscalepy.rx_data_step, revoscalepy.rx_import.
'''
Example with rx_featurize.
'''
import numpy
import pandas
from microsoftml import rx_featurize, categorical

# rx_featurize basically allows you to access data from the MicrosoftML transforms
# In this example we'll look at getting the output of the categorical transform

# Create the data
categorical_data = pandas.DataFrame(
    data=dict(places_visited=["London", "Brunei", "London", "Paris", "Seria"]),
    dtype="category")
print(categorical_data)

# Invoke the categorical transform
categorized = rx_featurize(
    data=categorical_data,
    ml_transforms=[categorical(cols=dict(xdatacat="places_visited"))])

# Now let's look at the data
print(categorized)
Output:
  places_visited
0         London
1         Brunei
2         London
3          Paris
4          Seria
Beginning processing data.
Rows Read: 5, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 5, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0521300
Finished writing 5 rows.
Writing completed.
  places_visited  xdatacat.London  xdatacat.Brunei  xdatacat.Paris  \
0         London              1.0              0.0             0.0
1         Brunei              0.0              1.0             0.0
2         London              1.0              0.0             0.0
3          Paris              0.0              0.0             1.0
4          Seria              0.0              0.0             0.0

   xdatacat.Seria
0             0.0
1             0.0
2             0.0
3             0.0
4             1.0
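The output above contains one indicator column per distinct value of places_visited, named with the xdatacat prefix. As a rough illustration of that layout only, the same indicator values can be reproduced in plain Python. This sketch does not call microsoftml and is not how the categorical transform is implemented; the variable names are hypothetical.

```python
places = ["London", "Brunei", "London", "Paris", "Seria"]

# Distinct values in order of first appearance, matching the column order above.
levels = list(dict.fromkeys(places))

# One indicator column per level, named with the "xdatacat." prefix.
rows = [
    {f"xdatacat.{level}": 1.0 if place == level else 0.0 for level in levels}
    for place in places
]

for place, row in zip(places, rows):
    print(place, row)
```

Each row has exactly one indicator set to 1.0, the one for that row's own value.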