microsoftml.rx_predict:使用 Microsoft 机器学习模型进行评分

使用情况

microsoftml.rx_predict(model,
    data: typing.Union[revoscalepy.datasource.RxDataSource.RxDataSource,
    pandas.core.frame.DataFrame],
    output_data: typing.Union[revoscalepy.datasource.RxDataSource.RxDataSource,
    str] = None, write_model_vars: bool = False,
    extra_vars_to_write: list = None, suffix: str = None,
    overwrite: bool = False, data_threads: int = None,
    blocks_per_read: int = None, report_progress: int = None,
    verbose: int = 1,
    compute_context: revoscalepy.computecontext.RxComputeContext.RxComputeContext = None,
    **kargs)

说明

结合使用经过训练的 Microsoft ML 机器学习模型和 arevoscalepydata 源,在数据帧或 revoscalepy 数据源中报告每个实例的评分结果。

详细信息

默认情况下,输出中报告以下项:针对二元分类器的三个变量的评分:PredictedLabel、Score 和 Probability;Score 用于 oneClassSvm 和回归分类器;PredictedLabel 用于多类分类器,以及 Score 前面预置的每个类别的变量。

参数

模型

从 microsoftml 模型返回的模型信息对象。 例如,从 rx_fast_treesrx_logistic_regression 返回的对象。

data

revoscalepy 数据源对象、数据帧或指向 .xdf 文件的路径。

output_data

输出文本或 xdf 文件名或其中具有写入功能的 RxDataSource,用于存储转换后的数据。 如果为 None,则返回数据帧。 默认值为“None”。

write_model_vars

如果为 True,则除了评分变量外,还会将模型中的变量写入到输出数据集。 如果输入数据集中的变量在模型中进行了转换,则转换后的变量也会包括在内。 默认值是 False

extra_vars_to_write

要包括在 output_data 中的 None 或输入数据中的其他变量名的字符向量。 如果 write_model_varsTrue,则还包括模型变量。 默认值是 None

suffix

一个字符串,指定要追加到创建的评分变量的后缀,或指定 None 表示没有后缀。 默认值是 None

overwrite

如果为 True,则覆盖现有的 output_data;如果为 False,则不覆盖现有的 output_data。 默认值是 False

data_threads

一个整数,指定数据管道中所需的并行度。 如果为 None,则使用的线程数在内部确定。 默认值为“None”。

blocks_per_read

为从数据源读取的每个数据块指定要读取的块数。

report_progress

一个整数值,指定行处理进度的报告级别:

  • 0:不报告进度。

  • 1:打印并更新已处理的行数。

  • 2:报告已处理的行数和计时。

  • 3:报告已处理的行数和所有计时。

默认值是 1

verbose

一个整数值,指定需要的输出量。 如果为 0,则计算期间不会打印详细输出。 从 14 的整数值表示提供的信息量逐步增加。 默认值是 1

compute_context

设置执行计算的上下文,使用有效的 revoscalepy.RxComputeContext 指定。 目前支持本地和 revoscalepy.RxInSqlServer 计算上下文。

kargs

发送到计算引擎的其他参数。

返回

表示创建的输出数据的数据帧或 revoscalepy.RxDataSource 对象。 默认情况下,评分二元分类器的输出包含三个变量:PredictedLabelScoreProbabilityrx_oneclass_svm 和回归包含一个变量:Score;多类分类器包含 PredictedLabel 加上 Score 前面预置的每个类别的变量。 如果提供了 suffix,则会将其添加到这些输出变量名称的末尾。

请参阅

rx_featurizerevoscalepy.rx_data_steprevoscalepy.rx_import

二元分类示例

'''
Binary Classification.
'''
import numpy
import pandas
from microsoftml import rx_fast_linear, rx_predict
from revoscalepy.etl.RxDataStep import rx_data_step
from microsoftml.datasets.datasets import get_dataset

infert = get_dataset("infert")

import sklearn
if sklearn.__version__ < "0.18":
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split

infertdf = infert.as_df()
infertdf["isCase"] = infertdf.case == 1
data_train, data_test, y_train, y_test = train_test_split(infertdf, infertdf.isCase)

forest_model = rx_fast_linear(
    formula=" isCase ~ age + parity + education + spontaneous + induced ",
    data=data_train)
    
# RuntimeError: The type (RxTextData) for file is not supported.
score_ds = rx_predict(forest_model, data=data_test,
                     extra_vars_to_write=["isCase", "Score"])
                     
# Print the first five rows
print(rx_data_step(score_ds, number_rows_read=5))

输出:

Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Beginning processing data.
Rows Read: 186, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 186, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 186, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Using 2 threads to train.
Automatically choosing a check frequency of 2.
Auto-tuning parameters: maxIterations = 8064.
Auto-tuning parameters: L2 = 2.666837E-05.
Auto-tuning parameters: L1Threshold (L1/L2) = 0.
Using best model from iteration 590.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.6058289
Elapsed time: 00:00:00.0084728
Beginning processing data.
Rows Read: 62, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0302359
Finished writing 62 rows.
Writing completed.
Rows Read: 5, Total Rows Processed: 5, Total Chunk Time: 0.001 seconds 
  isCase PredictedLabel     Score  Probability
0  False           True  0.576775     0.640325
1  False          False -2.929549     0.050712
2   True          False -2.370090     0.085482
3  False          False -1.700105     0.154452
4  False          False -0.110981     0.472283

回归示例

'''
Regression.
'''
import numpy
import pandas
from microsoftml import rx_fast_trees, rx_predict
from revoscalepy.etl.RxDataStep import rx_data_step
from microsoftml.datasets.datasets import get_dataset

airquality = get_dataset("airquality")

import sklearn
if sklearn.__version__ < "0.18":
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split

airquality = airquality.as_df()


######################################################################
# Estimate a regression fast forest
# Use the built-in data set 'airquality' to create test and train data

df = airquality[airquality.Ozone.notnull()]
df["Ozone"] = df.Ozone.astype(float)

data_train, data_test, y_train, y_test = train_test_split(df, df.Ozone)

airFormula = " Ozone ~ Solar_R + Wind + Temp "

# Regression Fast Forest for train data
ff_reg = rx_fast_trees(airFormula, method="regression", data=data_train)

# Put score and model variables in data frame
score_df = rx_predict(ff_reg, data=data_test, write_model_vars=True)
print(score_df.head())

# Plot actual versus predicted values with smoothed line
# Supported in the next version.
# rx_line_plot(" Score ~ Ozone ", type=["p", "smooth"], data=score_df)

输出:

'unbalanced_sets' ignored for method 'regression'
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Beginning processing data.
Rows Read: 87, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Warning: Skipped 4 instances with missing features during training
Processed 83 instances
Binning and forming Feature objects
Reserved memory for tree learner: 22620 bytes
Starting to train ...
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.0390764
Elapsed time: 00:00:00.0080750
Beginning processing data.
Rows Read: 29, Read Time: 0.001, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0221875
Finished writing 29 rows.
Writing completed.
   Solar_R  Wind  Temp      Score
0    290.0   9.2  66.0  33.195541
1    259.0  15.5  77.0  20.906796
2    276.0   5.1  88.0  76.594643
3    139.0  10.3  81.0  31.668842
4    236.0  14.9  81.0  43.590839