microsoftml.categorical_hash: Hashes and converts a text column into categories

Usage

microsoftml.categorical_hash(cols: [str, dict, list],
    hash_bits: int = 16, seed: int = 314489979,
    ordered: bool = True, invert_hash: int = 0,
    output_kind: ['Bag', 'Ind', 'Key', 'Bin'] = 'Bag', **kargs)

Description

Categorical hash transform that can be performed on data before training a model.

Details

categorical_hash converts a categorical value into an indicator array by hashing the value and using the hash as an index in the bag. If the input column is a vector, a single indicator bag is returned for it. categorical_hash does not currently support handling factor data.
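The idea above can be sketched in plain Python. This is only a conceptual illustration: the slot computation below uses Python's built-in `hash` for simplicity, whereas the actual transform uses its own seeded hash function internally.

```python
# Conceptual sketch of categorical hashing: each value is hashed into
# one of 2**hash_bits slots, and the slot index identifies the category.

def hash_slot(value, hash_bits=16, seed=314489979):
    # Illustration only: Python's built-in hash stands in for the
    # transform's internal seeded hash.
    num_slots = 2 ** hash_bits
    return hash((seed, value)) % num_slots

def indicator_bag(values, hash_bits=16):
    # "Bag"-style output for a vector input: a count per hashed slot.
    bag = {}
    for v in values:
        slot = hash_slot(v, hash_bits)
        bag[slot] = bag.get(slot, 0) + 1
    return bag

# Two occurrences of "red" land in the same slot and accumulate a count of 2.
print(indicator_bag(["red", "blue", "red"], hash_bits=4))
```

Because distinct values can hash to the same slot, collisions are possible; larger `hash_bits` values make them less likely at the cost of a larger feature space.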

Arguments

cols

A character string or list of variable names to transform. If dict, the keys represent the names of new variables to be created.

hash_bits

An integer specifying the number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.
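The hashed feature space has 2**hash_bits slots, so this parameter directly controls model size. With the default of 16 that is 65,536 slots, which is consistent with the "num vars: 65537" (65,536 hashed weights plus a bias) reported in the example output further down:

```python
# The number of feature slots grows exponentially with hash_bits.
for bits in (1, 8, 16, 30):
    print(bits, "->", 2 ** bits, "slots")
```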

seed

An integer specifying the hashing seed. The default value is 314489979.

ordered

True to include the position of each term in the hash. Otherwise, False. The default value is True.

invert_hash

An integer specifying the limit on the number of keys that can be used to generate the slot name. 0 means no invert hashing; -1 means no limit. While a zero value gives better performance, a non-zero value is needed to get meaningful coefficient names. The default value is 0.

output_kind

A character string that specifies the kind of output.

  • "Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.

  • "Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.

  • "Key": Outputs an index. The output is an integer ID (between 1 and the number of categories in the dictionary) of the category.

  • "Bin": Outputs a vector which is the binary representation of the category.

The default value is "Bag".
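A toy illustration of how the four output kinds differ, using a fixed three-category dictionary and skipping the hashing step for clarity (the names `key`, `ind`, `bag`, and `bin_repr` below are hypothetical helpers, not part of the microsoftml API):

```python
# Dictionary of three categories with integer IDs 1..3 ("Key" output).
categories = ["A", "B", "C"]
key = {c: i + 1 for i, c in enumerate(categories)}

def ind(c):
    # "Ind": one indicator (one-hot) vector per value.
    return [1 if c == cat else 0 for cat in categories]

def bag(values):
    # "Bag": element-wise sum of the indicator vectors, i.e. counts.
    counts = [0] * len(categories)
    for v in values:
        counts[key[v] - 1] += 1
    return counts

def bin_repr(c, width=2):
    # "Bin": binary representation of the category's integer ID.
    return format(key[c], "0{}b".format(width))

print(key["B"])              # 2
print(ind("B"))              # [0, 1, 0]
print(bag(["A", "B", "A"]))  # [2, 1, 0]
print(bin_repr("C"))         # 11
```

Note that for a single-category input, `bag` and `ind` produce the same vector, which is why the "Bag" description above calls them equivalent in that case.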

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical

Example

'''
Example on rx_logistic_regression and categorical_hash.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical_hash, rx_predict
from microsoftml.datasets.datasets import get_dataset

movie_reviews = get_dataset("movie_reviews")

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))


# Use a categorical hash transform.
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical_hash(cols=dict(reviewCat="review"))])
                
# Weights are similar to categorical.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 65537
improvement criterion: Mean Improvement
L1 regularization selected 3 of 65537 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.1209392
Elapsed time: 00:00:00.0190134
OrderedDict([('(Bias)', 0.2132447361946106), ('f1783', -0.7939924597740173), ('f38537', 0.1968022584915161)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0284223
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213245     0.553110
1       I hate it          False -0.580748     0.358761
2         Love it           True  0.213245     0.553110
3  Really like it           True  0.213245     0.553110
4       I hate it          False -0.580748     0.358761