microsoftml.categorical_hash: Hashes and converts a text column into categories

Usage

microsoftml.categorical_hash(cols: [str, dict, list],
    hash_bits: int = 16, seed: int = 314489979,
    ordered: bool = True, invert_hash: int = 0,
    output_kind: ['Bag', 'Ind', 'Key', 'Bin'] = 'Bag', **kargs)

Description

Categorical hash transform that can be performed on data before training a model.

Details

categorical_hash converts a categorical value into an indicator array by hashing the value and using the hash as an index in the bag. If the input column is a vector, a single indicator bag is returned for it. categorical_hash does not currently support handling factor data.
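The idea above can be sketched in plain Python. This is only a conceptual illustration: the slot computation below uses Python's built-in `hash` for simplicity, whereas the actual transform uses its own seeded hash function internally.

```python
# Conceptual sketch of categorical hashing: each value is hashed into
# one of 2**hash_bits slots, and the slot index identifies the category.

def hash_slot(value, hash_bits=16, seed=314489979):
    # Illustration only: Python's built-in hash stands in for the
    # transform's internal seeded hash.
    num_slots = 2 ** hash_bits
    return hash((seed, value)) % num_slots

def indicator_bag(values, hash_bits=16):
    # "Bag"-style output for a vector input: a count per hashed slot.
    bag = {}
    for v in values:
        slot = hash_slot(v, hash_bits)
        bag[slot] = bag.get(slot, 0) + 1
    return bag

# Two occurrences of "red" land in the same slot and accumulate a count of 2.
print(indicator_bag(["red", "blue", "red"], hash_bits=4))
```

Because distinct values can hash to the same slot, collisions are possible; larger `hash_bits` values make them less likely at the cost of a larger feature space.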

Arguments

cols

A character string or list of variable names to transform. If dict, the keys represent the names of new variables to be created.

hash_bits

An integer specifying the number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.
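The hashed feature space has 2**hash_bits slots, so this parameter directly controls model size. With the default of 16 that is 65,536 slots, which is consistent with the "num vars: 65537" (65,536 hashed weights plus a bias) reported in the example output further down:

```python
# The number of feature slots grows exponentially with hash_bits.
for bits in (1, 8, 16, 30):
    print(bits, "->", 2 ** bits, "slots")
```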

seed

An integer specifying the hashing seed. The default value is 314489979.

ordered

True to include the position of each term in the hash. Otherwise, False. The default value is True.

invert_hash

An integer specifying the limit on the number of keys that can be used to generate the slot name. 0 means no invert hashing; -1 means no limit. While a zero value gives better performance, a non-zero value is needed to get meaningful coefficient names. The default value is 0.

output_kind

A character string that specifies the kind of output.

  • "Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.

  • "Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.

  • "Key": Outputs an index. The output is an integer ID (between 1 and the number of categories in the dictionary) of the category.

  • "Bin": Outputs a vector which is the binary representation of the category.

The default value is "Bag".
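A toy illustration of how the four output kinds differ, using a fixed three-category dictionary and skipping the hashing step for clarity (the names `key`, `ind`, `bag`, and `bin_repr` below are hypothetical helpers, not part of the microsoftml API):

```python
# Dictionary of three categories with integer IDs 1..3 ("Key" output).
categories = ["A", "B", "C"]
key = {c: i + 1 for i, c in enumerate(categories)}

def ind(c):
    # "Ind": one indicator (one-hot) vector per value.
    return [1 if c == cat else 0 for cat in categories]

def bag(values):
    # "Bag": element-wise sum of the indicator vectors, i.e. counts.
    counts = [0] * len(categories)
    for v in values:
        counts[key[v] - 1] += 1
    return counts

def bin_repr(c, width=2):
    # "Bin": binary representation of the category's integer ID.
    return format(key[c], "0{}b".format(width))

print(key["B"])              # 2
print(ind("B"))              # [0, 1, 0]
print(bag(["A", "B", "A"]))  # [2, 1, 0]
print(bin_repr("C"))         # 11
```

Note that for a single-category input, `bag` and `ind` produce the same vector, which is why the "Bag" description above calls them equivalent in that case.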

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical

Example

'''
Example on rx_logistic_regression and categorical_hash.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical_hash, rx_predict
from microsoftml.datasets.datasets import get_dataset

movie_reviews = get_dataset("movie_reviews")

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))


# Use a categorical hash transform.
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical_hash(cols=dict(reviewCat="review"))])
                
# Weights are similar to categorical.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 65537
improvement criterion: Mean Improvement
L1 regularization selected 3 of 65537 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.1209392
Elapsed time: 00:00:00.0190134
OrderedDict([('(Bias)', 0.2132447361946106), ('f1783', -0.7939924597740173), ('f38537', 0.1968022584915161)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0284223
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213245     0.553110
1       I hate it          False -0.580748     0.358761
2         Love it           True  0.213245     0.553110
3  Really like it           True  0.213245     0.553110
4       I hate it          False -0.580748     0.358761