microsoftml.categorical: Converts a text column into categories

Usage

microsoftml.categorical(cols: [str, dict, list], output_kind: ['Bag', 'Ind',
    'Key', 'Bin'] = 'Ind', max_num_terms: int = 1000000,
    terms: int = None, sort: ['Occurrence', 'Value'] = 'Occurrence',
    text_key_values: bool = False, **kargs)

Description

Categorical transform that can be performed on data before training a model.

Details

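The categorical transform passes through a data set, operating on text columns, to build a dictionary of categories. For each row, the entire text string appearing in the input column is defined as a category. The output of the categorical transform is an indicator vector. Each slot in this vector corresponds to a category in the dictionary, so its length is the size of the built dictionary. The categorical transform can be applied to one or more columns, in which case it builds a separate dictionary for each column that it is applied to.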

categorical does not currently support handling factor data.
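The dictionary-building step described above can be sketched in plain Python. This is an illustration only, not the microsoftml implementation: the helper names `build_dictionary` and `indicator_vector` are hypothetical, slots are assumed to be assigned in order of first appearance, and a string missing from the dictionary is assumed to produce an all-zero vector.

```python
def build_dictionary(values):
    """Map each distinct string to a slot, in order of first appearance (assumption)."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
    return dictionary


def indicator_vector(value, dictionary):
    """Return a one-hot list whose length equals the dictionary size."""
    vector = [0] * len(dictionary)
    slot = dictionary.get(value)
    if slot is not None:  # unseen strings stay all-zero (assumption)
        vector[slot] = 1
    return vector


# The whole string is the category, so "Love it" and "I love it" get separate slots.
reviews = ["I hate it", "Love it", "I hate it", "I love it"]
review_dictionary = build_dictionary(reviews)
review_vectors = [indicator_vector(r, review_dictionary) for r in reviews]

print(review_dictionary)
print(review_vectors)
```

Each vector has three slots because the column contains three distinct strings; the repeated "I hate it" rows share the same vector.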

Arguments

cols

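A character string or list of variable names to transform. If a dict, the keys represent the names of the new variables to be created and the values name the input variables.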
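The three accepted shapes of `cols` can be pictured as one mapping from output name to input name. The sketch below is a hypothetical helper, not part of microsoftml; it assumes that the str and list forms keep the input name as the output name, which the description above implies but does not state.

```python
def normalize_cols(cols):
    """Return {output_name: input_name} for a str, list, or dict `cols` value."""
    if isinstance(cols, str):
        return {cols: cols}          # single column, name kept (assumption)
    if isinstance(cols, dict):
        return dict(cols)            # keys are the new variable names
    return {name: name for name in cols}  # list of columns, names kept (assumption)


print(normalize_cols("review"))
print(normalize_cols(["review", "title"]))
print(normalize_cols(dict(reviewCat="review")))
```

The dict form is the one to reach for when the original text column should remain available next to the transformed one.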

output_kind

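A character string that specifies the kind of output.

  • "Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.

  • "Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.

  • "Key": Outputs an index. The output is an integer ID (between 1 and the number of categories in the dictionary) of the category.

  • "Bin": Outputs a vector that is the binary representation of the category.

The default value is "Ind".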
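To make the four output kinds concrete, here is a minimal pure-Python sketch applied to one row that holds a vector of categories. The function names are hypothetical and the code does not call microsoftml. The "Bin" encoding in particular rests on assumptions: it encodes the zero-based dictionary position, most significant bit first, in just enough bits to cover the dictionary; the engine's actual width and bit order may differ.

```python
from collections import Counter


def to_key(categories, dictionary):
    """"Key": 1-based position of each category in the dictionary."""
    return [dictionary.index(c) + 1 for c in categories]


def to_ind(categories, dictionary):
    """"Ind": one indicator (one-hot) vector per input slot."""
    return [[1 if term == c else 0 for term in dictionary] for c in categories]


def to_bag(categories, dictionary):
    """"Bag": a single vector of occurrence counts."""
    counts = Counter(categories)
    return [counts[term] for term in dictionary]


def to_bin(categories, dictionary):
    """"Bin": binary digits of the zero-based position (width and order assumed)."""
    width = max(1, (len(dictionary) - 1).bit_length())
    return [[int(bit) for bit in format(dictionary.index(c), "0{}b".format(width))]
            for c in categories]


dictionary = ["a", "b", "c"]
row = ["b", "a", "b"]

print(to_key(row, dictionary))
print(to_ind(row, dictionary))
print(to_bag(row, dictionary))
print(to_bin(row, dictionary))
```

For a row with a single category, `to_ind` yields one vector that equals the `to_bag` result, which is the equivalence noted for "Bag".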

max_num_terms

An integer that specifies the maximum number of categories to include in the dictionary. The default value is 1000000.

terms

Optional character vector of terms or categories.

sort

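A character string that specifies the sorting criteria.

  • "Occurrence": Sort categories by occurrences. Most frequent is first.

  • "Value": Sort categories by values.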
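The two sort criteria can be compared with a short sketch that follows the wording above. `order_categories` is a hypothetical helper, not a microsoftml function, and the tie-breaking rule (first seen wins) is an assumption; the ordering actually produced by the compute engine has not been verified here.

```python
from collections import Counter


def order_categories(values, sort="Occurrence"):
    """Return the distinct categories in dictionary order for the given criterion."""
    if sort == "Occurrence":
        # most_common() orders by count, highest first; ties keep first-seen order.
        return [term for term, _ in Counter(values).most_common()]
    if sort == "Value":
        # Plain lexicographic ordering of the distinct strings.
        return sorted(set(values))
    raise ValueError("sort must be 'Occurrence' or 'Value'")


values = ["Love it", "I love it", "Love it", "I hate it", "Love it", "I love it"]

by_occurrence = order_categories(values, sort="Occurrence")
by_value = order_categories(values, sort="Value")

print(by_occurrence)
print(by_value)
```

The choice matters mostly for "Key" and "Bin" output, where the dictionary position becomes the encoded value.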

text_key_values

Whether key value metadata should be text, regardless of the actual input type.

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical_hash

Example

'''
Example on rx_logistic_regression and categorical.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical, rx_predict

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))

# Use a categorical transform: the entire string is treated as a category
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical(cols=dict(reviewCat="review"))])
                
# Note that 'I hate it' and 'I love it' (the only strings appearing more than once)
# have non-zero weights.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 20
improvement criterion: Mean Improvement
L1 regularization selected 3 of 20 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:01.6550695
Elapsed time: 00:00:00.2259981
OrderedDict([('(Bias)', 0.21317288279533386), ('I hate it', -0.7937591671943665), ('I love it', 0.19668534398078918)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.1385248
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213173     0.553092
1       I hate it          False -0.580586     0.358798
2         Love it           True  0.213173     0.553092
3  Really like it           True  0.213173     0.553092
4       I hate it          False -0.580586     0.358798
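The Score and Probability columns can be reproduced by hand from the printed coefficients, which helps when reading the output. The sketch below copies the three coefficients from the log and assumes that the score is the bias plus the weight of the matching category (zero if the string has no weight) and that the probability is the logistic sigmoid of the score, since no calibrator was trained. The names `score` and `probability` are illustrative only.

```python
import math

# Coefficients copied from the printed OrderedDict in the output.
coefficients = {
    "(Bias)": 0.21317288279533386,
    "I hate it": -0.7937591671943665,
    "I love it": 0.19668534398078918,
}


def score(review):
    """Bias plus the weight of the review's category; 0.0 if it has no weight."""
    return coefficients["(Bias)"] + coefficients.get(review, 0.0)


def probability(review):
    """Logistic sigmoid of the score."""
    return 1.0 / (1.0 + math.exp(-score(review)))


for review in ["This is great", "I hate it"]:
    print(review, round(score(review), 6), round(probability(review), 6))
```

Every review other than "I hate it" and "I love it" receives only the bias, which is why "This is great", "Love it", and "Really like it" share the same score in the table.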