microsoftml.categorical: Converts a text column into categories

Usage

microsoftml.categorical(cols: [str, dict, list], output_kind: ['Bag', 'Ind',
    'Key', 'Bin'] = 'Ind', max_num_terms: int = 1000000,
    terms: int = None, sort: ['Occurrence', 'Value'] = 'Occurrence',
    text_key_values: bool = False, **kargs)

Description

Categorical transform that can be performed on data before training a model.

Details

The categorical transform passes through a data set, operating on text columns, to build a dictionary of categories. For each row, the entire text string appearing in the input column is defined as a category. The output of the categorical transform is an indicator vector. Each slot in this vector corresponds to a category in the dictionary, so its length is the size of the built dictionary. The categorical transform can be applied to one or more columns, in which case it builds a separate dictionary for each column it is applied to.

categorical does not currently support handling factor data.
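
As a minimal sketch of this behavior (assuming rx_featurize from the same package is used to materialize the transform output; the column name and values are illustrative):

import pandas
from microsoftml import rx_featurize, categorical

df = pandas.DataFrame(dict(fruit=["apple", "banana", "apple", "cherry"]))

# Three distinct strings appear, so the dictionary has three categories
# and each row is featurized as an indicator vector of length 3.
featurized = rx_featurize(data=df,
                          ml_transforms=[categorical(cols=dict(fruit_cat="fruit"))])
print(featurized.head())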

Arguments

cols

A character string or list of variable names to transform. If dict, the keys represent the names of the new variables to be created.
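
For illustration, the three accepted forms look like this (the column names review and title are hypothetical):

from microsoftml import categorical

# A single column, transformed in place.
t1 = categorical(cols="review")

# Several columns; a separate dictionary is built for each.
t2 = categorical(cols=["review", "title"])

# A dict: keys name the new output columns, values name the inputs.
t3 = categorical(cols=dict(review_cat="review", title_cat="title"))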

output_kind

A character string that specifies the kind of output.

  • "Bag":輸出多組向量。 如果輸入資料行是類別的向量,則輸出會包含一個向量,其中每個位置中的值都是輸入向量中類別的出現次數。 如果輸入資料行包含單一類別,則指標向量和包向量相等

  • "Ind":輸出指標向量。 輸入資料行是類別的向量,而輸出會在輸入資料行中每個位置包含一個指標向量。

  • "Key":輸出索引。 輸出是介於 1 和類別目錄) 字典中類別數之間的整數識別碼 (。

  • "Bin":輸出向量,這是類別的二進位標記法。

預設值是 "Ind"
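
A short sketch of selecting a non-default kind (the review column is hypothetical):

from microsoftml import categorical

# Emit integer category IDs instead of indicator vectors.
key_transform = categorical(cols=dict(review_key="review"), output_kind="Key")

# Emit per-category occurrence counts.
bag_transform = categorical(cols=dict(review_bag="review"), output_kind="Bag")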

max_num_terms

An integer that specifies the maximum number of categories to include in the dictionary. The default value is 1000000.
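
For example, capping the dictionary size (a sketch with a hypothetical review column):

from microsoftml import categorical

# Keep at most 100 categories in the dictionary.
capped = categorical(cols=dict(review_cat="review"), max_num_terms=100)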

terms

Optional character vector of terms or categories.

sort

A character string that specifies the sorting criteria.

  • "Occurrence": Sort categories by occurrence, most frequent first.

  • "Value": Sort categories by value.

text_key_values

Whether key value metadata should be text, regardless of the actual input type.
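
This flag is most relevant with "Key" output; a hedged sketch (hypothetical review column):

from microsoftml import categorical

# Request text metadata for the key values of a "Key"-typed output.
keyed = categorical(cols=dict(review_key="review"),
                    output_kind="Key", text_key_values=True)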

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical_hash

Example

'''
Example on rx_logistic_regression and categorical.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical, rx_predict

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))

# Use a categorical transform: the entire string is treated as a category
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical(cols=dict(reviewCat="review"))])
                
# Note that 'I hate it' and 'I love it' (the only strings appearing more than once)
# have non-zero weights.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 20
improvement criterion: Mean Improvement
L1 regularization selected 3 of 20 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:01.6550695
Elapsed time: 00:00:00.2259981
OrderedDict([('(Bias)', 0.21317288279533386), ('I hate it', -0.7937591671943665), ('I love it', 0.19668534398078918)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.1385248
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213173     0.553092
1       I hate it          False -0.580586     0.358798
2         Love it           True  0.213173     0.553092
3  Really like it           True  0.213173     0.553092
4       I hate it          False -0.580586     0.358798