microsoftml.featurize_text: Converts text columns into numerical features

Article
05/04/2023

Usage

microsoftml.featurize_text(cols: [str, dict, list], language: ['AutoDetect',
    'English', 'French', 'German', 'Dutch', 'Italian', 'Spanish',
    'Japanese'] = 'English', stopwords_remover=None, case: ['Lower',
    'Upper', 'None'] = 'Lower', keep_diacritics: bool = False,
    keep_punctuations: bool = True, keep_numbers: bool = True,
    dictionary: dict = None, word_feature_extractor={'Name': 'NGram',
    'Settings': {'Weighting': 'Tf', 'MaxNumTerms': [10000000],
    'NgramLength': 1, 'AllLengths': True, 'SkipLength': 0}},
    char_feature_extractor=None, vector_normalizer: ['None', 'L1', 'L2',
    'LInf'] = 'L2', **kargs)

Description

Text transforms that can be performed on data before training a model.

Details

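The featurize_text transform produces a bag of counts of sequences of consecutive words, called n-grams, from a given corpus of text. It can do this in two ways:

Build a dictionary of n-grams and use the ID in the dictionary as the index in the bag;
Hash each n-gram and use the hash value as the index in the bag.

The purpose of hashing is to convert variable-length text documents into equal-length numeric feature vectors, to support dimensionality reduction, and to speed up the lookup of feature weights.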
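The two indexing strategies can be sketched in plain Python. This is only an illustration of the idea, not the microsoftml implementation: the helper names (`word_ngrams`, `dictionary_bag`, `hashed_bag`), the whitespace tokenizer, and the use of `zlib.crc32` as the hash function are all assumptions made for the example.

```python
import zlib
from collections import Counter


def word_ngrams(text, n=1):
    """Return the contiguous word n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dictionary_bag(docs, n=1):
    """Index each n-gram by the ID it receives in a dictionary built from the corpus."""
    vocab = {}
    for doc in docs:
        for gram in word_ngrams(doc, n):
            vocab.setdefault(gram, len(vocab))  # first-seen order assigns the ID
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for gram, count in Counter(word_ngrams(doc, n)).items():
            vec[vocab[gram]] = count
        vectors.append(vec)
    return vocab, vectors


def hashed_bag(docs, n=1, hash_bits=4):
    """Index each n-gram by a hash value; every vector has length 2**hash_bits."""
    size = 1 << hash_bits
    vectors = []
    for doc in docs:
        vec = [0] * size
        for gram in word_ngrams(doc, n):
            # crc32 is a stand-in hash chosen for this sketch only.
            vec[zlib.crc32(gram.encode("utf-8")) % size] += 1
        vectors.append(vec)
    return vectors


docs = ["I love it", "I hate it a bunch"]
vocab, dict_vectors = dictionary_bag(docs)
hash_vectors = hashed_bag(docs)
print(dict_vectors)  # [[1, 1, 1, 0, 0, 0], [1, 0, 1, 1, 1, 1]]
print([len(v) for v in hash_vectors])
```

The dictionary variant needs a pass over the corpus to build the vocabulary, and its vector length grows with the number of distinct n-grams. The hashed variant needs no vocabulary and has a fixed length, at the cost of possible collisions between different n-grams.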

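The text transform is applied to text input columns. It offers language detection, tokenization, stop word removal, text normalization, and feature generation. It supports the following languages by default: English, French, German, Dutch, Italian, Spanish, and Japanese.

The n-grams are represented as count vectors, with vector slots corresponding either to n-grams (created using n_gram) or to their hashes (created using n_gram_hash). Embedding n-grams in a vector space allows their contents to be compared efficiently. The slot values in the vector can be weighted by the following factors:

Term frequency: the number of occurrences of the slot in the text.
Inverse document frequency: a ratio (the logarithm of the inverse relative slot frequency) that measures the information a slot provides by determining how common or rare it is across the entire text.
Term frequency-inverse document frequency: the product of the term frequency and the inverse document frequency.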
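As a worked illustration of the three weightings, the sketch below computes them for a tiny tokenized corpus. It assumes the common definition idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The exact formula used by microsoftml is not stated here, so treat the numbers as illustrative only. The function names are hypothetical.

```python
import math
from collections import Counter

# A tiny, already tokenized corpus (three "documents").
docs = [["love", "it"], ["hate", "it"], ["love", "it", "love"]]


def term_frequency(doc):
    """Number of occurrences of each term in one document."""
    return dict(Counter(doc))


def inverse_document_frequency(docs):
    """log(N / df) per term; assumed formula, see the note above."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Product of term frequency and inverse document frequency."""
    return {term: count * idf[term] for term, count in term_frequency(doc).items()}


idf = inverse_document_frequency(docs)
weights = tf_idf(docs[2], idf)
print(term_frequency(docs[2]))
print(idf)
print(weights)
```

A term that appears in every document ("it" in this corpus) gets an inverse document frequency of zero, so it contributes nothing under the combined weighting, while rarer terms are weighted up.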

Arguments

cols

A character string or list of variable names to transform. If dict, the keys represent the names of the new variables to be created.

language

Specifies the language used in the data set. The following values are supported:

"AutoDetect": for automatic language detection.
"English"
"French"
"German"
"Dutch"
"Italian"
"Spanish"
"Japanese"

stopwords_remover

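Specifies the stop words remover to use. Three options are supported:

None: No stop words remover is used.
predefined: A precompiled, language-specific list of stop words is used that includes the most common words from Microsoft Office.
custom: A user-defined list of stop words. It accepts the following option: stopword.

The default value is None.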

case

Text casing using the rules of the invariant culture. Takes the following values:

"Lower"
"Upper"
"None"

The default value is "Lower".

keep_diacritics

False removes diacritical marks; True retains diacritical marks. The default value is False.

keep_punctuations

False removes punctuation; True retains punctuation. The default value is True.

keep_numbers

False removes numbers; True retains numbers. The default value is True.
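The effect of the case, keep_diacritics, keep_punctuations, and keep_numbers options can be pictured with a small standalone function. This is a sketch under stated assumptions, not the library's normalizer: `normalize_text` is a hypothetical helper, the order of the steps is a guess, and only ASCII punctuation from `string.punctuation` is handled.

```python
import string
import unicodedata


def normalize_text(text, case="Lower", keep_diacritics=False,
                   keep_punctuations=True, keep_numbers=True):
    """Apply casing, diacritic, punctuation, and digit handling to a string."""
    if case == "Lower":
        text = text.lower()
    elif case == "Upper":
        text = text.upper()
    # case == "None" leaves the casing untouched.

    if not keep_diacritics:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    if not keep_punctuations:
        # Assumption: only ASCII punctuation is considered in this sketch.
        text = "".join(ch for ch in text if ch not in string.punctuation)

    if not keep_numbers:
        text = "".join(ch for ch in text if not ch.isdigit())

    return text


sample = "Crème Brûlée, 2 please!"
print(normalize_text(sample))  # defaults mirror the documented defaults
print(normalize_text(sample, keep_punctuations=False, keep_numbers=False))
```

With the defaults, the sample is lowercased and loses its accents but keeps the comma, the digit, and the exclamation mark. Switching off the last two options removes those characters as well.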

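dictionary

A dictionary of allow-listed terms, which accepts the following options:

term: An optional character vector of terms or categories.
dropUnknowns: Drop items.
sort: Specifies how to order items when vectorized. Two orderings are supported:
- "occurrence": items appear in the order encountered.
- "value": items are sorted according to their default comparison. For example, text sorting is case sensitive (for example, 'A', then 'Z', then 'a').

The default value is None. Note that the stop words list takes precedence over the dictionary allow list, because stop words are removed before the dictionary terms are allow-listed.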
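The precedence rule can be shown with a small sketch. The `filter_tokens` helper below is hypothetical and only models the ordering described above (stop words first, allow list second); it does not reproduce the library's dictionary options.

```python
def filter_tokens(tokens, stopwords, allowed_terms):
    """Remove stop words first, then keep only allow-listed terms."""
    without_stopwords = [t for t in tokens if t not in stopwords]
    return [t for t in without_stopwords if t in allowed_terms]


tokens = ["i", "really", "love", "it"]
result = filter_tokens(
    tokens,
    stopwords={"i", "it", "love"},
    allowed_terms={"love", "really"},
)
print(result)  # ['really']
```

Although "love" is on the allow list, it never reaches that step because it was already removed as a stop word.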

word_feature_extractor

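Specifies the word feature extraction arguments. There are two different feature extraction mechanisms:

n_gram(): Count-based feature extraction (equivalent to WordBag). It accepts the following options: max_num_terms and weighting.
n_gram_hash(): Hashing-based feature extraction (equivalent to WordHashBag). It accepts the following options: hash_bits, seed, ordered, and invert_hash.

The default value is n_gram.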

char_feature_extractor

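Specifies the character feature extraction arguments. There are two different feature extraction mechanisms:

n_gram(): Count-based feature extraction (equivalent to WordBag). It accepts the following options: max_num_terms and weighting.
n_gram_hash(): Hashing-based feature extraction (equivalent to WordHashBag). It accepts the following options: hash_bits, seed, ordered, and invert_hash.

The default value is None.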

vector_normalizer

Normalizes vectors (rows) individually by rescaling them to unit norm. Takes one of the following values:

"None"
"L2"
"L1"
"LInf"

The default value is "L2".
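The three norms differ only in the scale factor each row is divided by. The sketch below applies them to one count vector. `normalize_vector` is a hypothetical helper written for illustration, and its handling of all-zero and empty rows (returned unchanged) is an assumption rather than documented library behavior.

```python
import math


def normalize_vector(vec, norm="L2"):
    """Rescale one row so that its chosen norm equals 1."""
    if norm == "None" or not vec:
        return list(vec)
    if norm == "L1":
        scale = sum(abs(x) for x in vec)            # sum of absolute values
    elif norm == "L2":
        scale = math.sqrt(sum(x * x for x in vec))  # Euclidean length
    elif norm == "LInf":
        scale = max(abs(x) for x in vec)            # largest absolute value
    else:
        raise ValueError("unknown norm: " + repr(norm))
    if scale == 0:
        return list(vec)  # assumption: leave all-zero rows unchanged
    return [x / scale for x in vec]


counts = [3, 4, 0]
for name in ("None", "L1", "L2", "LInf"):
    print(name, normalize_vector(counts, name))
```

For the row [3, 4, 0], the scale factors are 7 for L1, 5 for L2, and 4 for LInf.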

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

n_gram, n_gram_hash, get_sentiment.

Example

'''
Example with featurize_text and rx_logistic_regression.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, featurize_text, rx_predict
from microsoftml.entrypoints._stopwordsremover_predefined import predefined


train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))

out_model = rx_logistic_regression("like ~ review_tran",
                    data=train_reviews,
                    ml_transforms=[
                        featurize_text(cols=dict(review_tran="review"),
                            stopwords_remover=predefined(),
                            keep_punctuations=False)])
                            
# Use the model to score.
score_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(score_df.head())

Output:

Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 11
improvement criterion: Mean Improvement
L1 regularization selected 3 of 11 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.3725934
Elapsed time: 00:00:00.0131199
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0635453
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.443986     0.609208
1       I hate it          False -0.668449     0.338844
2         Love it           True  0.994339     0.729944
3  Really like it           True  0.443986     0.609208
4       I hate it          False -0.668449     0.338844