microsoftml.rx_featurize: Data transformation for data sources

Usage

microsoftml.rx_featurize(data: typing.Union[revoscalepy.datasource.RxDataSource.RxDataSource,
    pandas.core.frame.DataFrame],
    output_data: typing.Union[revoscalepy.datasource.RxDataSource.RxDataSource,
    str] = None, overwrite: bool = False,
    data_threads: int = None, random_seed: int = None,
    max_slots: int = 5000, ml_transforms: list = None,
    ml_transform_vars: list = None, row_selection: str = None,
    transforms: dict = None, transform_objects: dict = None,
    transform_function: str = None,
    transform_variables: list = None,
    transform_packages: list = None,
    transform_environment: dict = None, blocks_per_read: int = None,
    report_progress: int = None, verbose: int = 1,
    compute_context: revoscalepy.computecontext.RxComputeContext.RxComputeContext = None)

Description

Transforms data from an input data set to an output data set.

Arguments

data

A revoscalepy data source object, a data frame, or the path to a .xdf file.

output_data

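The name of an output text or XDF file, or an RxDataSource with write capabilities, in which to store the transformed data. If None, a data frame is returned. The default value is None.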

overwrite

If True, an existing output_data is overwritten; if False, an existing output_data is not overwritten. The default value is False.

data_threads

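An integer specifying the desired degree of parallelism in the data pipeline. If None, the number of threads used is determined internally. The default value is None.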

random_seed

Specifies the random seed. The default value is None.

max_slots

The maximum number of slots to return for vector-valued columns (<=0 to return all). The default value is 5000.

ml_transforms

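Specifies a list of MicrosoftML transforms to be performed on the data before training, or None if no transforms are to be performed. See featurize_text, categorical, and categorical_hash for the supported transformations. These transformations are performed after any specified Python transformations. The default value is None.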

ml_transform_vars

Specifies a character vector of variable names to be used in ml_transforms, or None if none are to be used. The default value is None.
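As a minimal sketch of how a transform list might be passed through ml_transforms, the snippet below applies categorical_hash to a string column. The sample data and the output column name review_hash are illustrative assumptions, not taken from this page. The import is guarded because microsoftml may not be installed in every Python environment, in which case the call is simply skipped.

```python
import pandas

try:
    from microsoftml import rx_featurize, categorical_hash
except ImportError:
    # microsoftml is not installed here; the featurization call below is skipped.
    rx_featurize = None

# Illustrative data (an assumption, not from the documentation).
reviews = pandas.DataFrame(data=dict(review=[
    "good service", "slow delivery", "good service"]))

featurized = None
if rx_featurize is not None:
    # review_hash is a hypothetical name for the output column.
    featurized = rx_featurize(
        data=reviews,
        ml_transforms=[categorical_hash(cols=dict(review_hash="review"))])
    print(featurized)
```

Because ml_transforms is a list, several transforms can in principle be supplied together in the same call.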

row_selection

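NOT SUPPORTED. Specifies the rows (observations) from the data set that are to be used by the model, either with the name of a logical variable from the data set (in quotes) or with a logical expression using variables in the data set. For example:

row_selection = "old" uses only observations in which the value of the variable old is True.
row_selection = (age > 20) & (age < 65) & (log(income) > 10) uses only observations in which the value of the age variable is between 20 and 65 and the log of the income variable is greater than 10.

The row selection is performed after processing any data transformations (see the arguments transforms or transform_function). As with all expressions, row_selection can be defined outside of the function call using the expression function.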
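Since row_selection is marked as not supported, one possible workaround is to filter a pandas data frame before passing it to rx_featurize. The sketch below reproduces the two example selections with boolean masks; the sample values and the frame name people are hypothetical. Note that this filters rows before any transformation runs, whereas row_selection is described as running afterwards.

```python
import numpy
import pandas

# Hypothetical sample data for illustration only.
people = pandas.DataFrame({
    "age": [18, 30, 45, 70],
    "income": [50000.0, 15000.0, 60000.0, 80000.0],
    "old": [False, False, False, True],
})

# Counterpart of row_selection = "old": keep rows where old is True.
only_old = people[people["old"]]

# Counterpart of
# row_selection = (age > 20) & (age < 65) & (log(income) > 10)
mask = (
    (people["age"] > 20)
    & (people["age"] < 65)
    & (numpy.log(people["income"]) > 10)
)
selected = people[mask]

print(only_old)
print(selected)
```

The filtered frame can then be supplied as the data argument.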

transforms

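NOT SUPPORTED. An expression that represents the first round of variable transformations. As with all expressions, transforms (or row_selection) can be defined outside of the function call using the expression function. The default value is None.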

transform_objects

NOT SUPPORTED. A named list that contains objects that can be referenced by transforms, transform_function, and row_selection. The default value is None.

transform_function

The variable transformation function. The default value is None.

transform_variables

A character vector of input data set variables needed for the transformation function. The default value is None.

transform_packages

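NOT SUPPORTED. A character vector specifying additional Python packages (beyond those specified in RxOptions.get_option("transform_packages")) to be made available and preloaded for use in variable transformation functions. For example, those explicitly defined in revoscalepy functions via their transforms and transform_function arguments, or those defined implicitly via their formula or row_selection arguments. The transform_packages argument may also be None, indicating that no packages other than those in RxOptions.get_option("transform_packages") are preloaded.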

transform_environment

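NOT SUPPORTED. A user-defined environment to serve as the parent of all environments developed internally and used for variable data transformation. If transform_environment = None, a new "hash" environment with parent revoscalepy.baseenv is used instead. The default value is None.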

blocks_per_read

Specifies the number of blocks to read for each chunk of data read from the data source.

report_progress

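An integer value that specifies the level of reporting on the row processing progress:

0: No progress is reported.
1: The number of processed rows is printed and updated.
2: The rows processed and timings are reported.
3: The rows processed and all timings are reported.

The default value is 1.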

verbose

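An integer value that specifies the amount of output wanted. If 0, no verbose output is printed during calculations. Integer values from 1 to 4 provide increasing amounts of information. The default value is 1.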

compute_context

Sets the context in which computations are executed, specified with a valid revoscalepy.RxComputeContext. Currently, local and revoscalepy.RxInSqlServer compute contexts are supported.

Returns

A data frame or a revoscalepy.RxDataSource object representing the created output data.

See also

rx_predict, revoscalepy.rx_data_step, revoscalepy.rx_import.

Example

'''
Example with rx_featurize.
'''
import pandas
from microsoftml import rx_featurize, categorical

# rx_featurize lets you access the output of the MicrosoftML transforms.
# This example looks at the output of the categorical transform.
# Create the data
categorical_data = pandas.DataFrame(data=dict(places_visited=[
                "London", "Brunei", "London", "Paris", "Seria"]),
                dtype="category")
                
print(categorical_data)

# Invoke the categorical transform
categorized = rx_featurize(data=categorical_data,
                           ml_transforms=[categorical(cols=dict(xdatacat="places_visited"))])

# Now let's look at the data
print(categorized)

Output:

  places_visited
0         London
1         Brunei
2         London
3          Paris
4          Seria
Beginning processing data.
Rows Read: 5, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 5, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0521300
Finished writing 5 rows.
Writing completed.
  places_visited  xdatacat.London  xdatacat.Brunei  xdatacat.Paris  \
0         London              1.0              0.0             0.0   
1         Brunei              0.0              1.0             0.0   
2         London              1.0              0.0             0.0   
3          Paris              0.0              0.0             1.0   
4          Seria              0.0              0.0             0.0   

   xdatacat.Seria  
0             0.0  
1             0.0  
2             0.0  
3             0.0  
4             1.0
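To make the shape of this output easier to read, here is a plain-Python sketch of the indicator encoding it shows: one xdatacat.<value> column per distinct value, holding 1.0 where the row matches and 0.0 elsewhere. This is an illustration only, not how microsoftml implements the categorical transform, and ordering the columns by first appearance is an assumption inferred from the output above.

```python
places_visited = ["London", "Brunei", "London", "Paris", "Seria"]

# Distinct values in order of first appearance (dict preserves insertion order).
categories = list(dict.fromkeys(places_visited))

# One indicator column per distinct value, named like the output above.
column_names = ["xdatacat." + value for value in categories]

# Each row has 1.0 in the column matching its value and 0.0 elsewhere.
rows = [
    [1.0 if place == value else 0.0 for value in categories]
    for place in places_visited
]

print(column_names)
for place, row in zip(places_visited, rows):
    print(place, row)
```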


Last updated on 2025-01-02