rxFeaturize: Data Transformation for RevoScaleR Data Sources

Article
05/04/2023

Transforms data from an input data set to an output data set.

Usage

  rxFeaturize(data, outData = NULL, overwrite = FALSE, dataThreads = NULL,
    randomSeed = NULL, maxSlots = 5000, mlTransforms = NULL,
    mlTransformVars = NULL, rowSelection = NULL, transforms = NULL,
    transformObjects = NULL, transformFunc = NULL, transformVars = NULL,
    transformPackages = NULL, transformEnvir = NULL,
    blocksPerRead = rxGetOption("blocksPerRead"),
    reportProgress = rxGetOption("reportProgress"), verbose = 1,
    computeContext = rxGetOption("computeContext"), ...)

Arguments

`data`

A RevoScaleR data source object, a data frame, or the path to an .xdf file.

`outData`

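The name of an output text or XDF file, or an RxDataSource with write capabilities, in which to store the transformed data. If NULL, a data frame is returned. The default value is NULL.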

`overwrite`

If TRUE, an existing outData is overwritten; if FALSE, an existing outData is not overwritten. The default value is FALSE.

`dataThreads`

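An integer specifying the desired degree of parallelism in the data pipeline. If NULL, the number of threads used is determined internally. The default value is NULL.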

`randomSeed`

Specifies the random seed. The default value is NULL.

`maxSlots`

The maximum number of slots to return for vector-valued columns (<= 0 returns all slots). The default value is 5000.

`mlTransforms`

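Specifies a list of MicrosoftML transforms to be performed on the data before training, or NULL if no transforms are to be performed. See featurizeText, categorical, and categoricalHash for the supported transformations. These transformations are performed after any specified R transformations. The default value is NULL.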

`mlTransformVars`

Specifies a character vector of variable names to be used in mlTransforms, or NULL if none are to be used. The default value is NULL.

`rowSelection`

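Specifies the rows (observations) from the data set that are to be used by the model, either with the name of a logical variable from the data set (in quotes) or with a logical expression using variables in the data set. For example, rowSelection = "old" uses only observations in which the value of the variable old is TRUE. rowSelection = (age > 20) & (age < 65) & (log(income) > 10) uses only observations in which the value of the age variable is between 20 and 65 and the log of the income variable is greater than 10. Row selection is performed after any data transformations have been processed (see the arguments transforms or transformFunc). As with all expressions, rowSelection can be defined outside of the function call using the expression function.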
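The selection logic described above can be previewed in base R before it is handed to rxFeaturize. The following is a minimal sketch, not the function's internal mechanism: the data frame `people` and its values are hypothetical, and the filter is evaluated with `eval` rather than by RevoScaleR.

```r
# Hypothetical sample data; column names follow the rowSelection example above.
people <- data.frame(
  age    = c(18, 30, 45, 70, 50),
  income = c(50000, 30000, 10000, 80000, 60000),
  old    = c(FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Define the selection outside of any function call, as an unevaluated expression.
selection <- expression((age > 20) & (age < 65) & (log(income) > 10))

# Evaluate the expression against the data to see which rows it would keep.
keep <- eval(selection[[1]], people)
selected <- people[keep, ]

# The quoted-name form, rowSelection = "old", corresponds to a logical column.
selectedOld <- people[people[["old"]], ]
```

Here `selected` holds the rows where all three conditions are true, and `selectedOld` holds the rows where `old` is TRUE.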

`transforms`

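An expression of the form list(name = expression, ...) that represents the first round of variable transformations. As with all expressions, transforms (or rowSelection) can be defined outside of the function call using the expression function. The default value is NULL.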
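To make the list(name = expression, ...) form concrete, the sketch below evaluates such a list against a small data frame in base R. It only illustrates the shape of the argument: `sample_data`, `logIncome`, and `incomeK` are hypothetical names, and RevoScaleR's own evaluation (which works chunk by chunk) is not reproduced here.

```r
# Hypothetical input data.
sample_data <- data.frame(income = c(10000, 30000, 60000))

# Transforms in the form list(name = expression, ...), defined outside of
# any function call with expression(), so nothing is evaluated yet.
transforms <- expression(list(logIncome = log(income), incomeK = income / 1000))

# Evaluate the list against the data; each name becomes a new variable.
new_cols <- eval(transforms[[1]], sample_data)
transformed <- cbind(sample_data, as.data.frame(new_cols))
```

Each name on the left-hand side becomes a new column, computed from the variables referenced on the right-hand side.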

`transformObjects`

A named list that contains objects that can be referenced by transforms, transformFunc, and rowSelection. The default value is NULL.

`transformFunc`

The variable transformation function. See rxTransform for details. The default value is NULL.

`transformVars`

A character vector of the input data set variables needed by the transformation function. See rxTransform for details. The default value is NULL.
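As a rough illustration of what a transformation function looks like, the sketch below follows the rxTransform convention in which the function receives a named list of vectors for the current chunk of data and returns a list of equal-length vectors. The function name `addLogIncome` and the variable `income` are hypothetical, and the function is called directly on a hand-built list rather than through rxFeaturize.

```r
# Hypothetical transformation function: takes a named list of column vectors
# and returns the list with one additional derived variable.
addLogIncome <- function(dataList) {
  dataList$logIncome <- log(dataList$income)
  dataList
}

# Stand-in for one chunk of data restricted to the variables that would be
# named in transformVars (here, "income").
chunk <- list(income = c(10000, 30000, 60000))
result <- addLogIncome(chunk)
```

Under these assumptions, such a function would be supplied as transformFunc = addLogIncome together with transformVars = "income".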

`transformPackages`

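A character vector specifying additional R packages (beyond those specified in rxGetOption("transformPackages")) to be made available and preloaded for use in variable transformation functions. These include functions referenced explicitly in RevoScaleR functions through their transforms and transformFunc arguments, or implicitly through their formula or rowSelection arguments. The transformPackages argument may also be NULL, indicating that no packages beyond those in rxGetOption("transformPackages") are preloaded. The default value is NULL.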

`transformEnvir`

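A user-defined environment that serves as the parent of all environments developed internally and used for variable data transformation. If transformEnvir = NULL, a new "hash" environment with parent baseenv() is used instead. The default value is NULL.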
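The practical difference between the default environment and a user-defined one can be seen in base R. This is a minimal sketch under assumptions: `scaleFactor` is a hypothetical helper value, and the child environment only stands in for the environments that would be created internally.

```r
# The documented default: a hashed environment whose parent is baseenv().
defaultEnv <- new.env(hash = TRUE, parent = baseenv())
baseVisible <- exists("log", envir = defaultEnv)   # base functions resolve

# A user-defined environment carrying a helper value (hypothetical name).
userEnv <- new.env()
assign("scaleFactor", 1000, envir = userEnv)

# An environment created with userEnv as its parent can see the helper.
child <- new.env(hash = TRUE, parent = userEnv)
scaled <- eval(quote(c(10000, 30000) / scaleFactor), child)

# The default-style environment cannot see it.
helperHidden <- !exists("scaleFactor", envir = defaultEnv)
```

Supplying an environment through transformEnvir is therefore one way to make helper objects visible to transformations; transformObjects is the other route described above.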

`blocksPerRead`

Specifies the number of blocks to read for each chunk of data read from the data source.

`reportProgress`

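An integer value that specifies the level of reporting on row processing progress:

0: No progress is reported.
1: The number of processed rows is printed and updated.
2: Rows processed and timings are reported.
3: Rows processed and all timings are reported.
The default value is 1.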

`verbose`

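An integer value that specifies the amount of output wanted. If 0, no verbose output is printed during calculations. Integer values from 1 to 4 provide increasing amounts of information. The default value is 1.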

`computeContext`

Sets the context in which computations are executed, specified with a valid RxComputeContext. Currently, local and RxInSqlServer compute contexts are supported.

`...`

Additional arguments to be passed directly to the Microsoft Compute Engine.

Value

A data frame or an RxDataSource object representing the created output data.

Author(s)

Microsoft Corporation Microsoft Technical Support

See also

rxDataStep, rxImport, rxTransform.

Examples


 # rxFeaturize basically allows you to access data from the MicrosoftML transforms
 # In this example we'll look at getting the output of the categorical transform

 # Create the data
 categoricalData <- data.frame(
   placesVisited = c(
     "London",
     "Brunei",
     "London",
     "Paris",
     "Seria"
   ),
   stringsAsFactors = FALSE
 )

 # Invoke the categorical transform
 categorized <- rxFeaturize(
   data = categoricalData,
   mlTransforms = list(categorical(vars = c(xDataCat = "placesVisited")))
 )

 # Now let's look at the data
 categorized