rxEnsemble: Ensembles

Article
01/02/2025

Trains an ensemble of models.

Usage

  rxEnsemble(formula = NULL, data, trainers, type = c("binary", "regression",
    "multiClass", "anomaly"), randomSeed = NULL,
    modelCount = length(trainers), replace = FALSE, sampRate = NULL,
    splitData = FALSE, combineMethod = c("median", "average", "vote"),
    maxCalibration = 1e+05, mlTransforms = NULL, mlTransformVars = NULL,
    rowSelection = NULL, transforms = NULL, transformObjects = NULL,
    transformFunc = NULL, transformVars = NULL, transformPackages = NULL,
    transformEnvir = NULL, blocksPerRead = rxGetOption("blocksPerRead"),
    reportProgress = rxGetOption("reportProgress"), verbose = 1,
    computeContext = rxGetOption("computeContext"), ...)

Arguments

`formula`

The formula, as described in rxFormula. Interaction terms and F() are not currently supported in MicrosoftML.

`data`

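A data source object, or a character string that specifies a .xdf file or a data frame object. Alternatively, it can be a list of data sources, in which case each model is trained on one of the data sources in the list. The length of the data list must then equal modelCount.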

`trainers`

A list of trainers with their arguments. The trainers are created with fastTrees, fastForest, fastLinear, logisticRegression, or neuralNet.

`type`

A character string that specifies the type of ensemble: "binary" for binary classification or "regression" for regression.

`randomSeed`

Specifies the random seed. The default value is NULL.

`modelCount`

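Specifies the number of models to train. If this number is greater than the length of the trainers list, the trainers list is repeated to match modelCount.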
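The recycling rule for modelCount can be pictured with a small base-R sketch. It uses plain character labels instead of real trainer objects, and the cyclic order produced by rep_len is an assumption: the description only states that the list is repeated.

```r
# Hypothetical stand-ins for trainer objects (plain labels, not MicrosoftML trainers)
trainers <- list("fastTrees", "fastLinear")
modelCount <- 5

# Assumption: the trainer list is recycled cyclically until modelCount entries exist
recycled <- rep_len(trainers, modelCount)
print(unlist(recycled))
```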

`replace`

A logical value that specifies whether observations are sampled with or without replacement. The default value is FALSE.

`sampRate`

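A positive scalar that specifies the fraction of observations to sample for each trainer. The default is 1.0 for sampling with replacement (that is, replace=TRUE) and 0.632 for sampling without replacement (that is, replace=FALSE). When splitData is TRUE, the default of sampRate is 1.0 (no sampling is done before splitting).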
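To make the two defaults concrete, here is a minimal base-R sketch of drawing the rows for one model. The helper sample_rows and the value of n are hypothetical, and the sketch does not claim to reproduce how rxEnsemble samples internally. For context, 0.632 is close to 1 - 1/e, the expected fraction of distinct rows in a bootstrap sample drawn with replacement.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 1000     # hypothetical number of observations

# Hypothetical helper: draw the row indices used to train one model
sample_rows <- function(n, sampRate, replace) {
  sample.int(n, size = round(sampRate * n), replace = replace)
}

with_repl    <- sample_rows(n, sampRate = 1.0,   replace = TRUE)
without_repl <- sample_rows(n, sampRate = 0.632, replace = FALSE)

length(with_repl)              # n draws; some rows appear more than once
length(unique(with_repl)) / n  # fraction of distinct rows in the bootstrap sample
length(without_repl)           # round(0.632 * n) distinct rows
```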

`splitData`

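A logical value that specifies whether to train the base models on non-overlapping partitions. The default value is FALSE. It is available only for the RxSpark compute context and is ignored for others.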
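As an illustration of what "non-overlapping partitions" means, the base-R sketch below deals shuffled row indices into one disjoint group per model. The shuffling and the round-robin assignment are assumptions made for the example; the way RxSpark actually partitions the data is not described here.

```r
set.seed(1)       # arbitrary seed so the sketch is reproducible
n <- 10           # hypothetical number of observations
modelCount <- 3   # hypothetical number of base models

# Assumption: rows are shuffled, then dealt round-robin into disjoint groups
partition <- split(sample.int(n), rep_len(seq_len(modelCount), n))

# Every row lands in exactly one partition
sort(unlist(partition, use.names = FALSE))
```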

`combineMethod`

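Specifies the method used to combine the models:

median computes the median of the individual model outputs,
average computes the average of the individual model outputs, and
vote computes (pos-neg) / the total number of models, where 'pos' is the number of positive outputs and 'neg' is the number of negative outputs.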
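The three combination rules can be sketched in base R as follows. The outputs matrix and the combine helper are hypothetical, and treating "positive" as strictly greater than zero is an assumption; this is an illustration of the formulas above, not the library's implementation.

```r
# Hypothetical raw outputs of five models (columns) for three observations (rows)
outputs <- rbind(
  c( 0.8,  0.6, -0.2,  0.9,  0.4),
  c(-0.5, -0.1,  0.3, -0.7, -0.9),
  c( 0.2, -0.4,  0.1, -0.3,  2.0)
)

# Hypothetical helper: combine the outputs of all models for one observation
combine <- function(x, method = c("median", "average", "vote")) {
  method <- match.arg(method)
  switch(method,
    median  = median(x),
    average = mean(x),
    # Assumption: "positive" means > 0 and "negative" means < 0
    vote    = (sum(x > 0) - sum(x < 0)) / length(x)
  )
}

med <- apply(outputs, 1, combine, method = "median")
avg <- apply(outputs, 1, combine, method = "average")
vot <- apply(outputs, 1, combine, method = "vote")
```

In the third row, the single large output (2.0) pulls the average well above the median, which shows why the median is less sensitive to one outlying model.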

`maxCalibration`

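Specifies the maximum number of examples to use for calibration. This argument is ignored for all tasks other than binary classification.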

`mlTransforms`

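Specifies a list of MicrosoftML transforms to be performed on the data before training, or NULL if no transforms are to be performed. Transforms that require an additional pass over the data (such as featurizeText and categorical) are not allowed. These transforms are performed after any specified R transformations. The default value is NULL.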

`mlTransformVars`

Specifies a character vector of variable names to be used in mlTransforms, or NULL if none are to be used. The default value is NULL.

`rowSelection`

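Specifies the rows (observations) from the data set that are to be used by the model, either as the name of a logical variable from the data set (in quotes) or as a logical expression using variables in the data set. For example, rowSelection = "old" uses only observations in which the value of the variable old is TRUE. rowSelection = (age > 20) & (age < 65) & (log(income) > 10) uses only observations in which the value of the age variable is between 20 and 65 and the value of the log of the income variable is greater than 10. The row selection is performed after processing any data transformations (see the arguments transforms or transformFunc). As with all expressions, rowSelection can be defined outside of the function call using the expression function.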
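The effect of the two rowSelection examples above can be reproduced on an ordinary data frame. The data frame df below is hypothetical, and the sketch only shows which rows satisfy each condition in base R; it does not call rxEnsemble.

```r
# Hypothetical data frame with the variables named in the examples above
df <- data.frame(
  age    = c(18, 30, 45, 70, 50),
  income = c(15000, 40000, 20000, 90000, 60000),
  old    = c(FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Rows that rowSelection = "old" would keep
old_rows <- df[df$old, ]

# Rows that rowSelection = (age > 20) & (age < 65) & (log(income) > 10) would keep
keep <- with(df, (age > 20) & (age < 65) & (log(income) > 10))
selected <- df[keep, ]
```

Note that log(income) > 10 corresponds to an income above exp(10), roughly 22026, so the 45-year-old with an income of 20000 is excluded even though the age condition holds.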

`transforms`

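An expression of the form list(name = expression, ...) that represents the first round of variable transformations. As with all expressions, transforms (or rowSelection) can be defined outside of the function call using the expression function. The default value is NULL.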

`transformObjects`

A named list that contains objects that can be referenced by transforms, transformFunc, and rowSelection. The default value is NULL.

`transformFunc`

The variable transformation function. See rxTransform for details. The default value is NULL.

`transformVars`

A character vector of input data set variables needed by the transformation function. See rxTransform for details. The default value is NULL.

`transformPackages`

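A character vector that specifies additional R packages (beyond those specified in rxGetOption("transformPackages")) to be made available and preloaded for use in variable transformation functions. Examples are packages used explicitly in RevoScaleR functions through their transforms and transformFunc arguments, or implicitly through their formula or rowSelection arguments. The transformPackages argument can also be NULL, which indicates that no packages beyond those in rxGetOption("transformPackages") are preloaded. The default value is NULL.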

`transformEnvir`

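A user-defined environment that serves as the parent of all environments developed internally and used for variable data transformation. If transformEnvir = NULL, a new "hash" environment with parent baseenv() is used instead. The default value is NULL.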

`blocksPerRead`

Specifies the number of blocks to read for each chunk of data read from the data source.

`reportProgress`

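An integer value that specifies the level of reporting on the row processing progress:

0: no progress is reported.
1: the number of processed rows is printed and updated.
2: rows processed and timings are reported.
3: rows processed and all timings are reported.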

`verbose`

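An integer value that specifies the amount of output wanted. If 0, no verbose output is printed during calculations. Integer values from 1 to 4 provide increasing amounts of information. The default value is 1.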

`computeContext`

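Sets the context in which computations are executed, specified with a valid RxComputeContext. Currently, local and RxSpark compute contexts are supported. When RxSpark is specified, the models are trained in a distributed way and the ensembling is done locally. Note that the compute context cannot be non-waiting.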

`...`

Additional arguments to be passed directly to the Microsoft Compute Engine.

Details

rxEnsemble is a function that trains a number of models of various kinds to obtain better predictive performance than could be obtained from a single model.

Value

An rxEnsemble object with the trained ensemble model.

Examples


 # Create an ensemble of regression rxFastTrees models

 # use xdf data source
 dataFile <- file.path(rxGetOption("sampleDataDir"), "claims4blocks.xdf")
 rxGetInfo(dataFile, getVarInfo = TRUE, getBlockSizes = TRUE)
 form <- cost ~ age + type + number

 rxSetComputeContext("localpar")
 rxGetComputeContext()

 # build an ensemble model that contains three 'rxFastTrees' models with different parameters
 ensemble <- rxEnsemble(
     formula = form,
     data = dataFile,
     type = "regression",
     trainers = list(fastTrees(), fastTrees(numTrees = 60), fastTrees(learningRate = 0.1)), #a list of trainers with their arguments.
     replace = TRUE # Indicates using a bootstrap sample for each trainer
     )

 # use text data source
 colInfo <- list(DayOfWeek = list(type = "factor", levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))

 source <- system.file("SampleData/AirlineDemoSmall.csv", package = "RevoScaleR")
 data <- RxTextData(source, missingValueString = "M", colInfo = colInfo)

 # When 'distributed' is TRUE distributed data source is created
 distributed <- FALSE
 if (distributed) {
     bigDataDirRoot <- "/share"
     inputDir <- file.path(bigDataDirRoot, "AirlineDemoSmall")
     rxHadoopMakeDir(inputDir)
     rxHadoopCopyFromLocal(source, inputDir)
     hdfsFS <- RxHdfsFileSystem()
     data <- RxTextData(file = inputDir, missingValueString = "M", colInfo = colInfo, fileSystem = hdfsFS)
 }

 # When 'distributed' is TRUE training is distributed
 if (distributed) {
     cc <- rxSetComputeContext(RxSpark())
 } else {
     cc <- rxGetComputeContext()
 }

 ensemble <- rxEnsemble(
     formula = ArrDelay ~ DayOfWeek,
     data = data,
     type = "regression",
     trainers = list(fastTrees(), fastTrees(numTrees = 60), fastTrees(learningRate = 0.1)), # The ensemble will contain three 'rxFastTrees' models
     replace = TRUE # Indicates using a bootstrap sample for each trainer
     )

 # Change the compute context back to previous for scoring
 rxSetComputeContext(cc)

 # Put score and model variables in data frame
 scores <- rxPredict(ensemble, data = data, writeModelVars = TRUE)

 # Plot actual versus predicted values with smoothed line
 rxLinePlot(Score ~ ArrDelay, type = c("p", "smooth"), data = scores)
