mutualInformation: Feature Selection Mutual Information Mode

Mutual information mode of feature selection used in the feature selection transform selectFeatures.

Usage

  mutualInformation(numFeaturesToKeep = 1000, numBins = 256, ...)

Arguments

numFeaturesToKeep

If the number of features to keep is specified to be n, the transform picks the n features that have the highest mutual information with the dependent variable. The default value is 1000.

numBins

Maximum number of bins for numeric values. A power of 2 is recommended. The default value is 256. (See the binning sketch after this argument list.)

...

Additional arguments passed directly to the Microsoft Compute Engine.
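
To compute mutual information, numeric features are first discretized into at most numBins bins. The engine's actual binning strategy is not documented here; the following is only a rough sketch, assuming equal-width binning, to illustrate the role of the numBins argument:

  # Illustrative only: discretize a numeric feature into at most numBins
  # equal-width bins (the engine may use a different binning scheme).
  numBins <- 256
  x <- rnorm(1000)
  bins <- cut(x, breaks = numBins, labels = FALSE)
  head(table(bins))   # counts per bin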

Details

The mutual information of two random variables X and Y is a measure of the mutual dependence between the variables. Formally, the mutual information can be written as:

I(X;Y) = E[log(p(x,y)) - log(p(x)) - log(p(y))]

where the expectation is taken over the joint distribution of X and Y. Here, p(x,y) is the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y, respectively. In general, higher mutual information between the dependent variable (or label) and an independent variable (or feature) means that the label has a stronger mutual dependence on that feature.
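
To make the formula above concrete, here is a minimal sketch (purely illustrative, not the engine's implementation) that estimates I(X;Y) for two discrete vectors from their empirical joint distribution:

  # Illustrative only: empirical mutual information of two discrete vectors,
  # in natural-log units.
  mutualInfo <- function(x, y) {
    joint <- table(x, y) / length(x)          # empirical p(x, y)
    px <- rowSums(joint)                      # marginal p(x)
    py <- colSums(joint)                      # marginal p(y)
    terms <- joint * (log(joint) - log(outer(px, py)))
    sum(terms[joint > 0])                     # skip zero-probability cells
  }

  # A feature that tracks the label closely has higher mutual information.
  label   <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
  feature <- c(1, 0, 1, 0, 1, 0, 1, 1)
  mutualInfo(feature, label)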

The mutual information feature selection mode selects features based on their mutual information. It keeps the top numFeaturesToKeep features with the largest mutual information with the label.
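
Conceptually, the mode scores every candidate feature by its mutual information with the label and keeps the numFeaturesToKeep best-scoring ones. A rough sketch of that ranking step, reusing the illustrative mutualInfo helper above (again, not the engine's code), could look like:

  # Illustrative only: keep the numFeaturesToKeep columns of a feature matrix
  # with the highest mutual information with the label.
  selectTopByMI <- function(features, label, numFeaturesToKeep = 1000) {
    scores <- vapply(seq_len(ncol(features)),
                     function(j) mutualInfo(features[, j], label),
                     numeric(1))
    keep <- order(scores, decreasing = TRUE)
    keep <- keep[seq_len(min(numFeaturesToKeep, length(keep)))]
    features[, keep, drop = FALSE]
  }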

Value

A character string that defines the mode.

Author(s)

Microsoft Corporation Microsoft Technical Support

References

Wikipedia: Mutual Information

See Also

minCount, selectFeatures

Examples


 trainReviews <- data.frame(review = c( 
         "This is great",
         "I hate it",
         "Love it",
         "Do not like it",
         "Really like it",
         "I hate it",
         "I like it a lot",
         "I kind of hate it",
         "I do like it",
         "I really hate it",
         "It is very good",
         "I hate it a bunch",
         "I love it a bunch",
         "I hate it",
         "I like it very much",
         "I hate it very much.",
         "I really do love it",
         "I really do hate it",
         "Love it!",
         "Hate it!",
         "I love it",
         "I hate it",
         "I love it",
         "I hate it",
         "I love it"),
      like = c(TRUE, FALSE, TRUE, FALSE, TRUE,
         FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
         FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, 
         FALSE, TRUE, FALSE, TRUE), stringsAsFactors = FALSE
     )

     testReviews <- data.frame(review = c(
         "This is great",
         "I hate it",
         "Love it",
         "Really like it",
         "I hate it",
         "I like it a lot",
         "I love it",
         "I do like it",
         "I really hate it",
         "I love it"), stringsAsFactors = FALSE)

 # Use a categorical hash transform which generates 128 features.
 outModel1 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7)))
 summary(outModel1)

 # Apply a categorical hash transform and a count feature selection transform
 # which selects only those hash features that have a value.
 outModel2 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(
   categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7), 
   selectFeatures("reviewCatHash", mode = minCount())))
 summary(outModel2)

 # Apply a categorical hash transform and a mutual information feature selection
 # transform, which keeps the hash features with the highest mutual information
 # with the label.
 outModel3 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(
   categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7), 
   selectFeatures("reviewCatHash", mode = mutualInformation(numFeaturesToKeep = 10))))
 summary(outModel3)