mutualInformation: Feature Selection Mutual Information Mode

The mutual information mode of feature selection used in the feature selection transform selectFeatures.

Usage

  mutualInformation(numFeaturesToKeep = 1000, numBins = 256, ...)

Arguments

numFeaturesToKeep

If the number of features to keep is specified to be n, the transform picks the n features that have the highest mutual information with the dependent variable. The default value is 1000.

numBins

Maximum number of bins for numerical values. Powers of 2 are recommended. The default value is 256.

...

Additional arguments to be passed directly to the Microsoft Compute Engine.

Details

The mutual information of two random variables X and Y is a measure of the mutual dependence between the variables. Formally, the mutual information can be written as:

I(X;Y) = E[log(p(x,y)) - log(p(x)) - log(p(y))]

where the expectation is taken over the joint distribution of X and Y. Here, p(x,y) is the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively. In general, a higher mutual information between the dependent variable (or label) and an independent variable (or feature) means that the label has a higher mutual dependence on that feature.
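As a quick numerical illustration of this formula (not part of the package), the following R snippet computes the mutual information of two binary variables from a hypothetical joint probability table, using natural logarithms:

  # Hypothetical joint probability table p(x,y) for two binary variables X and Y
  pxy <- matrix(c(0.30, 0.10,
                  0.15, 0.45), nrow = 2, byrow = TRUE)
  px <- rowSums(pxy)   # marginal p(x)
  py <- colSums(pxy)   # marginal p(y)

  # I(X;Y) = E[log p(x,y) - log p(x) - log p(y)], summed over the joint table
  mi <- sum(pxy * (log(pxy) - log(px %o% py)))
  mi   # roughly 0.13 nats for this table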

The mutual information feature selection mode selects features based on their mutual information with the label: it keeps the top numFeaturesToKeep features with the largest mutual information with the label.
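Conceptually, and setting aside how the engine actually computes the scores, the selection step amounts to ranking the features by their mutual-information score with the label and keeping the top numFeaturesToKeep of them. A minimal sketch with made-up scores:

  # Hypothetical mutual-information scores of five features with the label
  miScores <- c(f1 = 0.02, f2 = 0.31, f3 = 0.00, f4 = 0.17, f5 = 0.25)
  numFeaturesToKeep <- 3

  # Keep the features with the largest mutual information with the label
  keep <- names(sort(miScores, decreasing = TRUE))[seq_len(numFeaturesToKeep)]
  keep   # "f2" "f5" "f4"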

Value

A character string that defines the mode.

Author(s)

Microsoft Corporation Microsoft Technical Support

References

Wikipedia: Mutual Information

See also

minCount, selectFeatures

Examples


 trainReviews <- data.frame(review = c( 
         "This is great",
         "I hate it",
         "Love it",
         "Do not like it",
         "Really like it",
         "I hate it",
         "I like it a lot",
         "I kind of hate it",
         "I do like it",
         "I really hate it",
         "It is very good",
         "I hate it a bunch",
         "I love it a bunch",
         "I hate it",
         "I like it very much",
         "I hate it very much.",
         "I really do love it",
         "I really do hate it",
         "Love it!",
         "Hate it!",
         "I love it",
         "I hate it",
         "I love it",
         "I hate it",
         "I love it"),
      like = c(TRUE, FALSE, TRUE, FALSE, TRUE,
         FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
         FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, 
         FALSE, TRUE, FALSE, TRUE), stringsAsFactors = FALSE
     )

     testReviews <- data.frame(review = c(
         "This is great",
         "I hate it",
         "Love it",
         "Really like it",
         "I hate it",
         "I like it a lot",
         "I love it",
         "I do like it",
         "I really hate it",
         "I love it"), stringsAsFactors = FALSE)

 # Use a categorical hash transform which generates 128 features.
 outModel1 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7)))
 summary(outModel1)

 # Apply a categorical hash transform and a count feature selection transform
 # which selects only those hash slot features that have a value.
 outModel2 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(
   categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7), 
   selectFeatures("reviewCatHash", mode = minCount())))
 summary(outModel2)

 # Apply a categorical hash transform and a count feature selection transform
 # which selects only those features appearing with a count of at least 5.
 outModel3 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(
   categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7), 
   selectFeatures("reviewCatHash", mode = minCount(count = 5))))
 summary(outModel3)
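The examples above all use the minCount mode. A sketch of the mutualInformation mode itself is shown below, reusing trainReviews; the formula form of selectFeatures (with the label like on the left-hand side) and numFeaturesToKeep = 10 are illustrative assumptions rather than the only possible usage:

 # Apply a categorical hash transform and a mutual information feature selection
 # transform which keeps the hash features with the highest mutual information
 # with the label (numFeaturesToKeep = 10 is an arbitrary illustrative choice).
 outModel4 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(
   categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7), 
   selectFeatures(like ~ reviewCatHash, mode = mutualInformation(numFeaturesToKeep = 10))))
 summary(outModel4)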