ngram: Machine Learning Feature Extractors

Article
05/23/2023

Feature extractors that can be used with mtText.

Usage

  ngramCount(ngramLength = 1, skipLength = 0, maxNumTerms = 1e+07,
    weighting = "tf")

  ngramHash(ngramLength = 1, skipLength = 0, hashBits = 16,
    seed = 314489979, ordered = TRUE, invertHash = 0)

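Arguments

`ngramLength`

An integer that specifies the maximum number of tokens to take when constructing an n-gram. The default value is 1.

`skipLength`

An integer that specifies the maximum number of tokens to skip when constructing an n-gram. If the skip length is k, then an n-gram can contain up to k skips (not necessarily consecutive). For example, if k=2, then the 3-grams extracted from the text "the sky is blue today" are: "the sky is", "the sky blue", "the sky today", "the is blue", "the is today" and "the blue today". The default value is 0.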
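
The skip rule above can be illustrated with a small base R sketch. This is not the MicrosoftML implementation: `skip_grams` is a hypothetical helper, and it assumes whitespace-separated tokens and a budget of at most k skipped tokens per n-gram. Its output that begins with "the" matches the six 3-grams listed above; the sketch also emits 3-grams that start at later tokens, such as "sky is blue".

```r
# Hypothetical helper, not part of MicrosoftML: enumerate n-grams (n >= 2)
# whose tokens keep their original order and skip at most k tokens in total.
skip_grams <- function(tokens, n, k) {
  stopifnot(n >= 2, k >= 0, length(tokens) >= n)
  out <- character(0)
  for (start in seq_len(length(tokens) - n + 1)) {
    # The last token may sit at most k positions past the contiguous window.
    last <- min(length(tokens), start + n - 1 + k)
    rest <- (start + 1):last
    # Choose index combinations into 'rest' (avoids combn's scalar shortcut).
    picks <- combn(length(rest), n - 1, simplify = FALSE)
    grams <- vapply(picks, function(i) {
      paste(tokens[c(start, rest[i])], collapse = " ")
    }, character(1))
    out <- c(out, grams)
  }
  out
}

tokens <- strsplit("the sky is blue today", " ")[[1]]
grams <- skip_grams(tokens, n = 3, k = 2)
# Show only the 3-grams anchored on the first token.
print(grams[startsWith(grams, "the ")])
```

Setting k to 0 in this sketch reduces it to ordinary contiguous n-grams.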

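`maxNumTerms`

An integer that specifies the maximum number of categories to include in the dictionary. The default value is 10000000.

`weighting`

A character string that specifies the weighting criteria:

"tf": use term frequency.
"idf": use inverse document frequency.
"tfidf": use both term frequency and inverse document frequency.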

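To make the three weighting options concrete, here is a base R sketch on a tiny, already tokenized corpus. It assumes the common textbook definitions (raw counts for tf, log(N / df) for idf, and their product for tfidf). The source does not state the exact formulas MicrosoftML applies, so treat the numbers as illustrative only; `docs`, `vocab`, and the other names are hypothetical.

```r
# Three tiny tokenized documents (hypothetical data for illustration).
docs <- list(c("i", "love", "it"), c("i", "hate", "it"), c("love", "it"))
vocab <- sort(unique(unlist(docs)))

# tf: raw count of each term per document (rows = documents, columns = terms).
tf <- t(vapply(docs, function(d) {
  as.numeric(table(factor(d, levels = vocab)))
}, numeric(length(vocab))))
colnames(tf) <- vocab

# idf (assumed formula): log of (number of documents / documents containing the term).
df <- colSums(tf > 0)
idf <- log(length(docs) / df)

# tfidf: scale each term column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)

print(tf)
print(round(idf, 3))
print(round(tfidf, 3))
```

Under these assumed formulas, a term that appears in every document (here "it") gets an idf of zero, so it contributes nothing under "idf" or "tfidf" weighting.
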
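`hashBits`

An integer value. The number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.

`seed`

An integer value. The hashing seed. The default value is 314489979.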
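
The role of `hashBits` and `seed` can be sketched with a toy hash in base R. `toy_hash` is a hypothetical function written for this illustration and is not the hash MicrosoftML uses. It only shows that every term lands in one of 2^hashBits slots and that the seed sets the starting state of the hash.

```r
# Hypothetical toy hash (NOT the hash used by MicrosoftML).
toy_hash <- function(term, seed, hash_bits) {
  stopifnot(hash_bits >= 1, hash_bits <= 30)
  h <- seed %% 2^31
  for (code in utf8ToInt(term)) {
    # Values stay below 2^36, so double arithmetic is exact here.
    h <- (h * 31 + code) %% 2^31
  }
  h %% 2^hash_bits   # slot index in 0 .. 2^hash_bits - 1
}

terms <- c("i", "love", "hate", "it", "really", "a", "lot")
# Two hash bits give 2^2 = 4 slots for 7 distinct terms.
slots <- vapply(terms, toy_hash, numeric(1), seed = 314489979, hash_bits = 2)
print(slots)
# With more distinct terms than slots, some terms must share a slot.
print(anyDuplicated(slots) > 0)
```

A larger `hashBits` value gives more slots and fewer collisions, at the cost of a wider feature vector.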

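`ordered`

TRUE to include the position of each term in the hash. Otherwise, FALSE. The default value is TRUE.

`invertHash`

An integer that specifies the limit on the number of keys that can be used to generate the slot name. 0 means no invert hashing; -1 means no limit. While a zero value gives better performance, a non-zero value is needed to get meaningful coefficient names. The default value is 0.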

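Details

ngramCount defines the arguments for count-based feature extraction. It accepts the following options: ngramLength, skipLength, maxNumTerms and weighting.

ngramHash defines the arguments for hashing-based feature extraction. It accepts the following options: ngramLength, skipLength, hashBits, seed, ordered and invertHash.

Value

A character string defining the transform.

Author(s)

Microsoft Corporation Microsoft Technical Support

See also

featurizeText.

Examples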


  library(MicrosoftML)

  # Toy sentiment data: five positive and five negative opinions
  myData <- data.frame(opinion = c(
      "I love it!",
      "I love it!",
      "Love it!",
      "I love it a lot!",
      "Really love it!",
      "I hate it",
      "I hate it",
      "I hate it.",
      "Hate it",
      "Hate"),
      like = rep(c(TRUE, FALSE), each = 5),
      stringsAsFactors = FALSE)

  # Hash-based extraction: 3 hash bits; invertHash = -1 keeps readable coefficient names
  outModel1 <- rxLogisticRegression(like ~ opinionCount, data = myData,
      mlTransforms = list(featurizeText(vars = c(opinionCount = "opinion"),
          wordFeatureExtractor = ngramHash(invertHash = -1, hashBits = 3))))
  summary(outModel1)

  # Count-based extraction: dictionary capped at 5 terms, weighted by term frequency
  outModel2 <- rxLogisticRegression(like ~ opinionCount, data = myData,
      mlTransforms = list(featurizeText(vars = c(opinionCount = "opinion"),
          wordFeatureExtractor = ngramCount(maxNumTerms = 5, weighting = "tf"))))
  summary(outModel2)
