ngram: Machine Learning Feature Extractors

Article
05/04/2023

Feature extractors that can be used with mtText.

Usage

  ngramCount(ngramLength = 1, skipLength = 0, maxNumTerms = 1e+07,
    weighting = "tf")

  ngramHash(ngramLength = 1, skipLength = 0, hashBits = 16,
    seed = 314489979, ordered = TRUE, invertHash = 0)

Arguments

`ngramLength`

An integer that specifies the maximum number of tokens to take when constructing an n-gram. The default value is 1.

`skipLength`

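An integer that specifies the maximum number of tokens to skip when constructing an n-gram. If the value specified as skip length is k, then an n-gram can contain up to k skips (not necessarily contiguous). For example, if k=2, then the 3-grams extracted from the text "the sky is blue today" are: "the sky is", "the sky blue", "the sky today", "the is blue", "the is today" and "the blue today". The default value is 0.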
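The skip rule can be illustrated with a minimal base R sketch. It does not call MicrosoftML, and `skip_ngrams_from` is a hypothetical helper written only for this illustration. It enumerates the n-grams anchored at one starting token, which is how the six 3-grams in the example above (all beginning with "the") arise when k=2.

```r
# Hypothetical helper (not part of MicrosoftML): list the n-grams that start
# at position `start` and contain at most k skipped tokens in total.
skip_ngrams_from <- function(tokens, start, n, k) {
  if (n == 1) return(tokens[start])
  # With at most k skips, the n-gram spans at most n + k consecutive tokens.
  last <- min(length(tokens), start + n - 1 + k)
  rest <- start + seq_len(last - start)   # positions available after `start`
  if (length(rest) < n - 1) return(character(0))
  # Choose the remaining n - 1 positions in every possible way.
  picks <- combn(length(rest), n - 1)
  vapply(seq_len(ncol(picks)), function(j) {
    paste(tokens[c(start, rest[picks[, j]])], collapse = " ")
  }, character(1))
}

tokens <- c("the", "sky", "is", "blue", "today")
grams <- skip_ngrams_from(tokens, start = 1, n = 3, k = 2)
print(grams)
```

With `k = 0` the same call returns only the contiguous 3-gram "the sky is".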

`maxNumTerms`

An integer that specifies the maximum number of categories to include in the dictionary. The default value is 10000000.

`weighting`

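A character string that specifies the weighting criteria:

"tf": to use term frequency.
"idf": to use inverse document frequency.
"tfidf": to use both term frequency and inverse document frequency.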
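As a rough illustration of how the three weightings differ, the following base R sketch computes them for a tiny corpus. The IDF definition used here, log(N / df), is an assumption chosen for the example; this page does not state the exact formula that `ngramCount` applies, so the numbers need not match the library's output.

```r
# Three tokenized toy documents (lowercased, punctuation removed).
docs <- list(
  c("i", "love", "it"),
  c("i", "hate", "it"),
  c("love", "it")
)
vocab <- sort(unique(unlist(docs)))

# "tf": raw count of each term in each document (rows = documents).
tf <- t(vapply(docs, function(d) {
  as.numeric(table(factor(d, levels = vocab)))
}, numeric(length(vocab))))
colnames(tf) <- vocab

# "idf": assumed definition log(N / df), where df is the number of
# documents that contain the term.
df <- colSums(tf > 0)
idf <- log(length(docs) / df)

# "tfidf": term frequency scaled column-wise by the inverse document frequency.
tfidf <- sweep(tf, 2, idf, "*")
print(round(tfidf, 3))
```

Under this definition a term that occurs in every document, such as "it", gets an IDF of zero and therefore contributes nothing to the "tfidf" features.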

`hashBits`

An integer value. The number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.
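Hashing into b bits means every term lands in one of 2^b slots, so a small `hashBits` value makes collisions more likely. The sketch below shows the idea with a toy polynomial hash in base R. `toy_hash` is a hypothetical function for illustration only and is not the hash function that `ngramHash` uses.

```r
# Toy string hash (NOT the hash used by ngramHash): maps a term to one of
# 2^hashBits slots, returned as a 1-based slot index.
toy_hash <- function(term, hashBits, seed = 0) {
  stopifnot(hashBits >= 1, hashBits <= 30)
  n_slots <- 2^hashBits
  h <- seed %% n_slots
  for (code in utf8ToInt(term)) {
    h <- (h * 31 + code) %% n_slots   # stay inside [0, n_slots)
  }
  h + 1
}

terms <- c("i", "love", "it", "hate", "a", "lot", "really")
slots <- vapply(terms, toy_hash, numeric(1), hashBits = 3)
print(slots)          # every value lies between 1 and 8
print(table(slots))   # any count above 1 is a collision
```

Raising `hashBits` enlarges the feature vector but reduces the chance that unrelated terms share a slot.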

`seed`

An integer value. The hashing seed. The default value is 314489979.

`ordered`

TRUE to include the position of each term in the hash. Otherwise, FALSE. The default value is TRUE.

`invertHash`

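An integer that specifies the limit on the number of keys that can be used to generate the slot name. 0 means no invert hashing; -1 means no limit. While a zero value gives better performance, a non-zero value is needed to get meaningful coefficient names.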
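The trade-off can be pictured as keeping a reverse map from each hash slot back to the terms that produced it. The base R sketch below is an illustration under assumptions: `invert_slots` is a hypothetical helper, the slot numbers are assigned by hand, and applying the limit per slot is one reading of the description above, not a documented detail of `ngramHash`.

```r
# Hypothetical helper: build a readable name for each slot from the terms
# that hashed into it, keeping at most `limit` terms per slot.
#   limit =  0 : keep no reverse map (no readable names)
#   limit = -1 : keep every term
invert_slots <- function(terms, slots, limit) {
  if (limit == 0) return(NULL)
  by_slot <- split(terms, slots)
  lapply(by_slot, function(keys) {
    keys <- unique(keys)
    if (limit > 0) keys <- head(keys, limit)
    paste(keys, collapse = "|")
  })
}

terms <- c("love", "hate", "it", "i")
slots <- c(2, 5, 2, 7)   # hand-assigned: "love" and "it" collide in slot 2

print(invert_slots(terms, slots, limit = -1))
print(invert_slots(terms, slots, limit = 1))
```

Storing the reverse map costs memory and time, which is consistent with the note that a zero value performs better.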

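Details

ngramCount defines the arguments for count-based feature extraction. It accepts the following options: ngramLength, skipLength, maxNumTerms and weighting.

ngramHash defines the arguments for hashing-based feature extraction. It accepts the following options: ngramLength, skipLength, hashBits, seed, ordered and invertHash.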

Value

A character string that defines the transform.

Author(s)

Microsoft Corporation Microsoft Technical Support

See also

featurizeText.

Examples


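  # Load the package that provides rxLogisticRegression, featurizeText,
  # ngramHash and ngramCount.
  library(MicrosoftML)

  # Ten short opinions: the first five are positive, the last five negative.
  myData <- data.frame(opinion = c(
      "I love it!",
      "I love it!",
      "Love it!",
      "I love it a lot!",
      "Really love it!",
      "I hate it",
      "I hate it",
      "I hate it.",
      "Hate it",
      "Hate"),
      like = rep(c(TRUE, FALSE), each = 5),
      stringsAsFactors = FALSE)

  # Hash-based extractor with 3 hash bits. invertHash = -1 keeps the
  # original terms so that the coefficient names are meaningful.
  outModel1 <- rxLogisticRegression(like ~ opinionCount, data = myData,
      mlTransforms = list(featurizeText(vars = c(opinionCount = "opinion"),
          wordFeatureExtractor = ngramHash(invertHash = -1, hashBits = 3))))
  summary(outModel1)

  # Count-based extractor: dictionary capped at 5 terms, weighted by
  # term frequency.
  outModel2 <- rxLogisticRegression(like ~ opinionCount, data = myData,
      mlTransforms = list(featurizeText(vars = c(opinionCount = "opinion"),
          wordFeatureExtractor = ngramCount(maxNumTerms = 5, weighting = "tf"))))
  summary(outModel2)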
