ngram: Machine Learning Feature Extractors

2025-01-02

Feature extractors that can be used with mtText.

Usage

  ngramCount(ngramLength = 1, skipLength = 0, maxNumTerms = 1e+07,
    weighting = "tf")

  ngramHash(ngramLength = 1, skipLength = 0, hashBits = 16,
    seed = 314489979, ordered = TRUE, invertHash = 0)

Arguments

`ngramLength`

An integer that specifies the maximum number of tokens to take when constructing an n-gram. The default value is 1.

`skipLength`

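An integer that specifies the maximum number of tokens to skip when constructing an n-gram. If the skip length is k, an n-gram can contain up to k skips, which need not be consecutive. For example, with k=2, the 3-grams extracted from the text "the sky is blue today" that begin with "the" are: "the sky is", "the sky blue", "the sky today", "the is blue", "the is today" and "the blue today". The default value is 0.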
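The skip rule can be illustrated with a small base R sketch. This is not the MicrosoftML implementation: the helper name `skip_grams` is hypothetical, and the sketch assumes that "up to k skips" means the total number of tokens skipped inside a single n-gram is at most k.

```r
# Hypothetical helper (not part of MicrosoftML): enumerate all n-grams of
# length n in which at most k tokens are skipped in total.
skip_grams <- function(tokens, n, k) {
  m <- length(tokens)
  out <- character(0)
  if (m < n) return(out)
  # Every increasing set of n token positions is a candidate n-gram
  idx <- combn(m, n)
  for (j in seq_len(ncol(idx))) {
    pos <- idx[, j]
    # Tokens skipped = width of the covered span minus the tokens kept
    skips <- (pos[n] - pos[1] + 1) - n
    if (skips <= k) {
      out <- c(out, paste(tokens[pos], collapse = " "))
    }
  }
  out
}

tokens <- strsplit("the sky is blue today", " ")[[1]]
grams <- skip_grams(tokens, n = 3, k = 2)
# Keep only the 3-grams that begin with "the"
grams[startsWith(grams, "the ")]
```

Under the stated assumption, the filtered result contains the six 3-grams from the example above. With k = 0 the same helper returns only the contiguous 3-grams.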

`maxNumTerms`

An integer that specifies the maximum number of categories to include in the dictionary. The default value is 10000000.

`weighting`

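A character string that specifies the weighting criteria. The default value is "tf".

"tf": use term frequency.
"idf": use inverse document frequency.
"tfidf": use both term frequency and inverse document frequency.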
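To make the difference between the three criteria concrete, here is a rough base R sketch. It assumes the textbook definitions (idf = log(N / df), tfidf = tf * idf); the exact formulas, smoothing and normalization used by MicrosoftML are not stated in this page and may differ. All variable names are illustrative.

```r
# Three tiny tokenized documents (illustrative data)
docs <- list(
  c("i", "love", "it"),
  c("i", "hate", "it"),
  c("love", "it")
)
vocab <- sort(unique(unlist(docs)))

# "tf": raw count of each term in each document (rows = documents)
tf <- t(vapply(
  docs,
  function(d) as.integer(table(factor(d, levels = vocab))),
  integer(length(vocab))
))
colnames(tf) <- vocab

# "idf": log of (number of documents / documents containing the term)
df <- colSums(tf > 0)
idf <- log(length(docs) / df)

# "tfidf": term frequency scaled column-wise by inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)

tf
idf
tfidf
```

In this sketch a term that occurs in every document ("it") gets an idf of 0, so it contributes nothing under "idf" or "tfidf", while under "tf" it counts like any other term.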

`hashBits`

An integer value. The number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.
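Hashing into b bits yields 2^b feature slots, so a small `hashBits` forces different terms to share slots. The base R sketch below shows this effect with a toy polynomial hash. The function `toy_hash` is hypothetical and is not the hash function MicrosoftML uses; only the relation between `hashBits` and the number of slots carries over.

```r
# Hypothetical toy hash (NOT the MicrosoftML hash): maps a term to one of
# 2^hashBits slots using a simple polynomial rolling scheme.
toy_hash <- function(term, hashBits, seed = 0) {
  size <- 2^hashBits
  h <- seed %% size
  for (cp in utf8ToInt(term)) {
    h <- (h * 31 + cp) %% size
  }
  h
}

terms <- c("i", "love", "hate", "it", "really", "a", "lot", "like", "bad")
slots <- vapply(terms, toy_hash, numeric(1), hashBits = 3)

2^3           # number of available slots with hashBits = 3
table(slots)  # any slot with a count above 1 holds colliding terms
```

With nine terms and only eight slots, at least one collision is unavoidable. Raising `hashBits` reduces collisions at the cost of a wider feature vector.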

`seed`

An integer value. The hashing seed. The default value is 314489979.

`ordered`

TRUE to include the position of each term in the hash; otherwise FALSE. The default value is TRUE.

`invertHash`

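An integer that specifies the limit on the number of keys that can be used to generate a slot name. 0 means no invert hashing; -1 means no limit. A value of 0 gives better performance, but a non-zero value is needed to get meaningful coefficient names. The default value is 0.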

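Details

ngramCount defines the arguments for count-based feature extraction. It accepts the following options: ngramLength, skipLength, maxNumTerms and weighting.

ngramHash defines the arguments for hash-based feature extraction. It accepts the following options: ngramLength, skipLength, hashBits, seed, ordered and invertHash.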

Value

A character string defining the transform.

Author(s)

Microsoft Corporation Microsoft Technical Support

See also

featurizeText.

Examples


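  # Load the package that provides rxLogisticRegression, featurizeText,
  # ngramHash and ngramCount
  library(MicrosoftML)

  # Ten short reviews: the first five are positive, the last five negative
  myData <- data.frame(opinion = c(
      "I love it!",
      "I love it!",
      "Love it!",
      "I love it a lot!",
      "Really love it!",
      "I hate it",
      "I hate it",
      "I hate it.",
      "Hate it",
      "Hate"),
      like = rep(c(TRUE, FALSE), each = 5),
      stringsAsFactors = FALSE)

  # Hash-based features: 3 hash bits, and invertHash = -1 places no limit
  # on the keys used to build slot names
  outModel1 <- rxLogisticRegression(like ~ opinionCount, data = myData,
      mlTransforms = list(featurizeText(vars = c(opinionCount = "opinion"),
          wordFeatureExtractor = ngramHash(invertHash = -1, hashBits = 3))))
  summary(outModel1)

  # Count-based features: dictionary capped at 5 terms, term-frequency weights
  outModel2 <- rxLogisticRegression(like ~ opinionCount, data = myData,
      mlTransforms = list(featurizeText(vars = c(opinionCount = "opinion"),
          wordFeatureExtractor = ngramCount(maxNumTerms = 5, weighting = "tf"))))
  summary(outModel2)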
