mutualInformation: Feature Selection Mutual Information Mode

Article
05/23/2023

Mutual information mode of feature selection, used in the feature selection transform selectFeatures.

Usage

  mutualInformation(numFeaturesToKeep = 1000, numBins = 256, ...)

Arguments

`numFeaturesToKeep`

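The number of features to keep. If it is set to n, the transform selects the n features that have the highest mutual information with the dependent variable. The default value is 1000.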

`numBins`

Maximum number of bins for numerical values. Powers of 2 are recommended. The default value is 256.
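Mutual information is estimated from counts, so numeric values have to be discretized first. The binning rule used by the compute engine is not described on this page; the following is only a rough sketch that assumes equal-width bins, using base R's `cut`. The helper name `bin_numeric` is hypothetical and not part of MicrosoftML.

```r
# Hypothetical helper (not a MicrosoftML function): map numeric values to
# integer bin indices. ASSUMPTION: equal-width bins over the observed range.
bin_numeric <- function(x, num_bins = 256) {
  # With a single number as `breaks`, cut() splits the range of x into
  # that many equal-width intervals; labels = FALSE returns integer codes.
  cut(x, breaks = num_bins, labels = FALSE, include.lowest = TRUE)
}

set.seed(1)
values <- runif(1000)                      # toy numeric feature
bins <- bin_numeric(values, num_bins = 8)  # a power of 2, as recommended above
table(bins)                                # number of values per bin
```

Fewer bins give coarser but more stable count estimates; more bins preserve detail at the cost of sparser counts per bin.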

`...`

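Additional arguments to be passed directly to the Microsoft Compute Engine.

Details

The mutual information of two random variables X and Y is a measure of the mutual dependence between the variables. Formally, the mutual information can be written as: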

I(X;Y) = E[log(p(x,y)) - log(p(x)) - log(p(y))]

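where the expectation is taken over the joint distribution of X and Y. Here p(x,y) is the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y, respectively. In general, the higher the mutual information between the dependent variable (or label) and an independent variable (or feature), the more strongly the label depends on that feature.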
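For discrete variables, the expectation above becomes a sum over the cells of the joint distribution, which can be estimated from a contingency table. The following base R sketch illustrates the formula only; it is not the MicrosoftML implementation, and the function name `mutual_info` is hypothetical. It assumes both vectors have the same length and no missing values.

```r
# Hypothetical helper: empirical mutual information (in nats) between two
# discrete vectors, following I(X;Y) = E[log p(x,y) - log p(x) - log p(y)].
mutual_info <- function(x, y) {
  counts <- table(x, y)
  joint <- counts / sum(counts)   # empirical joint distribution p(x, y)
  px <- rowSums(joint)            # marginal p(x)
  py <- colSums(joint)            # marginal p(y)
  indep <- outer(px, py)          # p(x) * p(y) for every cell
  nz <- joint > 0                 # skip empty cells (0 * log 0 is taken as 0)
  sum(joint[nz] * (log(joint[nz]) - log(indep[nz])))
}

x <- c(0, 0, 1, 1)
mi_dependent   <- mutual_info(x, x)              # y is fully determined by x
mi_independent <- mutual_info(x, c(0, 1, 0, 1))  # y is independent of x
```

In this toy case the fully dependent pair yields log(2), the entropy of a balanced binary variable, and the independent pair yields 0.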

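The mutual information feature selection mode selects features based on their mutual information with the label. It keeps the top numFeaturesToKeep features with the largest mutual information.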
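The selection rule described above can be sketched in base R as "score every feature, then keep the k highest". This is an illustration under stated assumptions (discrete features, empirical counts, no missing values), not the code MicrosoftML runs; `top_k_by_mi` and its arguments are hypothetical names.

```r
# Hypothetical helper: names of the k columns of `features` that have the
# largest empirical mutual information with `label`.
top_k_by_mi <- function(features, label, k) {
  mi <- function(x, y) {
    counts <- table(x, y)
    joint <- counts / sum(counts)              # empirical p(x, y)
    indep <- outer(rowSums(joint), colSums(joint))  # p(x) * p(y)
    nz <- joint > 0                            # skip empty cells
    sum(joint[nz] * (log(joint[nz]) - log(indep[nz])))
  }
  scores <- vapply(features, mi, numeric(1), y = label)
  ranked <- names(sort(scores, decreasing = TRUE))
  ranked[seq_len(min(k, length(ranked)))]
}

features <- data.frame(
  copy  = c(0, 0, 1, 1, 0, 1),  # identical to the label
  noise = c(0, 1, 0, 1, 0, 1),  # only weakly related to the label
  const = c(1, 1, 1, 1, 1, 1)   # constant, carries no information
)
label <- c(0, 0, 1, 1, 0, 1)

kept <- top_k_by_mi(features, label, k = 1)
```

Here `k` plays the role of numFeaturesToKeep; a constant column always scores 0 and is dropped first.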

Value

A character string defining the mode.

Author(s)

Microsoft Corporation Microsoft Technical Support

References

Wikipedia: Mutual Information

See also

minCount, selectFeatures

Examples


 trainReviews <- data.frame(review = c( 
         "This is great",
         "I hate it",
         "Love it",
         "Do not like it",
         "Really like it",
         "I hate it",
         "I like it a lot",
         "I kind of hate it",
         "I do like it",
         "I really hate it",
         "It is very good",
         "I hate it a bunch",
         "I love it a bunch",
         "I hate it",
         "I like it very much",
         "I hate it very much.",
         "I really do love it",
         "I really do hate it",
         "Love it!",
         "Hate it!",
         "I love it",
         "I hate it",
         "I love it",
         "I hate it",
         "I love it"),
      like = c(TRUE, FALSE, TRUE, FALSE, TRUE,
         FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
         FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, 
         FALSE, TRUE, FALSE, TRUE), stringsAsFactors = FALSE
 )

 testReviews <- data.frame(review = c(
         "This is great",
         "I hate it",
         "Love it",
         "Really like it",
         "I hate it",
         "I like it a lot",
         "I love it",
         "I do like it",
         "I really hate it",
         "I love it"), stringsAsFactors = FALSE)

 # Use a categorical hash transform, which generates 128 features (hashBits = 7, so 2^7).
 outModel1 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7)))
 summary(outModel1)

 # Apply a categorical hash transform and a count feature selection transform
 # which selects only those hash features that have a value.
 outModel2 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(
   categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7), 
   selectFeatures("reviewCatHash", mode = minCount())))
 summary(outModel2)

 # Apply a categorical hash transform and a count feature selection transform
 # which selects only those features appearing with a count of at least 5.
 outModel3 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(
   categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7), 
   selectFeatures("reviewCatHash", mode = minCount(count = 5))))
 summary(outModel3)