mutualInformation: Feature Selection Mutual Information Mode

Article
07/04/2024

Mutual information mode of feature selection, used in the feature selection transform selectFeatures.

Usage

  mutualInformation(numFeaturesToKeep = 1000, numBins = 256, ...)

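Arguments

`numFeaturesToKeep`

The number of features to keep. If this is set to n, the transform selects the n features that have the highest mutual information with the dependent variable. The default value is 1000.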

`numBins`

The maximum number of bins for numeric values. Powers of 2 are recommended. The default value is 256.

`...`

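Additional arguments to be passed directly to the Microsoft Compute Engine.

Details

The mutual information of two random variables X and Y is a measure of the mutual dependence between the variables. Formally, the mutual information can be written as: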

I(X;Y) = E[log(p(x,y)) - log(p(x)) - log(p(y))]

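where the expectation is taken over the joint distribution of X and Y. Here p(x,y) is the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y, respectively. In general, a higher mutual information between the dependent variable (or label) and an independent variable (or feature) means that the label has a stronger mutual dependence on that feature.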
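When X and Y are discrete, for example after numeric values have been grouped into bins, the expectation above reduces to a finite sum. The form below is the standard textbook definition, with 0 log 0 taken to be 0. It is given here only as a reading aid; this page does not state which estimator the engine actually uses.

```latex
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\bigl[\log p(x,y) - \log p(x) - \log p(y)\bigr]
       = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
```

If X and Y are independent, p(x,y) = p(x)p(y) for every pair, each logarithm is zero, and I(X;Y) = 0.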

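The mutual information feature selection mode selects features based on their mutual information with the label. It keeps the top numFeaturesToKeep features with the largest mutual information with the label.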
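The selection rule described above can be sketched in base R. This is a minimal illustration, not the engine's implementation: the helper names mutualInfo and binFeature are hypothetical, equal-width binning via cut() is an assumption, and the synthetic data is made up for the demonstration.

```r
# Empirical mutual information (in nats) between two discrete vectors.
mutualInfo <- function(x, y) {
    joint <- table(x, y) / length(x)   # joint probabilities p(x, y)
    px <- rowSums(joint)               # marginal p(x)
    py <- colSums(joint)               # marginal p(y)
    expected <- outer(px, py)          # p(x) * p(y) for every cell
    nz <- joint > 0                    # skip empty cells (0 * log 0 = 0)
    sum(joint[nz] * log(joint[nz] / expected[nz]))
}

# Group a numeric feature into numBins equal-width bins (assumed binning scheme).
binFeature <- function(v, numBins = 256) {
    cut(v, breaks = numBins, labels = FALSE)
}

# Synthetic data: one feature that tracks the label, one that is pure noise.
set.seed(1)
n <- 200
label <- rep(c(TRUE, FALSE), each = n / 2)
features <- data.frame(
    informative = ifelse(label, 1, 0) + rnorm(n, sd = 0.1),
    noise       = rnorm(n))

# Score every feature against the label, then keep the top numFeaturesToKeep.
scores <- sapply(features, function(v) mutualInfo(binFeature(v, numBins = 8), label))
numFeaturesToKeep <- 1
kept <- names(sort(scores, decreasing = TRUE))[seq_len(numFeaturesToKeep)]
print(kept)
```

With finite samples the empirical score of a pure-noise feature is slightly above zero, and the bias grows with the number of bins, so a smaller numBins can be a reasonable choice for small data sets.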

Value

A character string that defines the mode.

Author(s)

Microsoft Corporation Microsoft Technical Support

References

Wikipedia: Mutual Information

See also

minCount, selectFeatures

Examples


 trainReviews <- data.frame(review = c( 
         "This is great",
         "I hate it",
         "Love it",
         "Do not like it",
         "Really like it",
         "I hate it",
         "I like it a lot",
         "I kind of hate it",
         "I do like it",
         "I really hate it",
         "It is very good",
         "I hate it a bunch",
         "I love it a bunch",
         "I hate it",
         "I like it very much",
         "I hate it very much.",
         "I really do love it",
         "I really do hate it",
         "Love it!",
         "Hate it!",
         "I love it",
         "I hate it",
         "I love it",
         "I hate it",
         "I love it"),
      like = c(TRUE, FALSE, TRUE, FALSE, TRUE,
         FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
         FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, 
         FALSE, TRUE, FALSE, TRUE), stringsAsFactors = FALSE
     )

     testReviews <- data.frame(review = c(
         "This is great",
         "I hate it",
         "Love it",
         "Really like it",
         "I hate it",
         "I like it a lot",
         "I love it",
         "I do like it",
         "I really hate it",
         "I love it"), stringsAsFactors = FALSE)

 # Use a categorical hash transform, which generates 128 features (hashBits = 7).
 outModel1 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7)))
 summary(outModel1)

 # Apply a categorical hash transform and a count feature selection transform
 # which selects only those hash features that have a value.
 outModel2 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(
   categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7), 
   selectFeatures("reviewCatHash", mode = minCount())))
 summary(outModel2)

 # Apply a categorical hash transform and a count feature selection transform
 # which selects those features appearing with at least a count of 5.
 outModel3 <- rxLogisticRegression(like~reviewCatHash, data = trainReviews, l1Weight = 0, 
     mlTransforms = list(
   categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7), 
   selectFeatures("reviewCatHash", mode = minCount(count = 5))))
 summary(outModel3)
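The models above all use the minCount mode. A variant that uses mutualInformation might look like the sketch below. It assumes that trainReviews from the example above is already defined, and that selectFeatures accepts a formula naming the label and the feature column when the mode is mutualInformation; check that against the selectFeatures reference before relying on it. The name outModelMI and the value numFeaturesToKeep = 10 are illustrative choices.

```r
library(MicrosoftML)  # provides rxLogisticRegression, categoricalHash, selectFeatures

# Assumes trainReviews from the example above is already defined.
# Hash the review text, then keep only the hash features with the
# largest mutual information with the label 'like'.
outModelMI <- rxLogisticRegression(like ~ reviewCatHash, data = trainReviews, l1Weight = 0,
    mlTransforms = list(
        categoricalHash(vars = c(reviewCatHash = "review"), hashBits = 7),
        # Formula form (assumed): label on the left, feature column on the right.
        selectFeatures(like ~ reviewCatHash,
                       mode = mutualInformation(numFeaturesToKeep = 10))))
summary(outModelMI)
```

Unlike minCount, this mode needs to know the label, which is why the sketch passes a formula rather than a bare column name.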
