Interpret the results

What do our vectors mean? Put another way, what kinds of foods populate the different clusters we've discovered among the data?

To see the results, we'll create a pandas Series for each component, index it by feature name, and sort it in decreasing order. A large positive number means the feature is strongly positively correlated with that vector. A negative number means the feature is negatively correlated with it, and a value near zero means there is little correlation either way.

First, run this code:

vects = fit.components_[:5]

Next, run this code:

c1 = pd.Series(vects[0], index=nutr_df.columns)
c1.sort_values(ascending=False)

The output is:

Protein_(g)          0.253011
Selenium_(µg)       0.237214
Zinc_(mg)            0.233275
Choline_Tot_ (mg)    0.227019
Phosphorus_(mg)      0.224003
Niacin_(mg)          0.212308
Riboflavin_(mg)      0.206798
Panto_Acid_mg)       0.205353
Cholestrl_(mg)       0.202130
FA_Mono_(g)          0.199087
Lipid_Tot_(g)        0.197132
Vit_B12_(µg)        0.196320
Vit_B6_(mg)          0.193737
FA_Sat_(g)           0.192418
Iron_(mg)            0.162307
FA_Poly_(g)          0.161677
Energ_Kcal           0.159268
Ash_(g)              0.146991
Magnesium_(mg)       0.143715
Potassium_(mg)       0.142175
Thiamin_(mg)         0.138838
Vit_D_µg            0.138423
Retinol_(µg)        0.115504
Sodium_(mg)          0.104025
Copper_mg)           0.095703
Calcium_(mg)         0.058003
Vit_E_(mg)           0.040701
Food_Folate_(µg)    0.017723
Folate_Tot_(µg)     0.016790
Folic_Acid_(µg)     0.001449
Manganese_(mg)      -0.035053
Vit_K_(µg)         -0.035386
Vit_A_IU            -0.051647
Lycopene_(µg)      -0.054287
Beta_Crypt_(µg)    -0.084480
Alpha_Carot_(µg)   -0.093607
Fiber_TD_(g)        -0.098041
Water_(g)           -0.107941
Carbohydrt_(g)      -0.122976
Lut+Zea_ (µg)      -0.123936
Beta_Carot_(µg)    -0.144259
Sugar_Tot_(g)       -0.145755
Vit_C_(mg)          -0.160070
dtype: float64

Our first cluster is defined by foods that are high in protein and minerals, like selenium and zinc, while also being low in sugars and vitamin C. Even to a nonspecialist, these foods appear to be meat, poultry, or legumes.
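To read a component quickly, it can help to look only at its strongest loadings at both ends. The following is a minimal sketch under stated assumptions: the helper name `strongest_loadings` is hypothetical, and the sample Series holds only a handful of values rounded from the output above rather than the full `c1`.

```python
import pandas as pd


def strongest_loadings(component, n=3):
    """Return the n most positive and the n most negative loadings."""
    ordered = component.sort_values(ascending=False)
    return ordered.head(n), ordered.tail(n)


# Illustrative subset (assumption): a few loadings rounded from the c1 output.
example = pd.Series(
    {
        "Protein_(g)": 0.253,
        "Zinc_(mg)": 0.233,
        "Calcium_(mg)": 0.058,
        "Water_(g)": -0.108,
        "Sugar_Tot_(g)": -0.146,
        "Vit_C_(mg)": -0.160,
    }
)

positive, negative = strongest_loadings(example, n=2)
print(positive)  # features that rise with the component
print(negative)  # features that fall as the component rises
```

With the real data, the same helper could be called as `strongest_loadings(c1)`.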

Tip

Takeaway

Particularly with interpretation, subject matter expertise can prove essential to producing high-quality analysis. For this reason, you should try to include subject matter experts (SMEs) in your data science projects.

Then, run this code:

c2 = pd.Series(vects[1], index=nutr_df.columns)
c2.sort_values(ascending=False)

The output is:

Manganese_(mg)       0.298009
Fiber_TD_(g)         0.291384
Folate_Tot_(µg)     0.272273
Carbohydrt_(g)       0.257291
Food_Folate_(µg)    0.241234
Copper_mg)           0.225446
Magnesium_(mg)       0.213403
Calcium_(mg)         0.199649
Lut+Zea_ (µg)       0.194307
Sugar_Tot_(g)        0.183276
Ash_(g)              0.181539
Vit_E_(mg)           0.178778
Vit_K_(µg)          0.178267
Iron_(mg)            0.175137
Folic_Acid_(µg)     0.161876
Thiamin_(mg)         0.147750
Beta_Carot_(µg)     0.144678
Energ_Kcal           0.137560
FA_Poly_(g)          0.126692
Potassium_(mg)       0.125701
Vit_C_(mg)           0.101272
Alpha_Carot_(µg)    0.089881
Sodium_(mg)          0.084379
Phosphorus_(mg)      0.083271
Beta_Crypt_(µg)     0.075684
Riboflavin_(mg)      0.072510
Vit_A_IU             0.064685
Lycopene_(µg)       0.053394
Lipid_Tot_(g)        0.053251
Panto_Acid_mg)       0.033587
FA_Mono_(g)          0.027734
FA_Sat_(g)           0.010743
Niacin_(mg)          0.000861
Zinc_(mg)           -0.008836
Selenium_(µg)      -0.022615
Choline_Tot_ (mg)   -0.029344
Protein_(g)         -0.031398
Vit_B6_(mg)         -0.037995
Retinol_(µg)       -0.067326
Vit_D_µg           -0.132349
Vit_B12_(µg)       -0.158239
Cholestrl_(mg)      -0.162787
Water_(g)           -0.200880
dtype: float64

Our second group is foods that are high in manganese, fiber, and folate, and low in water and cholesterol.

Try it yourself

Find the sorted output for $c_{3}$, $c_{4}$, and $c_{5}$.

Hint: Python uses zero-based indexing, so $c_{1}$ came from vects[0] and $c_{3}$ comes from vects[2].

Here's a possible solution:

c3 = pd.Series(vects[2], index=nutr_df.columns)
c3.sort_values(ascending=False)
c4 = pd.Series(vects[3], index=nutr_df.columns)
c4.sort_values(ascending=False)
c5 = pd.Series(vects[4], index=nutr_df.columns)
c5.sort_values(ascending=False)
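If all of these lines run in a single notebook cell, only the last expression is displayed. One possible alternative is to collect the components in a single DataFrame and print each sorted column. This is a sketch, not the tutorial's own method: the random array and the made-up feature names below are stand-ins for `vects` and `nutr_df.columns`, and the name `loadings` is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-ins for the tutorial's objects (assumptions): a 5 x 4 array in place of
# `vects` and four made-up feature names in place of `nutr_df.columns`.
rng = np.random.default_rng(0)
vects = rng.normal(size=(5, 4))
columns = ["feat_a", "feat_b", "feat_c", "feat_d"]

# One column per component; vects[0] becomes 'c1' because of zero-based indexing.
loadings = pd.DataFrame(
    {f"c{i + 1}": vects[i] for i in range(len(vects))},
    index=columns,
)

# Sort each component separately and print it so that every result is shown.
for name in loadings.columns:
    print(loadings[name].sort_values(ascending=False), end="\n\n")
```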

Even without subject matter expertise, is it possible to get a more accurate sense of the kinds of foods that each component defines? Yes! For this reason, we merged the FoodGroup column back into pca_df. We'll sort that DataFrame by each component and count the values from FoodGroup for the first 500 rows. Because sort_values sorts in ascending order by default, these are the 500 foods with the lowest scores for that component:

pca_df.sort_values(by='c1')['FoodGroup'][:500].value_counts()

The output is:

Vegetables and Vegetable Products    189
Fruits and Fruit Juices              110
Beverages                             70
Sweets                                41
Soups, Sauces, and Gravies            34
Baby Foods                            31
Fats and Oils                          8
Spices and Herbs                       5
Dairy and Egg Products                 4
Breakfast Cereals                      3
Cereal Grains and Pasta                2
Baked Products                         1
Nut and Seed Products                  1
Restaurant Foods                       1
Name: FoodGroup, dtype: int64

We can do the same thing for $c_{2}$.

pca_df.sort_values(by='c2')['FoodGroup'][:500].value_counts()

The output is:

Dairy and Egg Products         22
Fats and Oils                   5
Baby Foods                      2
Sausages and Luncheon Meats     1
Soups, Sauces, and Gravies      1
Poultry Products                1
Name: FoodGroup, dtype: int64
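To look at the opposite end of a component, the sort direction can be reversed with `ascending=False`. The sketch below contrasts the two directions on a tiny, made-up DataFrame; `toy_df`, its scores, and its group labels are assumptions for illustration, not rows from `pca_df`.

```python
import pandas as pd

# Toy stand-in for pca_df (assumption): one component score and a food group per row.
toy_df = pd.DataFrame(
    {
        "c1": [2.1, 1.7, 0.2, -0.4, -1.5, -1.9],
        "FoodGroup": ["Beef", "Poultry", "Dairy", "Sweets", "Fruits", "Fruits"],
    }
)

# Default ascending sort: the first rows have the lowest c1 scores.
lowest = toy_df.sort_values(by="c1")["FoodGroup"][:2].value_counts()

# Descending sort: the first rows have the highest c1 scores.
highest = toy_df.sort_values(by="c1", ascending=False)["FoodGroup"][:2].value_counts()

print(lowest)
print(highest)
```

Comparing both ends shows which food groups a component pushes toward and which it pushes away from.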

Try it yourself

Repeat this process for $c_{3}$, $c_{4}$, and $c_{5}$.

Here are the solutions:

```python
pca_df.sort_values(by='c3')['FoodGroup'][:500].value_counts()
```
```python
pca_df.sort_values(by='c4')['FoodGroup'][:500].value_counts()
```
```python
pca_df.sort_values(by='c5')['FoodGroup'][:500].value_counts()
```

Note

The Baby Foods category and some other categories might seem to dominate the output. This is a result of all the rows we had to drop because they had NaN values. If we look at the value counts for the whole FoodGroup column, we'll see that the categories aren't evenly distributed. Some are far more represented than others.

df['FoodGroup'].value_counts()

The output is:

Beef Products                        345
Vegetables and Vegetable Products    259
Baked Products                       217
Pork Products                        165
Dairy and Egg Products               132
Poultry Products                     128
Fruits and Fruit Juices              122
Sweets                               108
Finfish and Shellfish Products        94
Beverages                             87
Soups, Sauces, and Gravies            68
Baby Foods                            64
Lamb, Veal, and Game Products         64
Legumes and Legume Products           57
Nut and Seed Products                 49
Snacks                                44
Fats and Oils                         43
Sausages and Luncheon Meats           40
Fast Foods                            24
Spices and Herbs                      22
Meals, Entrees, and Side Dishes       21
Breakfast Cereals                     20
Restaurant Foods                      10
Cereal Grains and Pasta                7
Name: FoodGroup, dtype: int64
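Because the groups differ so much in size, raw counts favor the large ones. One way to account for this, offered here as a sketch rather than as part of the original analysis, is to divide each group's count in the slice by that group's total count. The DataFrame `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for pca_df (assumption): scores for one component plus a food group.
toy_df = pd.DataFrame(
    {
        "c1": [-2.0, -1.8, -1.5, -1.1, 0.3, 0.9, 1.4, 2.2],
        "FoodGroup": ["Veg", "Veg", "Fruit", "Beef", "Beef", "Beef", "Beef", "Beef"],
    }
)

# Rows per group in the whole DataFrame.
totals = toy_df["FoodGroup"].value_counts()

# Rows per group among the four lowest-scoring rows.
in_slice = toy_df.sort_values(by="c1")["FoodGroup"][:4].value_counts()

# Share of each group that lands in the slice; groups absent from the slice get 0.
share = (in_slice / totals).fillna(0).sort_values(ascending=False)
print(share)
```

A share close to 1 means that almost all of a group's foods fall in the slice, regardless of how large the group is.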