Understand PCA and import the dataset

Principal component analysis (PCA) is an algorithm that helps us get a dataset into working condition by reducing the number of dimensions that must be considered in an analysis of the dataset.

PCA in theory

One way to reduce the number of dimensions we have to work with is to reduce the number of features considered in an analysis. PCA provides another way: reducing the number of dimensions by projecting our feature space into a lower-dimensional space. We can do this because, in most real-world problems, data points aren't spread uniformly across all dimensions. Some features might be nearly constant, while others are highly correlated; highly correlated data points lie close to a lower-dimensional subspace.

In the following image, the data points aren't spread across the entire plane, but are clumped roughly in an oval. Because the cluster is roughly elliptical, it can be described mathematically by two values: its major (long) axis and its minor (short) axis. These axes form the principal components of the cluster.

Screenshot of two columns of data plots. The second column shows three dimensions of data that appear to overlap in the data plot in the first column.

We can construct a whole new feature space around this cluster that's defined by two eigenvectors: $c_{1}$ and $c_{2}$. Eigenvectors are the vectors that define the linear transformation to this new feature space.

Better still, we don't have to consider all of the dimensions of this new space. Intuitively, we can see that most of the points lie on or close to the line that runs through $c_{1}$. If we project the cluster down from two dimensions to that single dimension, we capture most of the information about this dataset while simplifying our analysis. This ability to extract most of the information from a dataset by considering only a fraction of its defining eigenvectors forms the heart of PCA.
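To make this idea concrete, here's a minimal sketch (using synthetic data, not the dataset from this module) of fitting PCA to a stretched two-dimensional cluster and projecting it onto its first principal component with scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: a 2-D cluster stretched along one axis,
# standing in for the elliptical cluster described above.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0],
                                               [0.0, 0.5]])

# Fit PCA to find the principal components (the eigenvectors c1 and c2).
pca = PCA(n_components=2)
pca.fit(points)

# Fraction of the total variance captured by each component;
# the first component dominates because the cluster is elongated.
print(pca.explained_variance_ratio_)

# Project down to the single dimension defined by c1.
projected = PCA(n_components=1).fit_transform(points)
print(projected.shape)  # (500, 1)
```

Because the cluster is much longer along one axis than the other, the first component alone accounts for the vast majority of the variance, so the one-dimensional projection preserves most of the information in the data.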

Import modules and dataset

You must first clean and prepare the data to conduct PCA on it, so pandas will be essential. You also need NumPy, a bit of scikit-learn, and pyplot.

To add these libraries, run this code:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline

The dataset we'll use here is the same one drawn from the U.S. Department of Agriculture National Nutrient Database for Standard Reference that you prepared in the preceding module.

Remember to set the encoding to latin1 (for µg):

df = pd.read_csv('Data/USDA-nndb-combined.csv', encoding='latin1')

We can check the number of columns and rows by using the info() method for the DataFrame:

df.info()

The output is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8989 entries, 0 to 8988
Data columns (total 54 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   NDB_No             8989 non-null   int64  
 1   FoodGroup          8618 non-null   object 
 2   Shrt_Desc          8790 non-null   object 
 3   Water_(g)          8789 non-null   float64
 4   Energ_Kcal         8790 non-null   float64
 5   Protein_(g)        8790 non-null   float64
 6   Lipid_Tot_(g)      8790 non-null   float64
 7   Ash_(g)            8465 non-null   float64
 8   Carbohydrt_(g)     8790 non-null   float64
 9   Fiber_TD_(g)       8196 non-null   float64
 10  Sugar_Tot_(g)      6958 non-null   float64
 11  Calcium_(mg)       8442 non-null   float64
 12  Iron_(mg)          8646 non-null   float64
 13  Magnesium_(mg)     8051 non-null   float64
 14  Phosphorus_(mg)    8211 non-null   float64
 15  Potassium_(mg)     8364 non-null   float64
 16  Sodium_(mg)        8707 non-null   float64
 17  Zinc_(mg)          8084 non-null   float64
 18  Copper_mg)         7533 non-null   float64
 19  Manganese_(mg)     6630 non-null   float64
 20  Selenium_(µg)      7090 non-null   float64
 21  Vit_C_(mg)         7972 non-null   float64
 22  Thiamin_(mg)       8156 non-null   float64
 23  Riboflavin_(mg)    8174 non-null   float64
 24  Niacin_(mg)        8153 non-null   float64
 25  Panto_Acid_mg)     6548 non-null   float64
 26  Vit_B6_(mg)        7885 non-null   float64
 27  Folate_Tot_(µg)    7529 non-null   float64
 28  Folic_Acid_(µg)    6751 non-null   float64
 29  Food_Folate_(µg)   7022 non-null   float64
 30  Folate_DFE_(µg)    6733 non-null   float64
 31  Choline_Tot_ (mg)  4774 non-null   float64
 32  Vit_B12_(µg)       7597 non-null   float64
 33  Vit_A_IU           8079 non-null   float64
 34  Vit_A_RAE          7255 non-null   float64
 35  Retinol_(µg)       6984 non-null   float64
 36  Alpha_Carot_(µg)   5532 non-null   float64
 37  Beta_Carot_(µg)    5628 non-null   float64
 38  Beta_Crypt_(µg)    5520 non-null   float64
 39  Lycopene_(µg)      5498 non-null   float64
 40  Lut+Zea_ (µg)      5475 non-null   float64
 41  Vit_E_(mg)         5901 non-null   float64
 42  Vit_D_µg           5528 non-null   float64
 43  Vit_D_IU           5579 non-null   float64
 44  Vit_K_(µg)         5227 non-null   float64
 45  FA_Sat_(g)         8441 non-null   float64
 46  FA_Mono_(g)        8124 non-null   float64
 47  FA_Poly_(g)        8125 non-null   float64
 48  Cholestrl_(mg)     8380 non-null   float64
 49  GmWt_1             8490 non-null   float64
 50  GmWt_Desc1         8491 non-null   object 
 51  GmWt_2             4825 non-null   float64
 52  GmWt_Desc2         4825 non-null   object 
 53  Refuse_Pct         8740 non-null   float64
dtypes: float64(49), int64(1), object(4)
memory usage: 3.7+ MB
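Notice that most columns have fewer than 8989 non-null values; those missing entries have to be handled before PCA can run. As a quick sketch (on a small hypothetical frame, not the USDA data), `isna().sum()` tallies the missing values per column:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the USDA dataset.
df_demo = pd.DataFrame({
    'Water_(g)':  [89.3, np.nan, 15.9],
    'Energ_Kcal': [38.0, 717.0, np.nan],
})

# Count missing values per column; these must be dropped or
# imputed before PCA, which can't handle NaNs.
print(df_demo.isna().sum())
```

Running the same call on the full DataFrame gives a per-column inventory of the gaps you'll need to clean up in the next steps.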

Try it yourself

Can you think of a more concise way to check the number of rows and columns in a DataFrame?

Use one of the attributes of the DataFrame.

Here's a possible solution:

df.shape

The output is:

(8989, 54)