Introduction

One of the laws of data science is the "curse of dimensionality." As the number of considered features (dimensions) of a feature space increases, the number of data configurations can grow exponentially. The number of observations (data points) needed to account for these configurations must also increase. Because this fact of life has huge ramifications for the time, computational effort, and memory required, it's often desirable to reduce the number of dimensions we have to work with.
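To make the exponential growth concrete, here is a minimal sketch. It assumes each feature is split into the same number of equal-width bins (10 is an arbitrary choice for illustration) and counts how many distinct cells the feature space then contains. The helper name `grid_cells` is hypothetical and not part of the module's notebooks.

```python
def grid_cells(bins_per_feature: int, n_features: int) -> int:
    """Count the cells in a grid where every feature is split into equal bins."""
    return bins_per_feature ** n_features


# Assumption for illustration only: 10 bins per feature.
for n_features in (1, 2, 3, 5, 10):
    cells = grid_cells(10, n_features)
    print(f"{n_features:>2} features -> {cells:,} cells to cover")
```

Placing even one observation in every cell would require as many data points as there are cells, which is why each added feature raises the amount of data needed so quickly.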

One way to reduce the number of dimensions is to reduce the number of features considered in an analysis. After all, some features yield more insight than others for a specific analysis. Although this type of feature engineering is necessary in any data science project, we can take it only so far, because considering more features often increases the accuracy of a classifier. (For example, consider how many features might increase the accuracy of classifying images as cats or dogs.)

Learning objectives

In this module, you'll:

  • Learn the terms principal component analysis (PCA) and eigenvector and understand their functions in machine learning.
  • Learn about PCA theory, and then apply PCA to a food composition dataset.
  • Check for correlation, and then normalize and center the data.

Set up your environment

To learn most effectively throughout this module, we recommend that you set up your environment so you can follow along.

Complete these steps to set up your environment:

  1. Download and install Visual Studio Code. It's free and works on Windows, macOS, and Linux. Select the stable build for your platform.
  2. Download and install the Python extension for Visual Studio Code. You'll first need to install a supported version of Python.
  3. Activate the Anaconda environment to be able to run Jupyter notebooks.
  4. Set up a Data Science environment to be able to use NumPy and Pandas.

Test your environment

If you have successfully set up your environment with VS Code, Python, Anaconda, and the NumPy and Pandas libraries, you should be able to run a Jupyter notebook inside of VS Code.

  1. Clone the Reactor repository and open the folder that corresponds to this module in VS Code.
  2. Run the Test-Setup-Config.ipynb file to ensure you're ready to continue through the module.
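If you'd like a quick sanity check in addition to the notebook, a sketch like the following can be run in any code cell. It is not the contents of Test-Setup-Config.ipynb. It only checks, under the assumption that NumPy and Pandas are installed under their usual import names `numpy` and `pandas`, whether the active Python environment can find them. The helper name `check_packages` is hypothetical.

```python
import importlib.util
import sys


def check_packages(names):
    """Map each package name to True if the active environment can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report the interpreter version and whether each library is available.
print("Python", sys.version.split()[0])
for name, found in check_packages(["numpy", "pandas"]).items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If either library is reported as missing, the notebook is probably running against a different Python environment than the one you set up, so check which kernel is selected in VS Code.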

Working through this learn module

As you're working through this module, you'll be encouraged to try out code. Use the files you cloned to do this.

Jupyter Notebooks are divided into cells. Each cell contains either text written in the Markdown markup language or a space in which to write and execute computer code. Because all the code resides inside code cells, you can run each code cell inline rather than using a separate Python interactive window.

Note

This learn module is designed to have you run code cells one by one. As you work through it, copy each code snippet into your VS Code Jupyter notebook and run the cells one at a time.