Introduction
Machine learning gets its predictive power from the data that shapes it. To build effective models, you must understand the data you use.
Here, we explore how both humans and computers categorize, store, and interpret data. We examine what makes a good dataset and how to fix issues in the data we have. We also practice exploring new data, and we see how thinking carefully about a dataset can help us build better predictive models.
Scenario: the last voyage of the Titanic
As an eager marine archaeologist, you have an unusually keen interest in maritime disasters. Late one night, while clicking between images of whale bones and ancient scrolls about Atlantis, you find a public dataset that lists the known passengers and crew of the first, and last, voyage of the Titanic. Drawn in by the balance between fate and chance, you wonder: what factors determined whether a Titanic passenger survived? Records from this period are incomplete, and for some passengers much of the information is unknown. You must find ways to patch up this data before you can fully analyze it.
Prerequisites
- Some familiarity with machine learning concepts (such as models and cost) helps, but it's not required.
Learning objectives
In this module, you will:
- Visualize large datasets with Exploratory Data Analysis (EDA).
- Find and fix errors in a dataset.
- Predict unknown values with numeric and categorical data.
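
To give a rough feel for what patching up incomplete data can look like, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the rows are made up and are not real passenger records, the column names `age`, `embarked`, and `survived` are hypothetical, and the helpers `count_missing` and `fill_missing` are our own. Filling gaps with the median or the most common value is a simple stand-in technique, not necessarily the approach this module uses later.

```python
from statistics import median, mode

# Made-up rows for illustration only; these are NOT real passenger records.
# None marks a value that is unknown.
passengers = [
    {"age": 22.0, "embarked": "S", "survived": 0},
    {"age": None, "embarked": "C", "survived": 1},
    {"age": 35.0, "embarked": None, "survived": 1},
    {"age": 54.0, "embarked": "S", "survived": 0},
]


def count_missing(rows):
    """Return how many values are missing (None) in each column."""
    columns = rows[0].keys()
    return {col: sum(row[col] is None for row in rows) for col in columns}


def fill_missing(rows, column, strategy):
    """Return copies of the rows with None in `column` replaced by
    strategy(known values); the original rows are left unchanged."""
    known = [row[column] for row in rows if row[column] is not None]
    fill_value = strategy(known)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


# Step 1: explore. How much of each column is unknown?
missing_before = count_missing(passengers)

# Step 2: patch. Numeric column gets the median of the known values,
# categorical column gets the most common known value.
cleaned = fill_missing(passengers, "age", median)
cleaned = fill_missing(cleaned, "embarked", mode)

missing_after = count_missing(cleaned)
print("missing before:", missing_before)
print("missing after: ", missing_after)
```

Numeric and categorical columns are handled differently here because a median only makes sense for numbers, while a category such as a port code can only be filled with one of the values already seen.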