Summary
We covered a lot of ground. Let's review the key messages.
What are data categories?
Data fall into several conceptual categories. The most common are:
- Continuous data (numbers).
- Categorical data that has no order.
- Ordinal data, which can be treated as numbers or ordered categories.
Computers store data as distinct types, and we usually try to match the datatype to the data. For example, floating-point numbers suit continuous data because they can store fractions. By contrast, categorical data often arrive as strings (text) and usually need to be encoded, for example as one-hot vectors, before a model can work with them.
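As a rough illustration of the two cases above, the sketch below stores a continuous value as a float and one-hot encodes a small categorical column. The port names, the fare value, and the `one_hot` helper are hypothetical and chosen only for illustration; in practice a library routine would usually do this job.

```python
# Hypothetical example values; not taken from the course dataset.
ports = ["Southampton", "Cherbourg", "Queenstown", "Southampton"]

# Continuous data: a floating-point number can hold the fractional part.
fare = float("7.25")


def one_hot(values):
    """Return (categories, rows), with one 0/1 vector per input value."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(ports)
for port, row in zip(ports, encoded):
    print(port, row)
```

Each row contains a single 1, so the encoding implies no order between the categories, which is why it suits categorical data that has none.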
What makes a good dataset?
We learned that a dataset is helpful if:
- It contains relevant information.
- It's complete.
- It's a good representation of the population (the real world).
If we must deal with incomplete data, we can take steps to make sure that it doesn't cause significant issues. When doing so, we must avoid introducing new issues, such as changes that would make the data no longer representative.
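The trade-off described above can be sketched with two common options for missing values: dropping incomplete entries or filling them in. The ages below are hypothetical, and mean filling is only one possible strategy; whether either option keeps the data representative depends on why the values are missing.

```python
from statistics import mean

# Hypothetical ages; None marks a missing value.
ages = [22.0, None, 38.0, 26.0, None, 34.0]

known = [age for age in ages if age is not None]

# Option 1: drop incomplete entries. Simple, but the sample shrinks,
# and it may no longer be representative if values are not missing at random.
dropped = known

# Option 2: fill the gaps with the mean of the known values.
# This keeps every row and preserves the mean, but understates the spread.
fill_value = mean(known)
filled = [age if age is not None else fill_value for age in ages]

print(len(ages), len(dropped), len(filled))
print(fill_value)
```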
Thinking about data
We showed how data visualization can help to build an understanding of what might be useful in a model. Using different types of graphs, colors, and the like can be fun and can make complex information much more intuitive.
We learned that understanding our data lets us make better decisions about our models. In the final exercise, we improved our model by exploring the count of cabins on the Titanic and considering how this information helped us. Overall, though, we found that we could improve on this further by simplifying the cabin information into nine Deck labels.
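One plausible way to simplify cabin information into a coarse Deck label is sketched below. It assumes that each cabin string starts with its deck letter and that missing cabins get an "Unknown" label; the cabin values and the `deck_label` helper are hypothetical, and the exercise itself may have derived its labels differently.

```python
from collections import Counter

# Hypothetical cabin entries; None marks a passenger with no recorded cabin.
cabins = ["C85", "C123", None, "E46", "B57 B59 B63 B66"]


def deck_label(cabin):
    """Map a cabin string to a deck label (assumed to be its first letter)."""
    cabin = (cabin or "").strip()
    if not cabin:
        return "Unknown"  # assumption: missing cabins form their own label
    return cabin[0].upper()


decks = [deck_label(cabin) for cabin in cabins]
deck_counts = Counter(decks)
print(decks)
print(deck_counts)
```

Collapsing many distinct cabin strings into a handful of labels gives a model far fewer categories to learn from, each backed by more examples.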