Explore data with NumPy and Pandas

Completed

Data scientists can use various tools and techniques to explore, visualize, and manipulate data. One of the most common ways in which data scientists work with data is to use the Python language and some specific packages for data processing.

What is NumPy?

NumPy is a Python library that provides functionality comparable to mathematical tools such as MATLAB and R. While NumPy significantly simplifies the user experience, it also offers comprehensive mathematical functions.

What is Pandas?

Pandas is an extremely popular Python library for data analysis and manipulation. Pandas is like a spreadsheet application for Python, providing easy-to-use functionality for data tables.

Diagram of a Pandas dataframe.

Explore data in a notebook

Notebooks are a popular way of running basic scripts using your web browser. Typically, these notebooks are a single webpage, broken up into text sections and code sections that can be run individually.

Testing hypotheses

Data exploration and analysis is typically an iterative process, in which the data scientist takes a sample of data and performs the following kinds of tasks to analyze it and test hypotheses:

  • Clean data to handle errors, missing values, and other issues.
  • Apply statistical techniques to better understand the data and how the sample might be expected to represent the real-world population of data, allowing for random variation.
  • Visualize data to determine relationships between variables, and in the case of a machine learning project, identify features that are potentially predictive of the label.
  • Revise the hypothesis and repeat the process.