Unsurprisingly, the role of a data scientist primarily involves exploring and analyzing data. Although the end result of data analysis might be a report or a machine learning model, data scientists begin their work with data, and Python is one of the most popular programming languages they use to work with it.

After decades of open-source development, Python provides extensive functionality with powerful statistical and numerical libraries:

  • NumPy and pandas simplify analyzing and manipulating data
  • Matplotlib provides attractive data visualizations
  • Scikit-learn offers simple and effective predictive data analysis
  • TensorFlow and PyTorch supply machine learning and deep learning capabilities

Example scenario

Usually, a data-analysis project is designed to draw insights about a particular scenario or to test a hypothesis.

For example, suppose a university professor collects data about their students, including the number of lectures attended, the hours spent studying, and the final grade achieved on the end-of-term exam. The professor could analyze the data to determine whether there is a relationship between the amount of studying a student undertakes and the final grade they achieve. The professor might also use the data to test a hypothesis that only students who study for a minimum number of hours can expect to achieve a passing grade.
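To make the scenario concrete, here is a minimal sketch of how the professor's two questions might be expressed with pandas. The records, the column names, the passing mark of 60, and the 8-hour threshold are all assumptions invented for illustration; they are not taken from any real class.

```python
import pandas as pd

# Hypothetical records invented for illustration; not real student data.
students = pd.DataFrame(
    {
        "lectures_attended": [10, 12, 6, 14, 8, 15],
        "study_hours": [4.0, 9.5, 2.0, 12.0, 6.5, 14.0],
        "grade": [48, 66, 35, 80, 58, 91],
    }
)

PASS_MARK = 60   # assumed passing grade
MIN_HOURS = 8.0  # assumed study-hours threshold under test

# Pearson correlation between study time and grade (ranges from -1 to 1).
correlation = students["study_hours"].corr(students["grade"])

# Flag who passed and who met the study-hours threshold.
students["passed"] = students["grade"] >= PASS_MARK
students["met_threshold"] = students["study_hours"] >= MIN_HOURS

# The hypothesis "only students who study at least MIN_HOURS pass" is
# contradicted by any student who passed without meeting the threshold.
counterexamples = students[students["passed"] & ~students["met_threshold"]]
hypothesis_holds = counterexamples.empty

print(f"Correlation between study hours and grade: {correlation:.2f}")
print(f"Hypothesis consistent with this sample: {hypothesis_holds}")
```

A sample this small can only fail to contradict the hypothesis; it can't confirm it. A real analysis would need more data and a formal statistical test.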

Diagram of lecture and study time related to student grades.

What will we be doing?

In this training module, we'll explore and analyze grade data for a fictitious university class from a professor's point of view. We'll use Jupyter notebooks and several Python tools and libraries to clean the data set, apply statistical techniques to test several hypotheses about the data, and visualize the data to determine the relationships between variables.
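As a rough preview of the cleaning step, the sketch below shows one way missing values might be handled with pandas. The data frame, its column names, and the choice to fill missing study hours with the column mean are assumptions made for illustration, not the module's prescribed approach.

```python
import pandas as pd

# Hypothetical grade records with gaps, invented for illustration.
raw = pd.DataFrame(
    {
        "student": ["A", "B", "C", "D"],
        "study_hours": [10.0, None, 7.5, 12.0],
        "grade": [72.0, 55.0, None, 88.0],
    }
)

# Count missing values per column to see how much cleaning is needed.
missing_per_column = raw.isnull().sum()

# One option: fill missing study hours with the column mean.
cleaned = raw.copy()
cleaned["study_hours"] = cleaned["study_hours"].fillna(cleaned["study_hours"].mean())

# Drop rows that still lack a grade, since the outcome can't be guessed.
cleaned = cleaned.dropna(subset=["grade"])

print(missing_per_column)
print(cleaned)
```

Whether to fill or drop a missing value depends on the column: filling an input such as study hours keeps the row usable, while inventing an outcome such as a grade would distort any later analysis.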