Team Data Science Process for data scientists

This article provides guidance on a set of objectives that are typically used to implement comprehensive data science solutions with Azure technologies. It guides you through:

  • Understanding an analytics workload
  • Using the Team Data Science Process
  • Using Azure Machine Learning
  • Understanding the foundations of data transfer and storage
  • Providing data source documentation
  • Using tools for analytics processing

These training materials cover the Team Data Science Process (TDSP) along with Microsoft and open-source software and toolkits that are helpful for envisioning, executing, and delivering data science solutions.

Lesson Path

You can use the items in the following table to guide your own self-study. Read the Description column to follow the path, click on the Topic links for study references, and check your skills using the Knowledge Check column.

| Objective | Topic | Description | Knowledge Check |
| --- | --- | --- | --- |
| Understand the processes for developing analytic projects | An introduction to the Team Data Science Process | We begin with an overview of the Team Data Science Process (TDSP). This process guides you through each step of an analytics project. Read through each of these sections to learn more about the process and how you can implement it. | Review and download the TDSP Project Structure artifacts to your local machine for your project. |
| | Agile Development | The Team Data Science Process works well with many different programming methodologies. In this Learning Path, we use Agile software development. Read through the "What is Agile Development?" and "Building Agile Culture" articles, which cover the basics of working with Agile. There are also other references at this site where you can learn more. | Explain Continuous Integration and Continuous Delivery to a colleague. |
| | DevOps for Data Science | Developer Operations (DevOps) involves people, processes, and platforms you can use to work through a project and integrate your solution into an organization's standard IT. This integration is essential for adoption, safety, and security. In this online course, you learn about DevOps practices and some of the toolchain options you have. | Prepare a 30-minute presentation to a technical audience on how DevOps is essential for analytics projects. |
| Understand the Technologies for Data Storage and Processing | Microsoft Business Analytics and AI | We focus on a few technologies in this Learning Path that you can use to create an analytics solution, but Microsoft has many more. To understand the options you have, it's important to review the platforms and features available in Microsoft Azure, the Azure Stack, and on-premises options. Review this resource to learn the various tools you have available to answer analytics questions. | Download and review the presentation materials from this workshop. |
| Set up and Configure your training, development, and production environments | Microsoft Azure | Now let's create an account in Microsoft Azure for training and learn how to create development and test environments. These free training resources get you started. Complete the "Beginner" and "Intermediate" paths. | If you do not have an Azure Account, create one. Log in to the Microsoft Azure portal and create one Resource Group for training. |
| | The Microsoft Azure Command-Line Interface (CLI) | There are multiple ways of working with Microsoft Azure: graphical tools like VSCode and Visual Studio, web interfaces such as the Azure portal, and the command line, such as Azure PowerShell commands and functions. In this article, we cover the Command-Line Interface (CLI), which you can use locally on your workstation, in Windows and other operating systems, as well as in the Azure portal. | Set your default subscription with the Azure CLI. |
| | Microsoft Azure Storage | You need a place to store your data. In this article, you learn about Microsoft Azure's storage options, how to create a storage account, and how to copy or move data to the cloud. Read through this introduction to learn more. | Create a Storage Account in your training Resource Group, create a container for a Blob object, and upload and download data. |
| | Microsoft Azure Active Directory | Microsoft Azure Active Directory (Azure AD) forms the basis of securing your application. In this article, you learn more about accounts, rights, and permissions. Active Directory and security are complex topics, so just read through this resource to understand the fundamentals. | Add one user to Azure Active Directory. NOTE: You may not have permissions for this action if you are not the administrator for the subscription. If that's the case, simply review this tutorial to learn more. |
| | The Microsoft Azure Data Science Virtual Machine | You can install the tools for working with Data Science locally on multiple operating systems. But the Microsoft Azure Data Science Virtual Machine (DSVM) contains all of the tools you need and plenty of project samples to work with. In this article, you learn more about the DSVM and how to work through its examples. This resource explains the Data Science Virtual Machine, how you can create one, and a few options for developing code with it. It also contains all the software you need to complete this learning path, so make sure you complete the Knowledge Check for this topic. | Create a Data Science Virtual Machine and work through at least one lab. |
| Install and Understand the tools and technologies for working with Data Science solutions | Working with git | To follow our DevOps process with the TDSP, we need to have a version-control system. Microsoft Azure Machine Learning uses git, a popular open-source distributed repository system. In this article, you learn more about how to install, configure, and work with git and a central repository, GitHub. | Clone this GitHub project for your learning path project structure. |
| | VSCode | VSCode is a cross-platform Integrated Development Environment (IDE) that you can use with multiple languages and Azure tools. You can use this single environment to create your entire solution. Watch these introductory videos to get started. | Install VSCode, and work through the VS Code features in the Interactive Editor Playground. |
| | Programming with Python | In this solution we use Python, one of the most popular languages in Data Science. This article covers the basics of writing analytic code with Python, and resources to learn more. Work through sections 1-9 of this reference, then check your knowledge. | Add one entity to an Azure Table using Python. |
| | Working with Notebooks | Notebooks are a way of combining text and code in the same document. Azure Machine Learning works with Notebooks, so it is beneficial to understand how to use them. Read through this tutorial and give it a try in the Knowledge Check section. | Open this page, and click on the "Welcome to Python.ipynb" link. Work through the examples on that page. |
| | Machine Learning | Creating advanced analytic solutions involves working with data, using Machine Learning, which also forms the basis of working with Artificial Intelligence and Deep Learning. This course teaches you more about Machine Learning. For a comprehensive course on Data Science, check out this certification. | Locate a resource on Machine Learning Algorithms. (Hint: Search on "azure machine learning algorithm cheat sheet") |
| | scikit-learn | The scikit-learn set of tools allows you to perform data science tasks in Python. We use this framework in our solution. This article covers the basics and explains where you can learn more. | Using the Iris dataset, persist an SVM model using Pickle. |
| | Working with Docker | Docker is a distributed platform used to build, ship, and run applications, and is used frequently in Azure Machine Learning. This article covers the basics of this technology and explains where you can go to learn more. | Open Visual Studio Code, and install the Docker Extension. Create a simple Node Docker container. |
| | HDInsight | HDInsight is the Hadoop open-source infrastructure, available as a service in Microsoft Azure. Your Machine Learning algorithms may involve large sets of data, and HDInsight has the ability to store, transfer, and process data at large scale. This article covers working with HDInsight. | Create a small HDInsight cluster. Use HiveQL statements to project columns onto an /example/data/sample.log file. Alternatively, you can complete this knowledge check on your local system. |
| Create a Data Processing Flow from Business Requirements | Determining the Question, following the TDSP | With the development environment installed and configured, and the understanding of the technologies and processes in place, it's time to put everything together using the TDSP to perform an analysis. We need to start by defining the question and selecting the data sources, and then follow the rest of the steps in the Team Data Science Process. Keep the DevOps process in mind as we work through these steps. In this article, you learn how to take the requirements from your organization and create a data flow map through your application to define your solution using the Team Data Science Process. | Locate a resource on "The 5 data science questions" and describe one question your organization might have in these areas. Which algorithms should you focus on for that question? |
| Use Azure Machine Learning to create a predictive solution | Azure Machine Learning | Microsoft Azure Machine Learning uses AI for data wrangling and feature engineering, manages experiments, and tracks model runs. All of this works in a single environment, and most functions can run locally or in Azure. You can use PyTorch, TensorFlow, and other frameworks to create your experiments. In this article, we focus on a complete example of this process, using everything you've learned so far. | |
| Use Power BI to visualize results | Power BI | Power BI is Microsoft's data visualization tool. It is available on multiple platforms from Web to mobile devices and desktop computers. In this article, you learn how to work with the output of the solution you've created by accessing the results from Azure storage and creating visualizations using Power BI. | Complete this tutorial on Power BI. Then connect Power BI to the Blob CSV created in an experiment run. |
| Monitor your Solution | Application Insights | There are multiple tools you can use to monitor your end solution. Azure Application Insights makes it easy to integrate built-in monitoring into your solution. | Set up Application Insights to monitor an Application. |
| | Azure Monitor logs | Another method to monitor your application is to integrate it into your DevOps process. The Azure Monitor logs system provides a rich set of features to help you watch your analytic solutions after you deploy them. | Complete this tutorial on using Azure Monitor logs. |
| Complete this Learning Path | | Congratulations! You've completed this learning path. | |
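
The scikit-learn knowledge check in the table above asks you to persist an SVM model trained on the Iris dataset using Pickle. One possible way to approach it is sketched below. This is a minimal illustration, not the reference solution: it assumes scikit-learn is installed, and the file name `iris_svm.pkl`, the temporary directory, and the `gamma="scale"` setting are arbitrary choices made for this example.

```python
import os
import pickle
import tempfile

from sklearn import datasets, svm

# Load the Iris dataset as a feature matrix X and a label vector y.
X, y = datasets.load_iris(return_X_y=True)

# Train a support vector classifier (the hyperparameters are illustrative only).
model = svm.SVC(gamma="scale")
model.fit(X, y)

# Persist the fitted model to disk with pickle.
# The path is a hypothetical location chosen for this sketch.
model_path = os.path.join(tempfile.mkdtemp(), "iris_svm.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load the model back and confirm that it predicts the same labels.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Restored model matches original:", same_predictions)
```

Unpickling can execute arbitrary code, so load only pickle files that you created yourself or that come from a source you trust. A pickled model is also tied to the library version that produced it, so it is worth recording the scikit-learn version alongside the file.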


This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Next steps

See Team Data Science Process for Developer Operations. This article explores the Developer Operations (DevOps) functions that are specific to an Advanced Analytics and Cognitive Services solution implementation.