Data science end-to-end scenario: introduction and architecture

This set of tutorials demonstrates a sample end-to-end scenario in the Fabric data science experience. You'll implement each step, from data ingestion, cleansing, and preparation through training machine learning models and generating insights, and then consume those insights with visualization tools like Power BI.

Important

Microsoft Fabric is in preview.

If you're new to Microsoft Fabric, see What is Microsoft Fabric?

Introduction

The lifecycle of a data science project typically includes the following steps, often iteratively:

  • Business understanding
  • Data acquisition
  • Data exploration, cleansing, preparation, and visualization
  • Model training and experiment tracking
  • Model scoring and insight generation

Meeting the goals and success criteria of each stage depends on collaboration, data sharing, and documentation. The Fabric data science experience consists of multiple natively built features that enable collaboration, data acquisition, sharing, and consumption in a seamless way.

In these tutorials, you take the role of a data scientist who has been tasked with exploring, cleaning, and transforming a dataset of taxicab trip data. You then build a machine learning model to predict trip duration, working at scale on a large dataset.

You'll learn to perform the following activities:

  1. Use Fabric notebooks for data science scenarios.

  2. Ingest data into a Fabric lakehouse using Apache Spark (a minimal sketch follows this list).

  3. Load existing data from lakehouse delta tables.

  4. Clean and transform data using Apache Spark.

  5. Create experiments and runs to train a machine learning model.

  6. Register and track trained models using MLflow and the Fabric UI.

  7. Run scoring at scale and save predictions and inference results to the lakehouse.

  8. Visualize predictions in Power BI using Direct Lake.
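
To give a feel for these notebook activities before the tutorials begin, here is a minimal sketch of activity 2: ingesting raw data with Apache Spark and saving it as a lakehouse delta table. The file path and table name are illustrative placeholders, not the ones used later in the series, and the sketch assumes a default lakehouse is attached to the notebook so that relative "Files/..." paths resolve to it.

```python
# Minimal sketch: ingest a raw CSV file and persist it as a lakehouse delta table.
# `spark` is the SparkSession that Fabric notebooks provide out of the box;
# the file path and table name are placeholders for illustration.
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("Files/raw/taxi_trips.csv")
)

# Persist as a managed delta table in the attached lakehouse.
raw_df.write.mode("overwrite").format("delta").saveAsTable("taxi_trips_raw")
```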

Architecture

In this tutorial series, we showcase a simplified end-to-end data science scenario that involves:

  1. Ingesting data from an external data source.
  2. Exploring and visualizing the data.
  3. Cleansing and preparing the data, and engineering features.
  4. Training and evaluating machine learning models.
  5. Scoring models in batch and saving predictions for consumption.
  6. Visualizing the prediction results.

Diagram of the components of the data science end-to-end scenario.

Different components of the data science scenario

Data sources - Fabric makes it quick and easy to connect to Azure Data Services, other cloud platforms, and on-premises data sources and ingest data from them. From Fabric notebooks, you can ingest data from the built-in Lakehouse, Data Warehouse, Power BI datasets, and various custom data sources supported by Apache Spark and Python. This tutorial series focuses on ingesting and loading data from a lakehouse.
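
As a small, hedged illustration of the loading side in a notebook, the snippet below reads an existing lakehouse delta table into a Spark DataFrame. The table name is a placeholder, continuing from the ingestion sketch earlier.

```python
# Load an existing lakehouse delta table into a Spark DataFrame
# ("taxi_trips_raw" is a placeholder name; `spark` is provided by the notebook).
df = spark.read.table("taxi_trips_raw")

# The same table can be queried with Spark SQL.
recent_trips = spark.sql("SELECT * FROM taxi_trips_raw LIMIT 1000")
display(recent_trips)  # display() is available in Fabric notebooks
```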

Explore, clean, and prepare - The Fabric data science experience supports data cleansing, transformation, exploration, and featurization through built-in experiences on Spark as well as Python-based tools like Data Wrangler and the SemPy library. This tutorial showcases data exploration with the Python library seaborn, and data cleansing and preparation with Apache Spark.
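
The sketch below illustrates that split, continuing with the `df` loaded in the previous sketch: seaborn for quick exploration on a small pandas sample, and Apache Spark for cleansing at scale. The column names and thresholds are assumptions for illustration.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Exploration: pull a small sample into pandas and plot the trip-duration distribution.
sample_pdf = df.sample(fraction=0.01, seed=42).toPandas()
sns.histplot(data=sample_pdf, x="tripDuration")  # "tripDuration" is an assumed column name
plt.show()

# Cleansing and preparation with Spark: remove duplicates and nulls,
# then filter out implausible trips (thresholds are illustrative).
clean_df = (
    df.dropDuplicates()
      .na.drop(subset=["tripDuration", "passengerCount"])
      .filter("tripDuration > 0 AND tripDuration < 14400")
)
```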

Models and experiments - Fabric enables you to train, evaluate, and score machine learning models by using built-in experiment and model items, with seamless MLflow integration for experiment tracking and model registration and deployment. Fabric also offers model prediction at scale (PREDICT) so you can gain and share business insights.
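
The hedged sketch below shows what experiment tracking looks like with plain MLflow APIs, which the Fabric integration described above builds on. The experiment name, toy data, and model choice are illustrative, not the ones used in the tutorials.

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

mlflow.set_experiment("taxi-trip-duration")  # placeholder experiment name

# Toy stand-in for the prepared lakehouse data.
data = pd.DataFrame({
    "tripDistance":   [1.2, 3.4, 0.8, 7.5, 2.1, 4.9],
    "passengerCount": [1,   2,   1,   3,   1,   2],
    "tripDuration":   [420, 900, 300, 1800, 600, 1200],
})
X = data[["tripDistance", "passengerCount"]]
y = data["tripDuration"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)
    mlflow.log_metric("mae", mean_absolute_error(y_test, model.predict(X_test)))
    # Logging and registering the model makes it show up as a model item in the Fabric UI.
    mlflow.sklearn.log_model(model, artifact_path="model", registered_model_name="taxi-duration-model")
```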

Storage - Fabric standardizes on Delta Lake, which means all Fabric engines can interact with the same dataset stored in a lakehouse. This storage layer lets you store both structured and unstructured data, with support for both file-based storage and tabular formats. The stored datasets and files are easily accessible from all Fabric experience items, such as notebooks and pipelines.
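
As a quick, hedged illustration of that dual layout: managed delta tables appear under the lakehouse Tables area, while arbitrary files go under Files. The sketch assumes a default lakehouse attached to the notebook and mounted at /lakehouse/default; the table name, path, and contents are placeholders.

```python
import os
import pandas as pd

# Structured, tabular data: save as a managed delta table (shows up under Tables).
summary_df = spark.createDataFrame(pd.DataFrame({"zone": ["A", "B"], "trips": [120, 87]}))
summary_df.write.mode("overwrite").format("delta").saveAsTable("trip_summary")

# Unstructured, file-based data: write to the Files area of the default lakehouse.
# The /lakehouse/default mount path assumes a default lakehouse is attached.
os.makedirs("/lakehouse/default/Files/notes", exist_ok=True)
with open("/lakehouse/default/Files/notes/run_notes.txt", "w") as f:
    f.write("Free-form notes stored alongside the delta tables.")
```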

Expose analysis and insights - Data from a lakehouse can be consumed by Power BI, the industry-leading business intelligence tool, for reporting and visualization. Data persisted in the lakehouse can also be visualized in notebooks using Spark or Python-native visualization libraries such as matplotlib, seaborn, and Plotly. Data can also be visualized with the SemPy library, which supports built-in, rich, task-specific visualizations for the semantic data model, for dependencies and their violations, and for classification and regression use cases.
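
For the notebook side of this, a minimal sketch: read the saved predictions back from the lakehouse and compare them to actuals with seaborn. The table and column names are placeholders.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Read scored results back from the lakehouse (placeholder table name) into pandas.
pred_pdf = spark.read.table("taxi_trip_predictions").limit(10000).toPandas()

# Compare predicted versus actual trip duration.
sns.scatterplot(data=pred_pdf, x="tripDuration", y="predictedDuration", alpha=0.3)
plt.title("Predicted vs. actual trip duration")
plt.show()
```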

Next steps