Events
Take the Microsoft Learn Challenge
Nov 19, 11 PM - Jan 10, 11 PM
Ignite Edition - Build skills in Microsoft Azure and earn a digital badge by January 10!
Register nowThis browser is no longer supported.
Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support.
In this article, you learn about the available options for building a data ingestion pipeline with Azure Data Factory. This Azure Data Factory pipeline is used to ingest data for use with Azure Machine Learning. Data Factory allows you to easily extract, transform, and load (ETL) data. Once the data is transformed and loaded into storage, it can be used to train your machine learning models in Azure Machine Learning.
Simple data transformation can be handled with native Data Factory activities and instruments such as data flow. When it comes to more complicated scenarios, the data can be processed with some custom code. For example, Python or R code.
There are several common techniques of using Data Factory to transform data during ingestion. Each technique has advantages and disadvantages that help determine if it's a good fit for a specific use case:
Technique | Advantages | Disadvantages |
---|---|---|
Data Factory + Azure Functions | Only good for short running processing | |
Data Factory + custom component | ||
Data Factory + Azure Databricks notebook |
Azure Functions allows you to run small pieces of code (functions) without worrying about application infrastructure. In this option, the data is processed with custom Python code wrapped into an Azure Function.
The function is invoked with the Azure Data Factory Azure Function activity. This approach is a good option for lightweight data transformations.
In this option, the data is processed with custom Python code wrapped into an executable. It's invoked with an Azure Data Factory Custom Component activity. This approach is a better fit for large data than the previous technique.
Azure Databricks is an Apache Spark-based analytics platform in the Microsoft cloud.
In this technique, the data transformation is performed by a Python notebook, running on an Azure Databricks cluster. This is probably, the most common approach that uses the full power of an Azure Databricks service. It's designed for distributed data processing at scale.
The Data Factory pipeline saves the prepared data to your cloud storage (such as Azure Blob or Azure Data Lake).
Consume your prepared data in Azure Machine Learning by,
This method is recommended for Machine Learning Operations (MLOps) workflows. If you don't want to set up an Azure Machine Learning pipeline, see Read data directly from storage.
Each time the Data Factory pipeline runs,
Tip
Datasets support versioning, so the ML pipeline can register a new version of the dataset that points to the most recent data from the ADF pipeline.
Once the data is accessible through a datastore or dataset, you can use it to train an ML model. The training process might be part of the same ML pipeline that is called from ADF. Or it might be a separate process such as experimentation in a Jupyter notebook.
Since datasets support versioning, and each job from the pipeline creates a new version, it's easy to understand which version of the data was used to train a model.
If you don't want to create an ML pipeline, you can access the data directly from the storage account where your prepared data is saved with an Azure Machine Learning datastore and dataset.
The following Python code demonstrates how to create a datastore that connects to Azure DataLake Generation 2 storage. Learn more about datastores and where to find service principal permissions.
APPLIES TO: Python SDK azureml v1
ws = Workspace.from_config()
adlsgen2_datastore_name = '<ADLS gen2 storage account alias>' #set ADLS Gen2 storage account alias in Azure Machine Learning
subscription_id=os.getenv("ADL_SUBSCRIPTION", "<ADLS account subscription ID>") # subscription id of ADLS account
resource_group=os.getenv("ADL_RESOURCE_GROUP", "<ADLS account resource group>") # resource group of ADLS account
account_name=os.getenv("ADLSGEN2_ACCOUNTNAME", "<ADLS account name>") # ADLS Gen2 account name
tenant_id=os.getenv("ADLSGEN2_TENANT", "<tenant id of service principal>") # tenant id of service principal
client_id=os.getenv("ADLSGEN2_CLIENTID", "<client id of service principal>") # client id of service principal
client_secret=os.getenv("ADLSGEN2_CLIENT_SECRET", "<secret of service principal>") # the secret of service principal
adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
workspace=ws,
datastore_name=adlsgen2_datastore_name,
account_name=account_name, # ADLS Gen2 account name
filesystem='<filesystem name>', # ADLS Gen2 filesystem
tenant_id=tenant_id, # tenant id of service principal
client_id=client_id, # client id of service principal
Next, create a dataset to reference the files you want to use in your machine learning task.
The following code creates a TabularDataset from a csv file, prepared-data.csv
. Learn more about dataset types and accepted file formats.
APPLIES TO: Python SDK azureml v1
from azureml.core import Workspace, Datastore, Dataset
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
# retrieve data via Azure Machine Learning datastore
datastore = Datastore.get(ws, adlsgen2_datastore)
datastore_path = [(datastore, '/data/prepared-data.csv')]
prepared_dataset = Dataset.Tabular.from_delimited_files(path=datastore_path)
From here, use prepared_dataset
to reference your prepared data, like in your training scripts. Learn how to Train models with datasets in Azure Machine Learning.
Events
Take the Microsoft Learn Challenge
Nov 19, 11 PM - Jan 10, 11 PM
Ignite Edition - Build skills in Microsoft Azure and earn a digital badge by January 10!
Register nowTraining
Module
Petabyte-scale ingestion with Azure Data Factory - Training
Petabyte-scale ingestion with Azure Data Factory or Azure Synapse Pipeline
Certification
Microsoft Certified: Azure Data Scientist Associate - Certifications
Manage data ingestion and preparation, model training and deployment, and machine learning solution monitoring with Python, Azure Machine Learning and MLflow.