Invoking batch endpoints from Azure Data Factory
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
Azure Data Factory allows the creation of pipelines that can orchestrate multiple data transformations and manage them as a single unit. Batch endpoints are an excellent candidate to become a step in such a processing workflow. In this example, learn how to use batch endpoints in Azure Data Factory activities by relying on the Web Invoke activity and the REST API.
This example assumes that you have a model correctly deployed as a batch endpoint. Particularly, we are using the heart condition classifier created in the tutorial Using MLflow models in batch deployments.
You also need an Azure Data Factory resource created and configured. If you haven't created your data factory yet, follow the steps in Quickstart: Create a data factory by using the Azure portal and Azure Data Factory Studio to create one.
After creating it, browse to the data factory in the Azure portal:
Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration application in a separate tab.
Authenticating against batch endpoints
Azure Data Factory can invoke the REST APIs of batch endpoints by using the Web Invoke activity. Batch endpoints support Azure Active Directory for authorization, so requests made to the APIs must be properly authenticated.
You can use a service principal or a managed identity to authenticate against Batch Endpoints. We recommend using a managed identity as it simplifies the use of secrets.
Batch Endpoints can consume data stored in storage accounts instead of Azure Machine Learning Data Stores or Data Assets. However, you may need to configure additional permissions for the identity of the compute on which the batch endpoint runs. See Security considerations when reading data.
You can use Azure Data Factory managed identity to communicate with Batch Endpoints. In this case, you only need to make sure that your Azure Data Factory resource was deployed with a managed identity.
If you don't have an Azure Data Factory resource, or it was already deployed without a managed identity, follow these steps to create it: Managed identity for Azure Data Factory.
Note that you can't change the resource identity in Azure Data Factory after deployment. If you need a different identity, you have to recreate the resource.
Once deployed, grant the managed identity of the resource you created access to your Azure Machine Learning workspace, as explained at Grant access. In this example, the identity requires:
- Permission in the workspace to read batch deployments and perform actions over them.
- Permissions to read/write in data stores.
- Permissions to read in any cloud location (storage account) indicated as a data input.
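As a rough illustration of what authenticating with a service principal involves outside of Data Factory, the following sketch builds (without sending) an OAuth 2.0 client-credentials token request against Azure Active Directory, plus the bearer headers that would accompany calls to the batch endpoint. The helper names, the placeholder tenant and client values, and the use of `https://ml.azure.com` as the token resource are assumptions of this sketch rather than something the pipeline template prescribes. With a managed identity, Data Factory acquires the token on your behalf and no secret is handled.

```python
import urllib.parse
import urllib.request

# Assumption: tokens for batch endpoints are requested for this resource.
AML_RESOURCE = "https://ml.azure.com"


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build, but do not send, a client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    form = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "resource": AML_RESOURCE,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def bearer_headers(access_token: str) -> dict:
    """Headers to attach to each REST call made to the batch endpoint."""
    return {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
```

Sending the request (for example with `urllib.request.urlopen`) would return a JSON document whose `access_token` field is then passed to `bearer_headers`. In practice, keep the client secret in a secret store rather than in pipeline parameters.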
About the pipeline
We are going to create a pipeline in Azure Data Factory that can invoke a given batch endpoint over some data. The pipeline will communicate with Azure Machine Learning batch endpoints using REST. To learn more about how to use the REST API of batch endpoints, read Deploy models with REST for batch scoring.
The pipeline will look as follows:
It is composed of the following activities:
- Run Batch-Endpoint: It's a Web Activity that uses the batch endpoint URI to invoke it. It passes the input data URI where the data is located and the expected output file.
- Wait for job: It's a loop activity that checks the status of the created job and waits for its completion, either as Completed or Failed. This activity, in turn, uses the following activities:
- Check status: It's a Web Activity that queries the status of the job resource that was returned as a response of the Run Batch-Endpoint activity.
- Wait: It's a Wait Activity that controls the polling frequency of the job's status. We set a default of 120 seconds (2 minutes).
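The control flow of the Wait for job loop can be sketched in Python as follows. This is only an illustration of the polling logic, not the pipeline's actual implementation: `get_status` is a hypothetical callable standing in for the Check status Web Activity, and `max_polls` is an extra safety limit that the text above does not mention.

```python
import time
from typing import Callable

# Terminal states named in the activity description; any other value is
# treated as "still running".
TERMINAL_STATES = {"Completed", "Failed"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 120,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 1000,
) -> str:
    """Poll the job status until it reaches a terminal state.

    get_status    -- hypothetical stand-in for the "Check status" activity
    poll_interval -- seconds to wait between checks (the "Wait" activity)
    sleep         -- injectable so the loop can be exercised without waiting
    max_polls     -- safety limit added for this sketch
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_interval)
    raise TimeoutError(f"Job did not finish after {max_polls} status checks")
```

Injecting `sleep` keeps the loop testable: passing a list's `append` method records the waits instead of actually pausing.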
The pipeline requires the following parameters to be configured:
||The endpoint scoring URI||
||The API version to use with REST API calls. Defaults to
||The number of seconds to wait before checking the job status for completion. Defaults to
||The endpoint's input data. Multiple data input types are supported. Ensure that the managed identity you are using for executing the job has access to the underlying location. Alternatively, if using Data Stores, ensure the credentials are indicated there.||
||The endpoint's output data file. It must be a path to an output file in a Data Store attached to the Machine Learning workspace. No other type of URI is supported. You can use the default Azure Machine Learning data store, named
endpoint_output_uri should be the path to a file that doesn't exist yet. Otherwise, the job fails with an error stating that the path already exists.
The input data URI can be a path to an Azure Machine Learning data store, data asset, or a cloud URI. Depending on the case, further configuration may be required to ensure the deployment can read the data properly. See Accessing storage services for details.
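Because the output path must live in a data store and must not already exist, one option is to generate a fresh file name for every run. The sketch below shows one way to do that. The function names, the `my_datastore` and `batch-outputs` placeholders, and the choice of a random identifier are assumptions made for illustration; inside Data Factory, a pipeline run identifier could serve the same purpose.

```python
import uuid
from typing import Optional

# Data store paths use the form azureml://datastores/<name>/paths/<path>.
DATASTORE_PREFIX = "azureml://datastores/"


def is_datastore_path(uri: str) -> bool:
    """Return True if the URI looks like a path inside a data store."""
    return uri.startswith(DATASTORE_PREFIX) and "/paths/" in uri


def unique_output_uri(
    datastore: str,
    folder: str,
    run_id: Optional[str] = None,
    extension: str = "csv",
) -> str:
    """Build an output file URI that is unlikely to exist already.

    datastore -- name of a data store registered in the workspace
    folder    -- folder inside the data store to write into
    run_id    -- unique token for this run; a random one is generated if omitted
    """
    token = run_id or uuid.uuid4().hex
    clean_folder = folder.strip("/")
    return f"{DATASTORE_PREFIX}{datastore}/paths/{clean_folder}/{token}.{extension}"


# Hypothetical data store and folder names.
output_uri = unique_output_uri("my_datastore", "batch-outputs")
```

The resulting value can be supplied as `endpoint_output_uri`, and `is_datastore_path` offers a quick check before submitting the job.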
To create this pipeline in your existing Azure Data Factory, follow these steps:
Open Azure Data Factory Studio and under Factory Resources click the plus sign.
Select Pipeline > Import from pipeline template
You will be prompted to select a zip file. Use the following template if you are using managed identities, or the following one if you are using a service principal.
A preview of the pipeline will show up in the portal. Click Use this template.
The pipeline will be created for you with the name Run-BatchEndpoint.
Configure the parameters of the batch deployment you are using:
Ensure that your batch endpoint has a default deployment configured before submitting a job to it. The created pipeline will invoke the endpoint and hence a default deployment needs to be created and configured.
For best reusability, use the created pipeline as a template and call it from within other Azure Data Factory pipelines by leveraging the Execute pipeline activity. In that case, do not configure the parameters in the inner pipeline but pass them as parameters from the outer pipeline as shown in the following image:
- Your pipeline is ready to be used.
When calling Azure Machine Learning batch deployments, consider the following limitations:
- Data inputs:
- Only Azure Machine Learning data stores or Azure Storage Accounts (Azure Blob Storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2) are supported as inputs. If your input data is in another source, use the Azure Data Factory Copy activity before the execution of the batch job to sink the data to a compatible store.
- Ensure the deployment has the required access to read the input data depending on the type of input you are using. See Accessing storage services for details.
- Data outputs:
- Only registered Azure Machine Learning data stores are supported.
- Only Azure Blob Storage Accounts are supported for outputs. For instance, Azure Data Lake Storage Gen2 isn't supported as output in batch deployment jobs. If you need to output the data to a different location/sink, use the Azure Data Factory Copy activity after the execution of the batch job.
Considerations when reading and writing data
When reading and writing data, take into account the following considerations:
- Batch endpoint jobs don't explore nested folders and hence can't work with nested folder structures. If your data is distributed across multiple folders, you have to flatten the structure.
- Make sure that the scoring script provided in the deployment can handle the data as it is expected to be fed into the job. If the model is an MLflow model, check the file types that are currently supported at Using MLflow models in batch deployments.
- Batch endpoints distribute and parallelize the work across multiple workers at the file level. Make sure that each worker node has enough memory to load an entire data file at once and send it to the model. This is especially true for tabular data.
- When estimating the memory consumption of your jobs, take the model's memory footprint into account too. Some models, like transformers in NLP, don't have a linear relationship between the size of the inputs and the memory consumption. In those cases, consider further partitioning your data into multiple files to allow a greater degree of parallelization with smaller files.
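Two of the considerations above, flattening nested folders and partitioning large files, can be handled in a preparation step before the data is uploaded. The following is a minimal local-filesystem sketch, assuming CSV inputs with a header row. The function names, the `__` separator, and the `_partNNNN` naming are choices made for this example only.

```python
import csv
import shutil
from pathlib import Path
from typing import List


def flatten_folder(source: Path, target: Path) -> List[Path]:
    """Copy every file under `source`, at any depth, directly into `target`.

    `target` should live outside `source`. The relative folder path is
    encoded into the file name so that files with the same name don't collide.
    """
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        flat_name = "__".join(path.relative_to(source).parts)
        destination = target / flat_name
        shutil.copy2(path, destination)
        copied.append(destination)
    return copied


def _write_part(target: Path, stem: str, index: int, header: list, rows: list) -> Path:
    """Write one chunk of rows, preceded by the header, to its own CSV file."""
    part = target / f"{stem}_part{index:04d}.csv"
    with part.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return part


def split_csv(source: Path, target: Path, rows_per_file: int) -> List[Path]:
    """Split one CSV file into smaller files, repeating the header in each."""
    if rows_per_file < 1:
        raise ValueError("rows_per_file must be at least 1")
    target.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    with source.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:  # empty file: nothing to split
            return parts
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                parts.append(_write_part(target, source.stem, len(parts), header, chunk))
                chunk = []
        if chunk:  # remaining rows that didn't fill a whole file
            parts.append(_write_part(target, source.stem, len(parts), header, chunk))
    return parts
```

A suitable `rows_per_file` depends on the memory available on each worker node and on the model's own footprint, so it is best chosen by measuring a representative job.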