Create jobs and input data for batch endpoints
Batch endpoints can be used to perform long-running batch operations over large amounts of data. That data can reside in different locations. Some types of batch endpoints can also receive literal parameters as inputs. In this tutorial, we cover how to specify those inputs and the different types and locations that are supported.
Before invoking an endpoint
To successfully invoke a batch endpoint and create jobs, ensure you have the following:
You have permissions to run a batch endpoint deployment. Read Authorization on batch endpoints to learn about the specific permissions needed.
You have a valid Microsoft Entra ID token representing a security principal to invoke the endpoint. This principal can be a user principal or a service principal. In any case, once an endpoint is invoked, a batch deployment job is created under the identity associated with the token. For testing purposes, you can use your own credentials for the invocation as mentioned below.
Use the Azure CLI to log in using either interactive or device code authentication:
az login
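For device code authentication (mentioned above), for example when no browser is available, pass the corresponding flag:
az login --use-device-code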
To learn more about how to authenticate with multiple types of credentials, read Authorization on batch endpoints.
The compute cluster where the endpoint is deployed has access to read the input data.
Tip
If you are using a credential-less data store or an external Azure Storage Account as data input, ensure you configure compute clusters for data access. The managed identity of the compute cluster is used for mounting the storage account. The identity of the job (invoker) is still used to read the underlying data, allowing you to achieve granular access control.
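As a sketch of one way to set that up, you could define the cluster with a system-assigned managed identity in its YAML definition and create it with the CLI. The cluster name and VM size below are placeholders, not values taken from this article:
compute.yml
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: batch-cluster
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 2
identity:
  type: system_assigned
az ml compute create -f compute.yml
You would then grant that identity access to the storage account, for example by assigning it the Storage Blob Data Reader role.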
Understanding inputs and outputs
Batch endpoints provide a durable API that consumers can use to create batch jobs. The same interface can be used to indicate the inputs and the outputs your deployment expects. Use inputs to pass any information your endpoint needs to perform the job.
Batch endpoints support two types of inputs:
- Data inputs, which are pointers to a specific storage location or Azure Machine Learning asset.
- Literal inputs, which are literal values (like numbers or strings) that you want to pass to the job.
The number and type of inputs and outputs depend on the type of batch deployment. Model deployments always require 1 data input and produce 1 data output. Literal inputs are not supported. However, pipeline component deployments provide a more general construct to build endpoints. You can indicate any number of inputs (data and literal) and outputs.
The following table summarizes it:
Deployment type | Number of inputs | Supported input types | Number of outputs | Supported output types |
---|---|---|---|---|
Model deployment | 1 | Data inputs | 1 | Data outputs |
Pipeline component deployment | [0..N] | Data inputs and literal inputs | [0..N] | Data outputs |
Tip
Inputs and outputs are always named. Those names serve as keys to identify them and to pass the actual values during invocation. For model deployments, since they always require one input and one output, the name is ignored during invocation. You can assign the name that best describes your use case, like "sales_estimation".
Data inputs
Data inputs refer to inputs that point to a location where data is placed. Since batch endpoints usually consume large amounts of data, you can't pass the input data as part of the invocation request. Instead, you indicate the location where the batch endpoint should look for the data. Input data is mounted and streamed on the target compute to improve performance.
Batch endpoints support reading files located in the following storage options:
- Azure Machine Learning Data Assets, including Folder (`uri_folder`) and File (`uri_file`).
- Azure Machine Learning Data Stores, including Azure Blob Storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2.
- Azure Storage Accounts, including Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, and Azure Blob Storage.
- Local data folders/files (Azure Machine Learning CLI or Azure Machine Learning SDK for Python). However, that operation results in the local data being uploaded to the default Azure Machine Learning Data Store of the workspace you're working on (see the sketch after this list).
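As a sketch of the local-data case, and assuming (as the list above states) that the CLI accepts a local folder for `--input` and uploads it to the default data store for you, an invocation could look like this; the folder path is only an illustration:
az ml batch-endpoint invoke --name $ENDPOINT_NAME --input ./data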
Important
Deprecation notice: Datasets of type `FileDataset` (V1) are deprecated and will be retired in the future. Existing batch endpoints relying on this functionality will continue to work, but batch endpoints created with GA CLIv2 (2.4.0 and newer) or GA REST API (2022-05-01 and newer) won't support V1 datasets.
Literal inputs
Literal inputs refer to inputs that can be represented and resolved at invocation time, like strings, numbers, and boolean values. You typically use literal inputs to pass parameters to your endpoint as part of a pipeline component deployment. Batch endpoints support the following literal types:
- `string`
- `boolean`
- `float`
- `integer`
Literal inputs are only supported in Pipeline Component deployments. See Create jobs with literal inputs to learn how to indicate them.
Data outputs
Data outputs refer to the location where the results of a batch job should be placed. Outputs are identified by name, and Azure Machine Learning automatically assigns a unique path to each named output. However, you can indicate another path if required. Batch endpoints only support writing outputs to blob-based Azure Machine Learning data stores.
Create jobs with data inputs
The following examples show how to create jobs taking data inputs from data assets, data stores, and Azure Storage Accounts.
Input data from a data asset
Azure Machine Learning data assets (formerly known as datasets) are supported as inputs for jobs. Follow these steps to run a batch endpoint job using data stored in a registered data asset in Azure Machine Learning:
Warning
Data assets of type Table (`MLTable`) aren't currently supported.
Let's create the data asset first. This data asset consists of a folder with multiple CSV files that we want to process in parallel using batch endpoints. You can skip this step if your data is already registered as a data asset.
Create a data asset definition in `YAML`:
heart-dataset-unlabeled.yml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: heart-dataset-unlabeled
description: An unlabeled dataset for heart classification.
type: uri_folder
path: heart-classifier-mlflow/data
Then, create the data asset:
az ml data create -f heart-dataset-unlabeled.yml
Create the input or request:
DATASET_ID=$(az ml data show -n heart-dataset-unlabeled --label latest | jq -r .id)
Note
The data asset's ID would look like `/subscriptions/<subscription>/resourcegroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/data/<data-asset>/versions/<version>`. You can also use `azureml:/<dataset_name>@latest` as a way to indicate the input.
Run the endpoint:
Use the argument `--set` to indicate the input:
az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.type uri_folder inputs.heart_dataset.path $DATASET_ID
If your endpoint serves a model deployment, you can use the short form which supports only 1 input:
az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $DATASET_ID
The argument `--set` tends to produce long commands when multiple inputs are indicated. In those cases, place your inputs in a `YAML` file and use `--file` to indicate the inputs you need for your endpoint invocation.
inputs.yml
inputs:
  heart_dataset: azureml:/<dataset_name>@latest
az ml batch-endpoint invoke --name $ENDPOINT_NAME --file inputs.yml
Input data from data stores
Data from Azure Machine Learning registered data stores can be directly referenced by batch deployment jobs. In this example, we're going to first upload some data to the default data store in the Azure Machine Learning workspace and then run a batch deployment on it. Follow these steps to run a batch endpoint job using data stored in a data store:
Let's get access to the default data store in the Azure Machine Learning workspace. If your data is in a different store, you can use that store instead. There's no requirement to use the default data store.
DATASTORE_ID=$(az ml datastore show -n workspaceblobstore | jq -r '.id')
Note
The data store's ID would look like `/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/datastores/<data-store>`.
Tip
The default blob data store in a workspace is called workspaceblobstore. You can skip this step if you already know the resource ID of the default data store in your workspace.
We'll need to upload some sample data to the data store. This example assumes you've uploaded the sample data included in the repo under `sdk/python/endpoints/batch/deploy-models/heart-classifier-mlflow/data` to a folder named `heart-disease-uci-unlabeled` in the blob storage account. Ensure you've done that before moving forward; one possible way to do it is sketched below.
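As a sketch, assuming you run the command from the root of the examples repo and substitute your own storage account and container names, the upload could be done with the Azure CLI:
az storage blob upload-batch --account-name <storage-account> \
    --destination <container> --destination-path heart-disease-uci-unlabeled \
    --source sdk/python/endpoints/batch/deploy-models/heart-classifier-mlflow/data \
    --auth-mode login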
Create the input or request:
Let's place the file path in the following variable:
DATA_PATH="heart-disease-uci-unlabeled" INPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"
Note
See how the path `paths` is appended to the resource ID of the data store to indicate that what follows is a path inside of it.
Tip
You can also use `azureml://datastores/<data-store>/paths/<data-path>` as a way to indicate the input.
Run the endpoint:
Use the argument `--set` to indicate the input:
az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.type uri_folder inputs.heart_dataset.path $INPUT_PATH
If your endpoint serves a model deployment, you can use the short form which supports only 1 input:
az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $INPUT_PATH --input-type uri_folder
The argument `--set` tends to produce long commands when multiple inputs are indicated. In those cases, place your inputs in a `YAML` file and use `--file` to indicate the inputs you need for your endpoint invocation.
inputs.yml
inputs:
  heart_dataset:
    type: uri_folder
    path: azureml://datastores/<data-store>/paths/<data-path>
az ml batch-endpoint invoke --name $ENDPOINT_NAME --file inputs.yml
If your data is a file, use `uri_file` as the type instead.
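For instance, a sketch pointing to a single CSV file in the same data store (the file name here is only a placeholder):
az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.type uri_file inputs.heart_dataset.path $DATASTORE_ID/paths/$DATA_PATH/<file-name>.csv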
Input data from Azure Storage Accounts
Azure Machine Learning batch endpoints can read data from cloud locations in Azure Storage Accounts, both public and private. Use the following steps to run a batch endpoint job using data stored in a storage account:
Note
Check the section configure compute clusters for data access to learn more about the additional configuration required to successfully read data from storage accounts.
Create the input or request:
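For example, you can point a variable at the publicly available sample data used later in this section:
INPUT_DATA="https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data"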
Run the endpoint:
Use the argument `--set` to indicate the input:
az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.type uri_folder inputs.heart_dataset.path $INPUT_DATA
If your endpoint serves a model deployment, you can use the short form which supports only 1 input:
az ml batch-endpoint invoke --name $ENDPOINT_NAME --input $INPUT_DATA --input-type uri_folder
The argument `--set` tends to produce long commands when multiple inputs are indicated. In those cases, place your inputs in a `YAML` file and use `--file` to indicate the inputs you need for your endpoint invocation.
inputs.yml
inputs:
  heart_dataset:
    type: uri_folder
    path: https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data
az ml batch-endpoint invoke --name $ENDPOINT_NAME --file inputs.yml
If your data is a file, use `uri_file` as the type instead.
Create jobs with literal inputs
Pipeline component deployments can take literal inputs. The following example shows how to indicate an input named `score_mode`, of type `string`, with a value of `append`:
Place your inputs in a `YAML` file and use `--file` to indicate the inputs you need for your endpoint invocation.
inputs.yml
inputs:
score_mode:
type: string
default: append
az ml batch-endpoint invoke --name $ENDPOINT_NAME --file inputs.yml
You can also use the argument `--set` to indicate the value. However, it tends to produce long commands when multiple inputs are indicated:
az ml batch-endpoint invoke --name $ENDPOINT_NAME \
--set inputs.score_mode.type string inputs.score_mode.default append
Create jobs with data outputs
The following example shows how to change the location where an output named `score` is placed. For completeness, these examples also configure an input named `heart_dataset`.
Let's use the default data store in the Azure Machine Learning workspace to save the outputs. You can use any other data store in your workspace as long as it's a blob storage account.
Create a data output:
DATA_PATH="batch-jobs/my-unique-path" OUTPUT_PATH="$DATASTORE_ID/paths/$DATA_PATH"
For completeness, let's also create a data input:
INPUT_PATH="https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data"
Note
See how the path `paths` is appended to the resource ID of the data store to indicate that what follows is a path inside of it.
Run the deployment:
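As a sketch, and assuming the `--set` syntax shown for inputs also accepts paths for named outputs, the invocation could look like this:
az ml batch-endpoint invoke --name $ENDPOINT_NAME \
    --set inputs.heart_dataset.type uri_folder inputs.heart_dataset.path $INPUT_PATH \
    --set outputs.score.path $OUTPUT_PATH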
Invoke a specific deployment
Batch endpoints can host multiple deployments under the same endpoint. The default deployment is used unless the user indicates otherwise. You can change the deployment that is used as follows:
Use the argument `--deployment-name` or `-d` to indicate the name of the deployment:
az ml batch-endpoint invoke --name $ENDPOINT_NAME --deployment-name $DEPLOYMENT_NAME --input $INPUT_DATA
Next steps