Customize outputs in batch deployments
APPLIES TO:
Azure CLI ml extension v2 (current)
Python SDK azure-ai-ml v2 (current)
Sometimes you need to run inference with more control over what is written as the output of the batch job. Those cases include:
- You need to control how the predictions are written in the output. For instance, you want to append the predictions to the original data (if the data is tabular).
- You need to write your predictions in a different file format from the one supported out-of-the-box by batch deployments.
- Your model is a generative model that can't write the output in a tabular format. For instance, models that produce images as outputs.
- Your model produces multiple tabular files instead of a single one. This is the case, for instance, for models that perform forecasting by considering multiple scenarios.
In any of those cases, batch deployments allow you to take control of the output of the jobs by letting you write directly to the output of the batch deployment job. In this tutorial, we'll see how to deploy a model to perform batch inference and write the outputs in parquet format by appending the predictions to the original input data.
About this sample
This example shows how you can deploy a model to perform batch inference and customize how your predictions are written in the output. The model is based on the UCI Heart Disease Data Set. The database contains 76 attributes, but we are using a subset of 14 of them. The model tries to predict the presence of heart disease in a patient. The predicted value is an integer: 0 (no presence) or 1 (presence).
The model has been trained using an XGBoost classifier, and all the required preprocessing has been packaged as a scikit-learn pipeline, making this model an end-to-end pipeline that goes from raw data to predictions.
The example in this article is based on code samples contained in the azureml-examples repository. To run the commands locally without having to copy/paste YAML and other files, first clone the repo and then change directories to the folder:
git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli
The files for this example are in:
cd endpoints/batch/deploy-models/custom-outputs-parquet
Follow along in Jupyter Notebooks
You can follow along with this sample in a Jupyter notebook. In the cloned repository, open the notebook: custom-output-batch.ipynb.
Prerequisites
Before following the steps in this article, make sure you have the following prerequisites:
An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.
An Azure Machine Learning workspace. If you don't have one, use the steps in the How to manage workspaces article to create one.
Ensure you have the following permissions in the workspace:
- Create/manage batch endpoints and deployments: Use the Owner or Contributor role, or a custom role allowing Microsoft.MachineLearningServices/workspaces/batchEndpoints/*.
- Create ARM deployments in the workspace resource group: Use the Owner or Contributor role, or a custom role allowing Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.
You will need to install the following software to work with Azure Machine Learning:
The Azure CLI and the ml extension for Azure Machine Learning. To install the extension, run:
az extension add -n ml
Note
Pipeline component deployments for Batch Endpoints were introduced in version 2.7 of the ml extension for the Azure CLI. Use az extension update --name ml to get the latest version.
Connect to your workspace
The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section, we'll connect to the workspace in which you'll perform deployment tasks.
Pass in the values for your subscription ID, workspace, location, and resource group in the following code:
az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>
Creating a batch deployment with a custom output
In this example, we are going to create a deployment that can write directly to the output folder of the batch deployment job. The deployment will use this feature to write custom parquet files.
Registering the model
Batch endpoints can only deploy registered models. In this case, we already have a local copy of the model in the repository, so we only need to publish the model to the registry in the workspace. You can skip this step if the model you're trying to deploy is already registered.
MODEL_NAME='heart-classifier-sklpipe'
az ml model create --name $MODEL_NAME --type "custom_model" --path "model"
Creating a scoring script
We need to create a scoring script that can read the input data provided by the batch deployment and return the scores of the model. We are also going to write directly to the output folder of the job. In summary, the proposed scoring script does the following:
- Reads the input data as CSV files.
- Runs the model's predict function over the input data.
- Appends the predictions to a pandas.DataFrame along with the input data.
- Writes the data to a file named the same as the input file, but in parquet format.
code/batch_driver.py
import os
import pickle
import glob
import pandas as pd
from pathlib import Path
from typing import List


def init():
    global model
    global output_path

    # AZUREML_BI_OUTPUT_PATH is the folder where this job's custom outputs
    # should be written
    output_path = os.environ["AZUREML_BI_OUTPUT_PATH"]

    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder.
    model_path = os.environ["AZUREML_MODEL_DIR"]
    model_file = glob.glob(f"{model_path}/*/*.pkl")[-1]

    with open(model_file, "rb") as file:
        model = pickle.load(file)


def run(mini_batch: List[str]):
    for file_path in mini_batch:
        data = pd.read_csv(file_path)
        pred = model.predict(data)

        # Append the predictions to the original input data
        data["prediction"] = pred

        # Write the result as a parquet file named after the input file
        output_file_name = Path(file_path).stem
        output_file_path = os.path.join(output_path, output_file_name + ".parquet")
        data.to_parquet(output_file_path)

    return mini_batch
Remarks:
- Notice how the environment variable AZUREML_BI_OUTPUT_PATH is used to get access to the output path of the deployment job.
- The init() function populates a global variable called output_path that can be used later to know where to write.
- The run method returns a list of the processed files. It is required for the run function to return a list or a pandas.DataFrame object.
Warning
Take into account that all the batch executors have write access to this path at the same time. This means that you need to account for concurrency. In this case, we ensure each executor writes its own file by using the input file name as the name of the output file.
Creating the endpoint
We are going to create a batch endpoint named heart-classifier-batch to deploy the model to.
Decide on the name of the endpoint. The name of the endpoint ends up in the URI associated with your endpoint. Because of that, batch endpoint names need to be unique within an Azure region. For example, there can be only one batch endpoint with the name mybatchendpoint in westus2.
Configure your batch endpoint:
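For example, a minimal endpoint.yml along these lines should work (the description text here is illustrative; the schema URL and aad_token authentication mode follow the standard batch endpoint YAML):

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: heart-classifier-batch
description: A heart condition classifier for batch inference.
auth_mode: aad_token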
Create the endpoint:
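Assuming the configuration above was saved as endpoint.yml, the following commands store the endpoint name in a variable and create the endpoint:

ENDPOINT_NAME="heart-classifier-batch"
az ml batch-endpoint create --name $ENDPOINT_NAME --file endpoint.yml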
Creating the deployment
Follow these steps to create a deployment using the previous scoring script:
First, let's create an environment where the scoring script can be executed:
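The environment is defined inline in the deployment YAML shown next, through a base image plus a conda file (environment/conda.yaml in the sample repository), so no separate registration step is needed. The exact package pins live in the repository; the conda file below is an illustrative sketch only:

channels:
  - conda-forge
dependencies:
  - python=3.9
  - pip
  - pip:
      # Packages assumed from the model and script requirements; adjust to
      # match the sample repository's environment/conda.yaml
      - xgboost
      - scikit-learn
      - pandas
      - pyarrow
      - azureml-core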
Create the deployment. Notice that output_action is now set to summary_only.
Note
This example assumes you have a compute cluster with name batch-cluster. Change that name accordingly.
To create a new deployment under the created endpoint, create a YAML configuration like the following. You can check the full batch endpoint YAML schema for extra properties.

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: heart-classifier-batch
name: classifier-xgboost-custom
description: A heart condition classifier based on XGBoost and Scikit-Learn pipelines that append predictions on parquet files.
type: model
model: azureml:heart-classifier-sklpipe@latest
environment:
  name: batch-mlflow-xgboost
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
  conda_file: environment/conda.yaml
code_configuration:
  code: code
  scoring_script: batch_driver.py
compute: azureml:batch-cluster
resources:
  instance_count: 2
settings:
  max_concurrency_per_instance: 2
  mini_batch_size: 2
  output_action: summary_only
  retry_settings:
    max_retries: 3
    timeout: 300
  error_threshold: -1
  logging_level: info
Then, create the deployment with the following command:
az ml batch-deployment create --file deployment.yml --endpoint-name $ENDPOINT_NAME --set-default
At this point, our batch endpoint is ready to be used.
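Optionally, you can confirm that the new deployment was set as the endpoint's default (the --query path below follows the defaults section of the batch endpoint resource):

az ml batch-endpoint show --name $ENDPOINT_NAME --query defaults.deployment_name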
Testing out the deployment
To test our endpoint, we are going to use a sample of unlabeled data located in this repository that can be used with the model. Batch endpoints can only process data that is located in the cloud and is accessible from the Azure Machine Learning workspace. In this example, we upload it to an Azure Machine Learning data store. Particularly, we create a data asset that can be used to invoke the endpoint for scoring. However, notice that batch endpoints accept data that can be placed in multiple types of locations.
Let's invoke the endpoint with data from a storage account:
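A sketch of this step follows; the data asset name heart-dataset-unlabeled and the local path data are assumptions based on the sample layout, so adjust them to your setup:

# Register the unlabeled data as a data asset the endpoint can read
az ml data create --name heart-dataset-unlabeled --type uri_folder --path data

# Invoke the endpoint and capture the job name for monitoring
JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input azureml:heart-dataset-unlabeled@latest --query name --output tsv)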
A batch job is started as soon as the command returns. You can monitor the status of the job until it finishes:
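For example, you can stream the job's logs until it completes, or open the job in Azure Machine Learning studio:

az ml job stream --name $JOB_NAME

az ml job show --name $JOB_NAME --web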
Analyzing the outputs
The job generates a named output called score where all the generated files are placed. Since we wrote one file per input file directly into the directory, we can expect the same number of output files as input files. In this particular example, the output files are named the same as the inputs, but they have a parquet extension.
Note
Notice that a file named predictions.csv is also included in the output folder. This file contains a summary of the processed files.
You can download the results of the job by using the job name. To download the predictions, use the following command:
az ml job download --name $JOB_NAME --output-name score --download-path ./
Once the files are downloaded, you can open them using your favorite tool. The following example loads the predictions into a single Pandas dataframe:
import pandas as pd
import glob
output_files = glob.glob("named-outputs/score/*.parquet")
score = pd.concat((pd.read_parquet(f) for f in output_files))
score
The output looks as follows:
age | sex | ... | thal | prediction |
---|---|---|---|---|
63 | 1 | ... | fixed | 0 |
67 | 1 | ... | normal | 1 |
67 | 1 | ... | reversible | 0 |
37 | 1 | ... | normal | 0 |
Clean up resources
Run the following code to delete the batch endpoint and all the underlying deployments. Batch scoring jobs won't be deleted.
az ml batch-endpoint delete --name $ENDPOINT_NAME --yes