Deploy language models in batch endpoints

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Batch Endpoints can be used to deploy expensive models, like language models, over text data. In this tutorial, you learn how to deploy a model that can perform text summarization of long sequences of text using a model from HuggingFace. It also shows how to do inference optimization using HuggingFace optimum and accelerate libraries.

About this sample

The model we are going to work with was built using the popular library transformers from HuggingFace along with a pre-trained model from Facebook with the BART architecture. It was introduced in the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation. This model has the following constraints, which are important to keep in mind for deployment:

  • It can work with sequences up to 1024 tokens.
  • It is trained for summarization of text in English.
  • We are going to use Torch as a backend.

The example in this article is based on code samples contained in the azureml-examples repository. To run the commands locally without having to copy/paste YAML and other files, first clone the repo and then change directories to the folder:

git clone --depth 1
cd azureml-examples/cli

The files for this example are in:

cd endpoints/batch/deploy-models/huggingface-text-summarization

Follow along in Jupyter Notebooks

You can follow along this sample in a Jupyter Notebook. In the cloned repository, open the notebook: text-summarization-batch.ipynb.


Before following the steps in this article, make sure you have the following prerequisites:

  • An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.

  • An Azure Machine Learning workspace. If you don't have one, use the steps in the Manage Azure Machine Learning workspaces article to create one.

  • Ensure that you have the following permissions in the workspace:

    • Create or manage batch endpoints and deployments: Use an Owner, Contributor, or Custom role that allows Microsoft.MachineLearningServices/workspaces/batchEndpoints/*.

    • Create ARM deployments in the workspace resource group: Use an Owner, Contributor, or Custom role that allows Microsoft.Resources/deployments/write in the resource group where the workspace is deployed.

  • You need to install the following software to work with Azure Machine Learning:

    The Azure CLI and the ml extension for Azure Machine Learning.

    az extension add -n ml


    Pipeline component deployments for Batch Endpoints were introduced in version 2.7 of the ml extension for Azure CLI. Use az extension update --name ml to get the last version of it.

Connect to your workspace

The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section, we'll connect to the workspace in which you'll perform deployment tasks.

Pass in the values for your subscription ID, workspace, location, and resource group in the following code:

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Registering the model

Due to the size of the model, it hasn't been included in this repository. Instead, you can download a copy from the HuggingFace model's hub. You need the packages transformers and torch installed in the environment you are using.

%pip install transformers torch

Use the following code to download the model to a folder model:

from transformers import pipeline

model = pipeline("summarization", model="facebook/bart-large-cnn")
model_local_path = 'model'

We can now register this model in the Azure Machine Learning registry:

az ml model create --name $MODEL_NAME --path "model"

Creating the endpoint

We are going to create a batch endpoint named text-summarization-batch where to deploy the HuggingFace model to run text summarization on text files in English.

  1. Decide on the name of the endpoint. The name of the endpoint ends-up in the URI associated with your endpoint. Because of that, batch endpoint names need to be unique within an Azure region. For example, there can be only one batch endpoint with the name mybatchendpoint in westus2.

    In this case, let's place the name of the endpoint in a variable so we can easily reference it later.

  2. Configure your batch endpoint

    The following YAML file defines a batch endpoint:


    name: text-summarization-batch
    description: A batch endpoint for summarizing text using a HuggingFace transformer model.
    auth_mode: aad_token
  3. Create the endpoint:

    az ml batch-endpoint create --file endpoint.yml  --name $ENDPOINT_NAME

Creating the deployment

Let's create the deployment that hosts the model:

  1. We need to create a scoring script that can read the CSV files provided by the batch deployment and return the scores of the model with the summary. The following script performs these actions:

    • Indicates an init function that detects the hardware configuration (CPU vs GPU) and loads the model accordingly. Both the model and the tokenizer are loaded in global variables. We are not using a pipeline object from HuggingFace to account for the limitation in the sequence lenghs of the model we are currently using.
    • Notice that we are doing performing model optimizations to improve the performance using optimum and accelerate libraries. If the model or hardware doesn't support it, we will run the deployment without such optimizations.
    • Indicates a run function that is executed for each mini-batch the batch deployment provides.
    • The run function read the entire batch using the datasets library. The text we need to summarize is on the column text.
    • The run method iterates over each of the rows of the text and run the prediction. Since this is a very expensive model, running the prediction over entire files will result in an out-of-memory exception. Notice that the model is not execute with the pipeline object from transformers. This is done to account for long sequences of text and the limitation of 1024 tokens in the underlying model we are using.
    • It returns the summary of the provided text.


    import os
    import time
    import torch
    import subprocess
    import mlflow
    from pprint import pprint
    from transformers import AutoTokenizer, BartForConditionalGeneration
    from optimum.bettertransformer import BetterTransformer
    from datasets import load_dataset
    def init():
        global model
        global tokenizer
        global device
        cuda_available = torch.cuda.is_available()
        device = "cuda" if cuda_available else "cpu"
        if cuda_available:
            print(f"[INFO] CUDA version: {torch.version.cuda}")
            print(f"[INFO] ID of current CUDA device: {torch.cuda.current_device()}")
            print("[INFO] nvidia-smi output:")
      ["nvidia-smi"], stdout=subprocess.PIPE).stdout.decode(
                "[WARN] CUDA acceleration is not available. This model takes hours to run on medium size data."
        # AZUREML_MODEL_DIR is an environment variable created during deployment
        model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")
        # load the tokenizer
        tokenizer = AutoTokenizer.from_pretrained(
            model_path, truncation=True, max_length=1024
        # Load the model
            model = BartForConditionalGeneration.from_pretrained(
                model_path, device_map="auto"
        except Exception as e:
                f"[ERROR] Error happened when loading the model on GPU or the default device. Error: {e}"
            print("[INFO] Trying on CPU.")
            model = BartForConditionalGeneration.from_pretrained(model_path)
            device = "cpu"
        # Optimize the model
        if device != "cpu":
                model = BetterTransformer.transform(model, keep_original_model=False)
                print("[INFO] BetterTransformer loaded.")
            except Exception as e:
                    f"[ERROR] Error when converting to BetterTransformer. An unoptimized version of the model will be used.\n\t> {e}"
        mlflow.log_param("device", device)
        mlflow.log_param("model", type(model).__name__)
    def run(mini_batch):
        resultList = []
        print(f"[INFO] Reading new mini-batch of {len(mini_batch)} file(s).")
        ds = load_dataset("csv", data_files={"score": mini_batch})
        start_time = time.perf_counter()
        for idx, text in enumerate(ds["score"]["text"]):
            # perform inference
            inputs = tokenizer.batch_encode_plus(
                [text], truncation=True, padding=True, max_length=1024, return_tensors="pt"
            input_ids = inputs["input_ids"].to(device)
            summary_ids = model.generate(
                input_ids, max_length=130, min_length=30, do_sample=False
            summaries = tokenizer.batch_decode(
                summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
            # Get results:
            rps = idx / (time.perf_counter() - start_time + 00000.1)
            print("Rows per second:", rps)
        mlflow.log_metric("rows_per_second", rps)
        return resultList


    Although files are provided in mini-batches by the deployment, this scoring script processes one row at a time. This is a common pattern when dealing with expensive models (like transformers) as trying to load the entire batch and send it to the model at once may result in high-memory pressure on the batch executor (OOM exeptions).

  2. We need to indicate over which environment we are going to run the deployment. In our case, our model runs on Torch and it requires the libraries transformers, accelerate, and optimum from HuggingFace. Azure Machine Learning already has an environment with Torch and GPU support available. We are just going to add a couple of dependencies in a conda.yaml file.


    name: huggingface-env
      - conda-forge
      - python=3.8.5
      - pip
      - pip:
        - torch==2.0
        - transformers
        - accelerate
        - optimum
        - datasets
        - mlflow
        - azureml-mlflow
        - azureml-core
        - azureml-dataset-runtime[fuse]
  3. We can use the conda file mentioned before as follows:

    The environment definition is included in the deployment file.


    compute: azureml:gpu-cluster
      name: torch200-transformers-gpu


    The environment torch200-transformers-gpu we've created requires a CUDA 11.8 compatible hardware device to run Torch 2.0 and Ubuntu 20.04. If your GPU device doesn't support this version of CUDA, you can check the alternative torch113-conda.yaml conda environment (also available on the repository), which runs Torch 1.3 over Ubuntu 18.04 with CUDA 10.1. However, acceleration using the optimum and accelerate libraries won't be supported on this configuration.

  4. Each deployment runs on compute clusters. They support both Azure Machine Learning Compute clusters (AmlCompute) or Kubernetes clusters. In this example, our model can benefit from GPU acceleration, which is why we use a GPU cluster.

    az ml compute create -n gpu-cluster --type amlcompute --size STANDARD_NV6 --min-instances 0 --max-instances 2


    You are not charged for compute at this point as the cluster remains at 0 nodes until a batch endpoint is invoked and a batch scoring job is submitted. Learn more about manage and optimize cost for AmlCompute.

  5. Now, let's create the deployment.

    To create a new deployment under the created endpoint, create a YAML configuration like the following. You can check the full batch endpoint YAML schema for extra properties.


    endpoint_name: text-summarization-batch
    name: text-summarization-optimum
    description: A text summarization deployment implemented with HuggingFace and BART architecture with GPU optimization using Optimum.
    type: model
    model: azureml:bart-text-summarization@latest
    compute: azureml:gpu-cluster
      name: torch200-transformers-gpu
      conda_file: environment/torch200-conda.yaml
      code: code
      instance_count: 2
      max_concurrency_per_instance: 1
      mini_batch_size: 1
      output_action: append_row
      output_file_name: predictions.csv
        max_retries: 1
        timeout: 3000
      error_threshold: -1
      logging_level: info

    Then, create the deployment with the following command:

    az ml batch-deployment create --file deployment.yml --endpoint-name $ENDPOINT_NAME --set-default


    You will notice in this deployment a high value in timeout in the parameter retry_settings. The reason for it is due to the nature of the model we are running. This is a very expensive model and inference on a single row may take up to 60 seconds. The timeout parameters controls how much time the Batch Deployment should wait for the scoring script to finish processing each mini-batch. Since our model runs predictions row by row, processing a long file may take time. Also notice that the number of files per batch is set to 1 (mini_batch_size=1). This is again related to the nature of the work we are doing. Processing one file at a time per batch is expensive enough to justify it. You will notice this being a pattern in NLP processing.

  6. Although you can invoke a specific deployment inside of an endpoint, you usually want to invoke the endpoint itself and let the endpoint decide which deployment to use. Such deployment is named the "default" deployment. This gives you the possibility of changing the default deployment and hence changing the model serving the deployment without changing the contract with the user invoking the endpoint. Use the following instruction to update the default deployment:

    az ml batch-endpoint update --name $ENDPOINT_NAME --set defaults.deployment_name=$DEPLOYMENT_NAME
  7. At this point, our batch endpoint is ready to be used.

Testing out the deployment

For testing our endpoint, we are going to use a sample of the dataset BillSum: A Corpus for Automatic Summarization of US Legislation. This sample is included in the repository in the folder data. Notice that the format of the data is CSV and the content to be summarized is under the column text, as expected by the model.

  1. Let's invoke the endpoint:

    JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input data --input-type uri_folder --query name -o tsv)


    The utility jq may not be installed on every installation. You can get instructions in this link.


    Notice that by indicating a local path as an input, the data is uploaded to Azure Machine Learning default's storage account.

  2. A batch job is started as soon as the command returns. You can monitor the status of the job until it finishes:

    az ml job show -n $JOB_NAME --web
  3. Once the deployment is finished, we can download the predictions:

    To download the predictions, use the following command:

    az ml job download --name $JOB_NAME --output-name score --download-path .

Considerations when deploying models that process text

As mentioned in some of the notes along this tutorial, processing text may have some peculiarities that require specific configuration for batch deployments. Take the following consideration when designing the batch deployment:

  • Some NLP models may be very expensive in terms of memory and compute time. If this is the case, consider decreasing the number of files included on each mini-batch. In the example above, the number was taken to the minimum, 1 file per batch. While this may not be your case, take into consideration how many files your model can score at each time. Have in mind that the relationship between the size of the input and the memory footprint of your model may not be linear for deep learning models.
  • If your model can't even handle one file at a time (like in this example), consider reading the input data in rows/chunks. Implement batching at the row level if you need to achieve higher throughput or hardware utilization.
  • Set the timeout value of your deployment accordly to how expensive your model is and how much data you expect to process. Remember that the timeout indicates the time the batch deployment would wait for your scoring script to run for a given batch. If your batch have many files or files with many rows, this impacts the right value of this parameter.

Considerations for MLflow models that process text

The same considerations mentioned above apply to MLflow models. However, since you are not required to provide a scoring script for your MLflow model deployment, some of the recommendations mentioned may require a different approach.

  • MLflow models in Batch Endpoints support reading tabular data as input data, which may contain long sequences of text. See File's types support for details about which file types are supported.
  • Batch deployments calls your MLflow model's predict function with the content of an entire file in as Pandas dataframe. If your input data contains many rows, chances are that running a complex model (like the one presented in this tutorial) results in an out-of-memory exception. If this is your case, you can consider: