Set up AutoML to train computer vision models

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

In this article, you learn how to train computer vision models on image data with automated ML, using the Azure Machine Learning CLI extension v2 or the Azure Machine Learning Python SDK v2.

Automated ML supports model training for computer vision tasks like image classification, object detection, and instance segmentation. Authoring AutoML models for computer vision tasks is currently supported via the Azure Machine Learning Python SDK v2 and CLI v2. The resulting experimentation runs, models, and outputs are accessible from the Azure Machine Learning studio UI. Learn more about automated ML for computer vision tasks on image data.

Prerequisites

APPLIES TO: Azure CLI ml extension v2 (current)

  • An Azure Machine Learning workspace. To create the workspace, see Create workspace resources.
  • The Azure CLI installed and set up with the ml extension (v2).

Select your task type

Automated ML for images supports the following task types:

| Task type | AutoML Job syntax |
| --- | --- |
| image classification | CLI v2: image_classification<br>SDK v2: image_classification() |
| image classification multi-label | CLI v2: image_classification_multilabel<br>SDK v2: image_classification_multilabel() |
| image object detection | CLI v2: image_object_detection<br>SDK v2: image_object_detection() |
| image instance segmentation | CLI v2: image_instance_segmentation<br>SDK v2: image_instance_segmentation() |

APPLIES TO: Azure CLI ml extension v2 (current)

The task type is a required parameter and can be set using the task key.

For example:

task: image_object_detection

Training and validation data

In order to generate computer vision models, you need to bring labeled image data as input for model training in the form of an MLTable. You can create an MLTable from training data in JSONL format.

If your training data is in a different format (like Pascal VOC or COCO), you can apply the helper scripts included with the sample notebooks to convert the data to JSONL. Learn more about how to prepare data for computer vision tasks with automated ML.

Note

The training data needs to have at least 10 images in order to be able to submit an AutoML run.

Warning

For this capability, creating an MLTable from data in JSONL format is supported only through the SDK and the CLI. Creating the MLTable through the UI isn't supported at this time.

JSONL schema samples

The structure of the JSONL file depends on the task at hand. For computer vision task types, it consists of the following fields:

| Field | Description |
| --- | --- |
| image_url | Contains the image file path as a StreamInfo object |
| image_details | Image metadata, which consists of height, width, and format. This field is optional. |
| label | A JSON representation of the image label, based on the task type |

The following is a sample JSONL file for image classification:

{
    "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_01.png",
    "image_details":
    {
        "format": "png",
        "width": "2230px",
        "height": "4356px"
    },
    "label": "cat"
}
{
    "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_02.jpeg",
    "image_details":
    {
        "format": "jpeg",
        "width": "3456px",
        "height": "3467px"
    },
    "label": "dog"
}

The following code is a sample JSONL file for object detection:

{
    "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_01.png",
    "image_details":
    {
        "format": "png",
        "width": "2230px",
        "height": "4356px"
    },
    "label":
    {
        "label": "cat",
        "topX": "0.1",
        "topY": "0.2",
        "bottomX": "0.8",
        "bottomY": "0.7",
        "isCrowd": "true"
    }
}
{
    "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/image_data/Image_02.jpeg",
    "image_details":
    {
        "format": "jpeg",
        "width": "1230px",
        "height": "2356px"
    },
    "label":
    {
        "label": "dog",
        "topX": "0.2",
        "topY": "0.3",
        "bottomX": "0.9",
        "bottomY": "0.8",
        "isCrowd": "false"
    }
}

Consume data

Once your data is in JSONL format, you can create the training and validation MLTable files as shown below:

paths:
  - file: ./train_annotations.jsonl
transformations:
  - read_json_lines:
        encoding: utf8
        invalid_lines: error
        include_path_column: false
  - convert_column_types:
      - columns: image_url
        column_type: stream_info

Automated ML doesn't impose any constraints on training or validation data size for computer vision tasks. Maximum dataset size is only limited by the storage layer behind the dataset (that is, blob storage). There's no minimum number of images or labels. However, we recommend starting with a minimum of 10-15 samples per label to ensure the output model is sufficiently trained. The higher the total number of labels/classes, the more samples you need per label.

APPLIES TO: Azure CLI ml extension v2 (current)

Training data is a required parameter and is passed in using the training_data key. You can optionally specify another MLTable as validation data with the validation_data key. If no validation data is specified, 20% of your training data is used for validation by default, unless you pass the validation_data_size argument with a different value.

The target column name is a required parameter and is used as the target for the supervised ML task. It's passed in using the target_column_name key. For example:

target_column_name: label
training_data:
  path: data/training-mltable-folder
  type: mltable
validation_data:
  path: data/validation-mltable-folder
  type: mltable

Compute to run experiment

Provide a compute target for automated ML to conduct model training. Automated ML models for computer vision tasks require GPU SKUs and support NC and ND families. We recommend the NCsv3-series (with V100 GPUs) for faster training. A compute target with a multi-GPU VM SKU uses multiple GPUs to speed up training. Additionally, when you set up a compute target with multiple nodes, you can conduct faster model training through parallelism when tuning hyperparameters for your model.

Note

If you are using a compute instance as your compute target, please make sure that multiple AutoML jobs are not run at the same time. Also, please make sure that max_concurrent_trials is set to 1 in your job limits.

The compute target is passed in using the compute parameter. For example:

APPLIES TO: Azure CLI ml extension v2 (current)

compute: azureml:gpu-cluster
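
If the workspace doesn't yet contain a GPU cluster to reference, you can create one with the CLI first. The following is a minimal sketch; the gpu-cluster name and the Standard_NC6s_v3 size are assumptions to adapt to your quota and region:

az ml compute create --name gpu-cluster --type AmlCompute --size Standard_NC6s_v3 --min-instances 0 --max-instances 4 --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]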

Configure model algorithms and hyperparameters

With support for computer vision tasks, you can control the model algorithm and sweep hyperparameters. These model algorithms and hyperparameters are passed in as the parameter space for the sweep.

The model algorithm is required and is passed in via the model_name parameter. You can either specify a single model_name or choose from multiple.

Supported model algorithms

The following table summarizes the supported models for each computer vision task.

| Task | Model algorithms | String literal syntax<br>(default_model denoted with *) |
| --- | --- | --- |
| Image classification<br>(multi-class and multi-label) | MobileNet: Lightweight models for mobile applications<br>ResNet: Residual networks<br>ResNeSt: Split attention networks<br>SE-ResNeXt50: Squeeze-and-Excitation networks<br>ViT: Vision transformer networks | mobilenetv2<br>resnet18<br>resnet34<br>resnet50<br>resnet101<br>resnet152<br>resnest50<br>resnest101<br>seresnext<br>vits16r224 (small)<br>vitb16r224* (base)<br>vitl16r224 (large) |
| Object detection | YOLOv5: One-stage object detection model<br>Faster RCNN ResNet FPN: Two-stage object detection models<br>RetinaNet ResNet FPN: Addresses class imbalance with Focal Loss<br><br>Note: Refer to the model_size hyperparameter for YOLOv5 model sizes. | yolov5*<br>fasterrcnn_resnet18_fpn<br>fasterrcnn_resnet34_fpn<br>fasterrcnn_resnet50_fpn<br>fasterrcnn_resnet101_fpn<br>fasterrcnn_resnet152_fpn<br>retinanet_resnet50_fpn |
| Instance segmentation | MaskRCNN ResNet FPN | maskrcnn_resnet18_fpn<br>maskrcnn_resnet34_fpn<br>maskrcnn_resnet50_fpn*<br>maskrcnn_resnet101_fpn<br>maskrcnn_resnet152_fpn |

In addition to controlling the model algorithm, you can also tune hyperparameters used for model training. While many of the hyperparameters exposed are model-agnostic, there are instances where hyperparameters are task-specific or model-specific. Learn more about the available hyperparameters for these instances.
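For example, to stay with a single algorithm but override a couple of common training hyperparameters, you can extend the training_parameters section. The values below are illustrative assumptions, not tuned recommendations:

training_parameters:
  model_name: yolov5
  number_of_epochs: 30
  learning_rate: 0.01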

Data augmentation

In general, deep learning model performance can often improve with more data. Data augmentation is a practical technique to amplify the data size and variability of a dataset which helps to prevent overfitting and improve the model’s generalization ability on unseen data. Automated ML applies different data augmentation techniques based on the computer vision task, before feeding input images to the model. Currently, there's no exposed hyperparameter to control data augmentations.

| Task | Impacted dataset | Data augmentation technique(s) applied |
| --- | --- | --- |
| Image classification (multi-class and multi-label) | Training<br><br>Validation & Test | Random resize and crop, horizontal flip, color jitter (brightness, contrast, saturation, and hue), normalization using channel-wise ImageNet’s mean and standard deviation<br><br>Resize, center crop, normalization |
| Object detection, instance segmentation | Training<br><br>Validation & Test | Random crop around bounding boxes, expand, horizontal flip, normalization, resize<br><br>Normalization, resize |
| Object detection using yolov5 | Training<br><br>Validation & Test | Mosaic, random affine (rotation, translation, scale, shear), horizontal flip<br><br>Letterbox resizing |

Configure your experiment settings

Before doing a large sweep to search for the optimal models and hyperparameters, we recommend trying the default values to get a first baseline. Next, you can explore multiple hyperparameters for the same model before sweeping over multiple models and their parameters. This way, you can employ a more iterative approach, because with multiple models and multiple hyperparameters for each, the search space grows exponentially and you need more iterations to find optimal configurations.

APPLIES TO: Azure CLI ml extension v2 (current)

If you wish to use the default hyperparameter values for a given algorithm (say yolov5), you can specify it using the model_name key in the training_parameters section. For example:

training_parameters:
    model_name: yolov5

Once you've built a baseline model, you might want to optimize model performance by sweeping over the model algorithm and hyperparameter space. You can use the following sample config to sweep over the hyperparameters for each algorithm, choosing from a range of values for learning_rate, optimizer, lr_scheduler, and so on, to generate a model with the optimal primary metric. If hyperparameter values aren't specified, then default values are used for the specified algorithm.

Primary metric

The primary metric used for model optimization and hyperparameter tuning depends on the task type, as listed below; an example of setting it explicitly follows the list. Using other primary metric values is currently not supported.

  • accuracy for IMAGE_CLASSIFICATION
  • iou for IMAGE_CLASSIFICATION_MULTILABEL
  • mean_average_precision for IMAGE_OBJECT_DETECTION
  • mean_average_precision for IMAGE_INSTANCE_SEGMENTATION
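
Although you can't change the metric, you can state it explicitly in the job YAML. A minimal sketch for an image object detection job, assuming the primary_metric key alongside the task key shown earlier:

task: image_object_detection
primary_metric: mean_average_precision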

Job limits

You can control the resources spent on your AutoML image training job by specifying timeout_minutes, max_trials, and max_concurrent_trials for the job in limit settings, as described in the following example.

| Parameter | Detail |
| --- | --- |
| max_trials | The maximum number of trials (configurations) to sweep. Must be an integer between 1 and 1000. When exploring just the default hyperparameters for a given model algorithm, set this parameter to 1. The default value is 1. |
| max_concurrent_trials | The maximum number of trials that can run concurrently. If specified, must be an integer between 1 and 100. The default value is 1.<br><br>NOTE: The number of concurrent trials is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency. |
| timeout_minutes | The amount of time in minutes before the experiment terminates. If not specified, the default experiment timeout is seven days (maximum 60 days). |

APPLIES TO: Azure CLI ml extension v2 (current)

limits:
  timeout_minutes: 60
  max_trials: 10
  max_concurrent_trials: 2

Sweeping hyperparameters for your model

When training computer vision models, model performance depends heavily on the hyperparameter values selected. Often, you might want to tune the hyperparameters to get optimal performance. With support for computer vision tasks in automated ML, you can sweep hyperparameters to find the optimal settings for your model. This feature applies the hyperparameter tuning capabilities in Azure Machine Learning. Learn how to tune hyperparameters.

APPLIES TO: Azure CLI ml extension v2 (current)

search_space:
  - model_name:
      type: choice
      values: [yolov5]
    learning_rate:
      type: uniform
      min_value: 0.0001
      max_value: 0.01
    model_size:
      type: choice
      values: [small, medium]

  - model_name:
      type: choice
      values: [fasterrcnn_resnet50_fpn]
    learning_rate:
      type: uniform
      min_value: 0.0001
      max_value: 0.001
    optimizer:
      type: choice
      values: [sgd, adam, adamw]
    min_size:
      type: choice
      values: [600, 800]

Define the parameter search space

You can define the model algorithms and hyperparameters to sweep in the parameter space.

Sampling methods for the sweep

When sweeping hyperparameters, you need to specify the sampling method to use for sweeping over the defined parameter space. Currently, the following sampling methods are supported with the sampling_algorithm parameter:

| Sampling type | AutoML Job syntax |
| --- | --- |
| Random Sampling | random |
| Grid Sampling | grid |
| Bayesian Sampling | bayesian |

Note

Currently only random and grid sampling support conditional hyperparameter spaces.

Early termination policies

You can automatically end poorly performing runs with an early termination policy. Early termination improves computational efficiency, saving compute resources that would have been otherwise spent on less promising configurations. Automated ML for images supports the following early termination policies using the early_termination parameter. If no termination policy is specified, all configurations are run to completion.

| Early termination policy | AutoML Job syntax |
| --- | --- |
| Bandit policy | CLI v2: bandit<br>SDK v2: BanditPolicy() |
| Median stopping policy | CLI v2: median_stopping<br>SDK v2: MedianStoppingPolicy() |
| Truncation selection policy | CLI v2: truncation_selection<br>SDK v2: TruncationSelectionPolicy() |

Learn more about how to configure the early termination policy for your hyperparameter sweep.

Note

For a complete sweep configuration sample, please refer to this tutorial.

You can configure all the sweep-related parameters as shown in the following example.

APPLIES TO: Azure CLI ml extension v2 (current)

sweep:
  sampling_algorithm: random
  early_termination:
    type: bandit
    evaluation_interval: 2
    slack_factor: 0.2
    delay_evaluation: 6

Fixed settings

You can pass fixed settings or parameters that don't change during the parameter space sweep as shown below.

APPLIES TO: Azure CLI ml extension v2 (current)

training_parameters:
  early_stopping: True
  evaluation_frequency: 1

Incremental training (optional)

Once the training run is done, you have the option to further train the model by loading the trained model checkpoint. You can either use the same dataset or a different one for incremental training.

Pass the checkpoint via run ID

You can pass the run ID that you want to load the checkpoint from.

APPLIES TO: Azure CLI ml extension v2 (current)

training_parameters:
  checkpoint_run_id: "target_checkpoint_run_id"

Submit the AutoML job

APPLIES TO: Azure CLI ml extension v2 (current)

To submit your AutoML job, you run the following CLI v2 command with the path to your .yml file, workspace name, resource group and subscription ID.

az ml job create --file ./hello-automl-job-basic.yml --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

Outputs and evaluation metrics

The automated ML training runs generate output model files, evaluation metrics, logs, and deployment artifacts like the scoring file and the environment file, which can be viewed from the outputs, logs, and metrics tabs of the child runs.

Tip

Check how to navigate to the run results from the View run results section.

For definitions and examples of the performance charts and metrics provided for each run, see Evaluate automated machine learning experiment results.

Register and deploy model

Once the run completes, you can register the model that was created from the best run (the configuration that resulted in the best primary metric). You can register the model either after downloading it or by specifying the azureml path with the corresponding jobid. Note: If you want to change the inference settings described below, you need to download the model, change settings.json, and register using the updated model folder.
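
For the download route, one option is to pull the job outputs locally before editing settings.json and registering. A minimal sketch, assuming the best run's job name is stored in a $best_run shell variable (the same placeholder used in the registration command below) and that ./artifacts is the desired local folder:

az ml job download --name $best_run --all --download-path ./artifacts --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]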

Get the best run

APPLIES TO: Azure CLI ml extension v2 (current)

CLI example not available, please use Python SDK.

Register the model

Register the model either using the azureml path or your locally downloaded path.

APPLIES TO: Azure CLI ml extension v2 (current)

az ml model create --name od-fridge-items-mlflow-model --version 1 --path azureml://jobs/$best_run/outputs/artifacts/outputs/mlflow-model/ --type mlflow_model --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

After you register the model you want to use, you can deploy it using a managed online endpoint.

Configure online endpoint

APPLIES TO: Azure CLI ml extension v2 (current)

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: od-fridge-items-endpoint
auth_mode: key

Create the endpoint

Using the YAML configuration above, we'll now create the endpoint in the workspace. This command starts the endpoint creation and returns a confirmation response while the endpoint creation continues.

APPLIES TO: Azure CLI ml extension v2 (current)

az ml online-endpoint create --file .\create_endpoint.yml --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

Configure online deployment

A deployment is a set of resources required for hosting the model that does the actual inferencing. We'll create a deployment for our endpoint using the following YAML configuration. You can use either GPU or CPU VM SKUs for your deployment cluster.

APPLIES TO: Azure CLI ml extension v2 (current)

name: od-fridge-items-mlflow-deploy
endpoint_name: od-fridge-items-endpoint
model: azureml:od-fridge-items-mlflow-model@latest
instance_type: Standard_DS3_v2
instance_count: 1
liveness_probe:
    failure_threshold: 30
    success_threshold: 1
    timeout: 2
    period: 10
    initial_delay: 2000
readiness_probe:
    failure_threshold: 10
    success_threshold: 1
    timeout: 10
    period: 10
    initial_delay: 2000 

Create the deployment

Using the YAML configuration above, we'll now create the deployment in the workspace. This command starts the deployment creation and returns a confirmation response while the deployment creation continues.

APPLIES TO: Azure CLI ml extension v2 (current)

az ml online-deployment create --file .\create_deployment.yml --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

Update traffic

By default, the current deployment is set to receive 0% traffic. You can set the traffic percentage that the current deployment should receive. The sum of the traffic percentages of all deployments under one endpoint shouldn't exceed 100%.

APPLIES TO: Azure CLI ml extension v2 (current)

az ml online-endpoint update --name 'od-fridge-items-endpoint' --traffic 'od-fridge-items-mlflow-deploy=100' --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

Alternatively, you can deploy the model from the Azure Machine Learning studio UI. Navigate to the model you wish to deploy in the Models tab of the automated ML run, select Deploy, and then select Deploy to real-time endpoint.

Screenshot of the Deployment page after selecting the Deploy option.

This is how your review page looks. You can select the instance type and instance count, and set the traffic percentage for the current deployment.

Screenshot of the top of the review page after selecting the options to deploy. Screenshot of the bottom of the review page after selecting the options to deploy.

Update inference settings

In the previous step, we downloaded the mlflow-model/artifacts/settings.json file from the best model. You can use it to update the inference settings before registering the model, although we recommend using the same parameters as training for the best performance.

Each of the tasks (and some models) has a set of parameters. By default, we use the same values for the parameters that were used during the training and validation. Depending on the behavior that we need when using the model for inference, we can change these parameters. Below you can find a list of parameters for each task type and model.

| Task | Parameter name | Default |
| --- | --- | --- |
| Image classification (multi-class and multi-label) | valid_resize_size<br>valid_crop_size | 256<br>224 |
| Object detection | min_size<br>max_size<br>box_score_thresh<br>nms_iou_thresh<br>box_detections_per_img | 600<br>1333<br>0.3<br>0.5<br>100 |
| Object detection using yolov5 | img_size<br>model_size<br>box_score_thresh<br>nms_iou_thresh | 640<br>medium<br>0.1<br>0.5 |
| Instance segmentation | min_size<br>max_size<br>box_score_thresh<br>nms_iou_thresh<br>box_detections_per_img<br>mask_pixel_score_threshold<br>max_number_of_polygon_points<br>export_as_image<br>image_type | 600<br>1333<br>0.3<br>0.5<br>100<br>0.5<br>100<br>False<br>JPG |

For a detailed description on task specific hyperparameters, please refer to Hyperparameters for computer vision tasks in automated machine learning.

If you want to use tiling and control tiling behavior, the following parameters are available: tile_grid_size, tile_overlap_ratio, and tile_predictions_nms_thresh, as sketched below. For more details on these parameters, please check Train a small object detection model using AutoML.
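
As an illustration, enabling tiling could look like the following sketch under training_parameters. The grid size format and the threshold values here are assumptions to verify against the linked article, not recommended settings:

training_parameters:
  tile_grid_size: "3x2"
  tile_overlap_ratio: 0.25
  tile_predictions_nms_thresh: 0.25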

Test the deployment

Please check the Test the deployment section to test the deployment and visualize the detections from the model.
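
A quick smoke test can also be run from the CLI. A minimal sketch, where sample_request.json is a hypothetical file containing a request payload in the format the deployed model expects:

az ml online-endpoint invoke --name od-fridge-items-endpoint --deployment-name od-fridge-items-mlflow-deploy --request-file sample_request.json --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]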

Large datasets

If you're using AutoML to train on large datasets, there are some experimental settings that may be useful.

Important

These settings are currently in public preview. They are provided without a service-level agreement. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Multi-GPU and multi-node training

By default, each model trains on a single VM. If training a model is taking too much time, using VMs that contain multiple GPUs may help. The time to train a model on large datasets should decrease in roughly linear proportion to the number of GPUs used. (For instance, a model should train roughly twice as fast on a VM with two GPUs as on a VM with one GPU.) If the time to train a model is still high on a VM with multiple GPUs, you can increase the number of VMs used to train each model. Similar to multi-GPU training, the time to train a model on large datasets should also decrease in roughly linear proportion to the number of VMs used. When training a model across multiple VMs, be sure to use a compute SKU that supports InfiniBand for best results. You can configure the number of VMs used to train a single model by setting the node_count_per_trial property of the AutoML job.

APPLIES TO: Azure CLI ml extension v2 (current)

properties:
  node_count_per_trial: "2"

Streaming image files from storage

By default, all image files are downloaded to disk prior to model training. If the size of the image files is greater than the available disk space, the run fails. Instead of downloading all images to disk, you can opt to stream image files from Azure storage as they're needed during training. Image files are streamed from Azure storage directly to system memory, bypassing disk. At the same time, as many files as possible from storage are cached on disk to minimize the number of requests to storage.

Note

If streaming is enabled, ensure the Azure storage account is located in the same region as compute to minimize cost and latency.

APPLIES TO: Azure CLI ml extension v2 (current)

training_parameters:
  advanced_settings: >
    {"stream_image_files": true}

Example notebooks

Review detailed code examples and use cases in the GitHub notebook repository for automated machine learning samples. Please check the folders with 'automl-image-' prefix for samples specific to building computer vision models.
