Prepare data for computer vision tasks with automated machine learning v1

APPLIES TO: Python SDK azureml v1

Important

Some of the Azure CLI commands in this article use the azure-cli-ml, or v1, extension for Azure Machine Learning. Support for the v1 extension will end on September 30, 2025. You will be able to install and use the v1 extension until that date.

We recommend that you transition to the ml, or v2, extension before September 30, 2025. For more information on the v2 extension, see Azure ML CLI extension and Python SDK v2.

Important

Support for training computer vision models with automated ML in Azure Machine Learning is an experimental public preview feature. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

In this article, you learn how to prepare image data for training computer vision models with automated machine learning in Azure Machine Learning.

To generate models for computer vision tasks with AutoML, you need to bring labeled image data as input for model training in the form of an Azure Machine Learning TabularDataset.

To ensure your TabularDataset contains the accepted schema for consumption in automated ML, you can use the Azure Machine Learning data labeling tool or use a conversion script.

Prerequisites

Azure Machine Learning data labeling

If you don't have labeled data, you can use Azure Machine Learning's data labeling tool to manually label images. This tool automatically generates the data required for training in the accepted format.

The tool helps you create, manage, and monitor data labeling tasks for:

  • Image classification (multi-class and multi-label)
  • Object detection (bounding box)
  • Instance segmentation (polygon)

If you already have a data labeling project and you want to use that data, you can export your labeled data as an Azure Machine Learning TabularDataset, which can then be used directly with automated ML for training computer vision models.

Use conversion scripts

If you have labeled data in popular computer vision data formats, like VOC or COCO, helper scripts to generate JSONL files for training and validation data are available in notebook examples.
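As a rough illustration of what such a conversion involves, the following sketch turns a parsed COCO annotation dictionary into object detection JSONL lines. It is not the helper script from the notebook examples. It assumes the object detection schema uses `image_url` plus a `label` list with normalized `topX`, `topY`, `bottomX`, `bottomY`, and `isCrowd` fields; confirm this against the schema article before relying on it. The function name `coco_to_jsonl_lines` and the `url_prefix` parameter are hypothetical.

```python
import json


def coco_to_jsonl_lines(coco, url_prefix):
    """Convert a parsed COCO annotation dict into a list of JSONL strings.

    Assumption: field names follow the AutoML object detection schema;
    verify them against the schema article. `url_prefix` is a placeholder.
    """
    categories = {c["id"]: c["name"] for c in coco["categories"]}

    # Group annotations by the image they belong to.
    by_image = {}
    for ann in coco["annotations"]:
        by_image.setdefault(ann["image_id"], []).append(ann)

    lines = []
    for image in coco["images"]:
        width, height = image["width"], image["height"]
        labels = []
        for ann in by_image.get(image["id"], []):
            # COCO boxes are [top-left x, top-left y, box width, box height] in pixels.
            x, y, w, h = ann["bbox"]
            labels.append({
                "label": categories[ann["category_id"]],
                "topX": x / width,
                "topY": y / height,
                "bottomX": (x + w) / width,
                "bottomY": (y + h) / height,
                "isCrowd": ann.get("iscrowd", 0),
            })
        lines.append(json.dumps({
            "image_url": url_prefix + image["file_name"],
            "label": labels,
        }))
    return lines
```

Each returned string is one line of the JSONL file, so you can write the result with `"\n".join(lines)`.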

If your data doesn't follow any of the previously mentioned formats, you can use your own script to generate JSON Lines files. To generate JSON Lines files, use schemas defined in Schema for JSONL files for AutoML image experiments.
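A custom script can be quite small. The following minimal sketch writes a JSON Lines file for multi-class image classification from a list of (relative image path, label) pairs. It assumes each line needs only an `image_url` and a `label` field; check the schema article for the exact fields your task requires. The function name `write_classification_jsonl` and the `url_prefix` parameter are hypothetical, and the prefix value you pass depends on where the images are stored.

```python
import json
from pathlib import Path


def write_classification_jsonl(records, output_path, url_prefix):
    """Write one JSON object per line for (relative_image_path, label) pairs.

    Assumption: `image_url` and `label` are the fields the schema expects
    for multi-class classification. `url_prefix` is a placeholder.
    """
    output_path = Path(output_path)
    with output_path.open("w", encoding="utf-8") as f:
        for relative_path, label in records:
            line = {"image_url": url_prefix + relative_path, "label": label}
            f.write(json.dumps(line) + "\n")
    return output_path
```

For example, `write_classification_jsonl([("can/1.jpg", "can")], "train_annotations.jsonl", url_prefix)` produces a one-line file, where `url_prefix` is whatever URL form the schema article specifies for your datastore.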

After your data files are converted to the accepted JSONL format, you can upload them to your storage account on Azure.
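Before you upload, it can help to confirm that every line in a JSONL file parses and carries the fields you expect. The following sketch is one way to do that locally. The function name `find_jsonl_problems` is hypothetical, and the default `required_keys` are an assumption based on the classification case; adjust them to match the schema for your task.

```python
import json
from pathlib import Path


def find_jsonl_problems(jsonl_path, required_keys=("image_url", "label")):
    """Return a list of (line_number, message) for lines that fail basic checks."""
    problems = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for number, raw in enumerate(f, start=1):
            if not raw.strip():
                problems.append((number, "blank line"))
                continue
            try:
                record = json.loads(raw)
            except json.JSONDecodeError as err:
                problems.append((number, "invalid JSON: " + err.msg))
                continue
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((number, "missing keys: " + ", ".join(missing)))
    return problems
```

An empty list means every line passed these checks. The function doesn't validate label values or verify that the referenced images exist.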

Upload the JSONL file and images to storage

To use the data for automated ML training, upload the data to your Azure Machine Learning workspace via a datastore. The datastore provides a mechanism for you to upload/download data to storage on Azure, and interact with it from your remote compute targets.

Upload the entire parent directory consisting of images and JSONL files to the default datastore that is automatically created upon workspace creation. This datastore connects to the default Azure blob storage container that was created as part of workspace creation.

# Retrieve the default datastore that's automatically created when you set up a workspace.
# `ws` is assumed to be an existing azureml.core.Workspace object.
ds = ws.get_default_datastore()
ds.upload(src_dir='./fridgeObjects', target_path='fridgeObjects')

Once the data upload is done, you can create an Azure Machine Learning TabularDataset. Then, register the dataset to your workspace for future use as input to your automated ML experiments for computer vision models.

from azureml.core import Dataset
from azureml.data import DataType

training_dataset_name = 'fridgeObjectsTrainingDataset'
# Create the training dataset from the uploaded JSONL file
training_dataset = Dataset.Tabular.from_json_lines_files(
    path=ds.path("fridgeObjects/train_annotations.jsonl"),
    set_column_types={"image_url": DataType.to_stream(ds.workspace)},
)
# Register the dataset in the workspace so it can be reused later
training_dataset = training_dataset.register(workspace=ws, name=training_dataset_name)

print("Training dataset name: " + training_dataset.name)

Next steps