Get started with object detection on Azure


You can create an object detection machine learning model by using advanced deep learning techniques. However, this approach requires significant expertise and a large volume of training data. The Azure AI Custom Vision service in Azure enables you to create object detection models that meet the needs of many computer vision scenarios with minimal deep learning expertise and fewer training images.

Azure resources for Azure AI Custom Vision

Creating an object detection solution with Azure AI Custom Vision consists of three main tasks:

  • Upload and tag images
  • Train the model
  • Publish the trained model so client applications can use it to generate predictions

For each of these tasks, you need a resource in your Azure subscription. You can use the following types of resource:

  • Custom Vision: A dedicated resource for Azure AI Custom Vision. You can create a training resource, a prediction resource, or both training and prediction resource.
  • Azure AI services: A general resource that includes Azure AI Custom Vision along with many other Azure AI services. You can use this type of resource for training, prediction, or both.

The separation of training and prediction resources is useful when you want to track resource utilization for model training separately from client applications using the model to predict image classes. However, it can make development of an image classification solution a little confusing.

The simplest approach is to use a general Azure AI services resource for both training and prediction. When you use a general resource, you only need to concern yourself with one endpoint (the HTTP address at which your service is hosted) and key (a secret value used by client applications to authenticate themselves).

If you choose to create a Custom Vision resource, you can choose training, prediction, or both. It's important to note that if you choose "both", then two resources are created - one for training and one for prediction.

It's also possible to take a mix-and-match approach in which you use a dedicated Custom Vision resource for training, but deploy your model to an Azure AI services resource for prediction. With this approach, make sure the training and prediction resources are created in the same region.

Image tagging

Before you can train an object detection model, you must tag the classes and bounding box coordinates in a set of training images. This process can be time-consuming, but the Custom Vision portal provides a graphical interface that makes it straightforward. The interface can automatically detect discrete objects in the image and suggest those areas to you. You can apply a class label to these suggested bounding boxes or drag to adjust the bounding box area. Additionally, after tagging and training with an initial dataset, Azure AI Computer Vision can use smart tagging to suggest classes and bounding boxes for images you add to the training dataset.

Keep in mind a few key considerations when tagging training images for object detection. Ensure that you have sufficient images of the objects in question, preferably from multiple angles. It is also important to make sure that the bounding boxes are tight around each object.

Model training and evaluation

To train the model, you can use the Custom Vision portal, or if you have the necessary coding experience you can use one of Azure AI Custom Vision's programming language-specific software development kits (SDKs). Training an object detection model can take some time, depending on the number of training images, classes, and objects within each image.

Model training process is an iterative process. Azure AI Custom Vision repeatedly trains the model using some of the data, but holds some back to evaluate the model. At the end of the training process, you can use the following evaluation metrics to judge the performance of the trained model:

  • Precision: What percentage of class predictions did the model correctly identify? For example, if the model predicted that 10 images are oranges, of which eight were actually oranges, then the precision is 0.8 (80%).
  • Recall: What percentage of the class predictions made by the model were correct? For example, if there are 10 images of apples, and the model found 7 of them, then the recall is 0.7 (70%).
  • Mean Average Precision (mAP): An overall metric that takes into account both precision and recall across all classes.

Using the model for prediction

After you've trained the model, and you're satisfied with its evaluated performance, you can publish the model to your prediction resource. When you publish the model, you can assign it a name (the default is "IterationX", where X is the number of times you have trained the model).

To use your model, client application developers need the following information:

  • Project ID: The unique ID of the Custom Vision project you created to train the model.
  • Model name: The name you assigned to the model during publishing.
  • Prediction endpoint: The HTTP address of the endpoints for the prediction resource to which you published the model (not the training resource).
  • Prediction key: The authentication key for the prediction resource to which you published the model (not the training resource).