Labeling data to enable accurate AI model evaluation

Data labeling is the process of adding metadata and other information to existing data. The labels enrich the data and give downstream processes something to act on. Although labeling is not necessary for every project, it is helpful for projects involving unstructured blob data such as images, videos, and documents. It can also be useful for structured data such as JSON and CSV files.

Why do you need to label data?

Labeling data is primarily done for the following reasons:

  • To train a supervised model that can classify new, unseen data.
  • To index the labels in a search solution such as Azure AI Search, so that users can search on meaningful terms.

For more information on how to label images in Azure Machine Learning, see Set up an image labeling project.

What options are available for labeling data?

Data labeling can be approached in various ways, including manual labeling within the organization, outsourcing it, or using machine learning to automate it. It can be a very time-consuming and expensive exercise and, if not done well, leads to poor model performance and accuracy.

Approach | Pros | Cons
Within the organization | Users know the data and terminology; full control; data does not leave the organization | Team needs to be dedicated to labeling; time consuming; expensive; requires training
External | Trained professionals; can be done quickly | Data needs to leave the organization or external access granted; loss of control
Crowdsourced | Fast; cost efficient | Data needs to leave the organization or external access granted; loss of control; quality not guaranteed
Automated via ML | Can be cost efficient; full control; data does not need to leave the organization | Requires human validation; requires a Data Scientist for best results
Synthetic data | Protects private data; full control; data does not need to leave the organization | Requires human validation; requires a Data Scientist for best results; can be time consuming to generate realistic and representative data
Programmatic labeling | Full control; data does not need to leave the organization | Requires a Data Scientist for best results; requires development effort

Within the organization

This approach gives an organization full control of the labeling process. Sensitive data need not leave the organization, but people must be trained and dedicated to the task.

Some data labeling tools

Tool/Service | Modality | Options
Azure Machine Learning Data Labeling | Images, Text | AutoML and automatic labeling
Label Studio | Images, Text, Audio, Time-Series | Custom UI
Sloth | Images, Video |
Doccano | Text | Tactical Radio Audio Processing (MORSE)
Label Box | Text, Images, Audio, Medical Imagery, Geospatial, Video | Industry-specific solutions
Playment | Text, Video, 3D Sensor, Audio, Geospatial | Synthetic data
Light Tag | Text |
SuperAnnotate | Text, Video, Images | Full end-to-end pipelines
CVAT | Video, Images |
DataTurks | Text, Video, Images |
spaCy Explosion | Text |
v7 | Images |
Supervisely | Images |
Universal Data Tool | Images, Text, Audio, Video | No installation
DataLoop | Images, Video, 3D Sensor |
Yolo Mark | Images |
Pixel Annotation Tool | Images |
Open Labeling | Images, Video |
Med Tagger | Images, Medical Imagery |
Semi auto-image annotation tool | Images |

External

With this option, labeling can be outsourced to an external company. It can be cost effective and quick, but external access must be granted to the data.

Some External and Crowdsourced labeling companies

Service | Modality
Azure Machine Learning Vendor Labeling | Images, Text
v7 | Images
Amazon Mechanical Turk | Images, Text
ClickWorker | Images, Video, Audio
Appen | Images, Video, Audio

Automated via ML

This approach uses unsupervised ML techniques to cluster the data and then relies on human input to assess the clusters. Once the data is organized into meaningful clusters, labeling each cluster with relevant information becomes easy, and the labels can be extended to all the records within that cluster.

Enriching the data with meaningful labels makes it possible to quickly index the information using Azure AI Search or to train a supervised model. For more detail, refer to the Exploratory Data Analysis phase.

The Data Discovery solution uses unsupervised ML techniques to cluster large amounts of unstructured data automatically so that a domain expert can quickly label the cluster and the underlying records.
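The cluster-then-label idea described above can be sketched in a few lines of Python. This is a minimal illustration and not the Data Discovery solution's implementation: the tiny `kmeans` function stands in for whatever unsupervised technique a project actually uses, and the feature vectors, the `propagate_labels` helper, and the label names are all hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny k-means for illustration; returns one cluster index per point."""
    # Deterministic start: use the first k points as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments

def propagate_labels(assignments, cluster_labels):
    """Extend the label a domain expert gave each cluster to all its records."""
    return [cluster_labels.get(a, "unlabeled") for a in assignments]

# Hypothetical 2-D feature vectors, e.g. embeddings reduced to two dimensions.
records = [(0.1, 0.2), (5.0, 5.1), (0.2, 0.1), (5.2, 4.9), (0.0, 0.3), (4.8, 5.0)]
assignments = kmeans(records, k=2)

# A domain expert reviews a few samples per cluster and names each cluster once.
expert_labels = {0: "invoice", 1: "receipt"}
labels = propagate_labels(assignments, expert_labels)
print(list(zip(records, labels)))
```

The expert labels two clusters instead of six records. On real data the saving grows with cluster size, and a sample of the propagated labels should still be validated by a human.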

Synthetic data generation

Synthetic data is artificially generated data that is representative of real data. It can be beneficial for many reasons. For example, personally identifiable information (PII) and other sensitive data can be removed or obfuscated so that the data can be shared more broadly. Synthetic data generation requires development effort but offers a lot of flexibility.
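As a rough sketch of the obfuscation case mentioned above, the following Python replaces the PII fields of a record with generated values while leaving non-sensitive fields untouched. The field names, the name pools, and the `synthesize` helper are hypothetical and chosen only for illustration. Realistic synthetic data generation usually also needs to preserve the statistical properties of the original data, which this sketch does not attempt.

```python
import random

# Hypothetical pools of replacement values.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]
LAST_NAMES = ["Smith", "Lee", "Garcia", "Khan"]

def synthesize(record, pii_fields, rng):
    """Return a copy of record with each PII field replaced by a synthetic value."""
    synthetic = dict(record)
    for field in pii_fields:
        if field == "name":
            synthetic[field] = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        elif field == "phone":
            digits = "".join(str(rng.randint(0, 9)) for _ in range(4))
            synthetic[field] = "555-" + digits
        else:
            # Fields without a generator are removed rather than guessed.
            synthetic[field] = "REDACTED"
    return synthetic

rng = random.Random(42)  # seeded so that runs are repeatable
record = {
    "name": "Maria Rossi",
    "phone": "206-0100",
    "email": "maria@example.com",
    "amount": 125.40,
}
synthetic = synthesize(record, pii_fields=["name", "phone", "email"], rng=rng)
print(synthetic)
```

The non-sensitive `amount` field passes through unchanged, so the record stays useful for labeling and model training while the sensitive fields no longer identify a person.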

Programmatic labeling

This approach takes ground truth values and uses them to label the data programmatically rather than manually, so a large amount of data can be labeled quickly.

An example scenario would be if a customer wants to train a Form Recognizer model to extract values from a form. Form Recognizer utilizes transfer learning to train the underlying model, requiring only a few labeled forms per form type. However, when dealing with thousands of form types, a considerable amount of manual labeling is still necessary.

With programmatic labeling, the extracted ground truth values can be mapped to the existing forms and thus the labeling process can be automated.
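The mapping step can be sketched as follows, assuming the form text has already been extracted into a list of words with bounding boxes and that the ground truth is a simple field-to-value dictionary. The input shapes, the `find_span` and `label_form` helpers, and the output schema are hypothetical. They do not reproduce the Form Recognizer label file format.

```python
def normalize(text):
    """Lowercase and keep only letters and digits so 'INV-1001' matches 'inv 1001'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def find_span(norm_words, target):
    """Return (start, end) of the first run of words that spells target, else None."""
    if not target:
        return None
    for start in range(len(norm_words)):
        if not norm_words[start]:
            continue  # do not start a match on a punctuation-only word
        joined = ""
        for end in range(start, len(norm_words)):
            joined += norm_words[end]
            if joined == target:
                return start, end
            if not target.startswith(joined):
                break
    return None

def label_form(ocr_words, ground_truth):
    """Map each ground truth value to the OCR words (and boxes) that contain it."""
    norm_words = [normalize(w["text"]) for w in ocr_words]
    labels = []
    for field, value in ground_truth.items():
        span = find_span(norm_words, normalize(value))
        if span is None:
            continue  # value not found on the page; leave for manual review
        words = ocr_words[span[0]:span[1] + 1]
        labels.append({
            "field": field,
            "text": " ".join(w["text"] for w in words),
            "boxes": [w["bbox"] for w in words],
        })
    return labels

# Hypothetical OCR output: one entry per word with an (x1, y1, x2, y2) box.
ocr_words = [
    {"text": "Invoice", "bbox": (10, 10, 60, 20)},
    {"text": "No:", "bbox": (65, 10, 85, 20)},
    {"text": "INV-1001", "bbox": (90, 10, 150, 20)},
    {"text": "Customer", "bbox": (10, 30, 70, 40)},
    {"text": "Contoso", "bbox": (75, 30, 125, 40)},
    {"text": "Ltd", "bbox": (130, 30, 150, 40)},
    {"text": "Total", "bbox": (10, 50, 45, 60)},
    {"text": "42.50", "bbox": (50, 50, 85, 60)},
]
ground_truth = {"invoice_number": "INV-1001", "customer": "Contoso Ltd", "total": "42.50"}

labels = label_form(ocr_words, ground_truth)
for label in labels:
    print(label["field"], "->", label["text"], label["boxes"])
```

Exact matching after normalization is a deliberate simplification. Real forms often need fuzzy matching to cope with OCR errors and values that are formatted differently in the ground truth, and any value that cannot be located should be routed to manual review.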

The image below illustrates the overall process:

[Figure: Programmatic labeling]

For assets and guidance, refer to the Auto-Labelling section of the forms knowledge extraction repo.