Labeling data to enable accurate AI model evaluation

Article
06/26/2024

Data labeling is the process of adding metadata and information to existing data. It enriches the data and gives downstream processes something to act on. Although it is not necessary for every project, data labeling is helpful for projects involving unstructured blob data such as images, videos, and documents. It can also be useful for structured data, such as JSON and CSV files.

Why do you need to label data?

Labeling data is primarily done for the following reasons:

To train a supervised model to enable classification of unseen new data.
To index the labels for a search solution such as Azure AI Search, to enable users to search for meaningful terms.

For more information on how to label images in Azure ML, see Set up an image labeling project.

What options are available for labeling data?

Data labeling can be approached in various ways, including manual labeling within the organization, outsourcing it, or using machine learning to automate it. It can be a very time-consuming and expensive exercise, and if it is not done well, it leads to poor model performance and accuracy.

Approach	Pros	Cons
Within the organization	Users know the data and terminology; full control; data does not leave the organization	Team needs to be dedicated to labeling; time-consuming; expensive; requires training
External	Trained professionals; can be done quickly	Data needs to leave the organization or external access must be granted; loss of control
Crowdsourced	Fast; cost-efficient	Data needs to leave the organization or external access must be granted; loss of control; quality not guaranteed
Automated via ML	Can be cost-efficient; full control; data does not need to leave the organization	Requires human validation; requires a data scientist for best results
Synthetic data	Protects private data; full control; data does not need to leave the organization	Requires human validation; requires a data scientist for best results; can be time-consuming to generate realistic and representative data
Programmatic labeling	Full control; data does not need to leave the organization	Requires a data scientist for best results; requires development effort

Within the organization

This approach allows an organization to keep full control of the labeling process. Sensitive data does not need to leave the organization, but people must be trained and dedicated to the task.

Some data labeling tools

Tool/Service	Modality	Options
Azure Machine Learning Data Labeling	Images, Text	AutoML and automatic labeling
Label Studio	Images, Text, Audio, Time-Series	Custom UI
Sloth	Images, Video
Doccano	Text	Tactical Radio Audio Processing (MORSE)
Label Box	Text, Images, Audio, Medical Imagery, Geospatial, Video	Industry-specific solutions
Playment	Text, Video, 3D Sensor, Audio, Geospatial	Synthetic data
Light Tag	Text
SuperAnnotate	Text, Video, Images	Full end to end pipelines
CVAT	Video, Images
DataTurks	Text, Video, Images
spaCy Explosion	Text
v7	Images
Supervisely	Images
Universal Data Tool	Images, Text, Audio, Video	No installation
DataLoop	Images, Video, 3D Sensor
Yolo Mark	Images
Pixel Annotation Tool	Images
Open Labeling	Images, Video
Med Tagger	Images, Medical Imagery
Semi auto-image annotation tool	Images

External

With this option, labeling is outsourced to an external company. It can be cost-effective and quick, but the external company must be granted access to the data.

Some External and Crowdsourced labeling companies

Service	Modality
Azure Machine Learning Vendor Labeling	Images, Text
v7	Images
Amazon Mechanical Turk	Images, Text
ClickWorker	Images, Video, Audio
Appen	Images, Video, Audio

Automated via ML

This approach uses unsupervised ML approaches to cluster the data together and then requires human input to assess the clusters. After organizing the data into meaningful clusters, labeling the clusters with relevant information becomes easy, and the labels can be extended to all the records within the clusters.
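The cluster-then-label flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular tool: the record IDs, the two-dimensional feature vectors, the tiny k-means routine, and the label names are all hypothetical, and a real project would derive features from the documents themselves and would likely use a library clustering implementation.

```python
import math
from collections import defaultdict


def kmeans(points, k, iterations=20):
    """Tiny k-means: returns one cluster index per point."""
    if k > len(points):
        raise ValueError("k must not exceed the number of points")
    # Deterministic start (an assumption for this sketch):
    # use the first k points as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        members = defaultdict(list)
        for p, c in zip(points, assignments):
            members[c].append(p)
        for c, pts in members.items():
            centroids[c] = [sum(dim) / len(pts) for dim in zip(*pts)]
    return assignments


# Hypothetical records: an ID mapped to a numeric feature vector.
records = {
    "doc-1": (1.0, 1.2),
    "doc-2": (8.0, 8.1),
    "doc-3": (0.9, 0.8),
    "doc-4": (7.9, 8.3),
    "doc-5": (1.1, 1.0),
}

ids = list(records)
assignments = kmeans([records[i] for i in ids], k=2)

# A domain expert inspects a few records per cluster and names each cluster.
# These names are made up for the example.
cluster_labels = {0: "invoice", 1: "contract"}

# The cluster's label is then extended to every record in that cluster.
labels = {doc_id: cluster_labels[c] for doc_id, c in zip(ids, assignments)}
```

The human effort here is limited to naming two clusters instead of labeling five records, and the saving grows with the number of records per cluster. The propagated labels still need spot checks, because a record placed in the wrong cluster silently inherits the wrong label.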

Enriching the data with meaningful labels makes it quick to index the information using Azure AI Search or to train a supervised model. For more detail, refer to the Exploratory Data Analysis phase.

The Data Discovery solution uses unsupervised ML techniques to cluster large amounts of unstructured data automatically so that a domain expert can quickly label the cluster and the underlying records.

Synthetic data generation

Synthetic data is artificially generated data that is representative of real data. It can be beneficial for many reasons. For example, PII and other sensitive data can be removed or obfuscated so that the data can be shared more broadly. Synthetic data generation requires development effort but offers a lot of flexibility.
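As a minimal sketch of the obfuscation case, the Python below replaces the sensitive fields of each record with generated values while keeping the non-sensitive fields and the label. Everything in it is an assumption made for illustration: the record layout, the field names, the small name pools, and the `synthesize` helper are hypothetical, and realistic synthetic data usually needs far more care to stay representative of the source data.

```python
import random

# Small pools of made-up values (hypothetical); a real project would use
# larger pools or a dedicated generator.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]
LAST_NAMES = ["Lee", "Patel", "Garcia", "Kim"]


def synthesize(record, rng):
    """Return a copy of `record` with its PII fields replaced."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    synthetic = dict(record)  # non-sensitive fields and the label are kept
    synthetic["name"] = f"{first} {last}"
    synthetic["email"] = f"{first}.{last}@example.com".lower()
    return synthetic


# Placeholder source records standing in for real, sensitive data.
source_records = [
    {"name": "REAL NAME 1", "email": "real1@corp.invalid",
     "amount": 120.50, "label": "invoice"},
    {"name": "REAL NAME 2", "email": "real2@corp.invalid",
     "amount": 89.99, "label": "receipt"},
]

# A seeded generator keeps the output reproducible between runs.
rng = random.Random(0)
synthetic_records = [synthesize(r, rng) for r in source_records]
```

The labels survive the transformation, so the synthetic records can be shared or used for training without exposing the original names and email addresses. Human validation is still needed to confirm that no sensitive value leaks through a field the sketch does not cover.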

Programmatic labeling

This approach takes known ground truth values and uses them to label the data programmatically instead of manually. It can therefore be a way to label a large amount of data quickly.

An example scenario would be if a customer wants to train a Form Recognizer model to extract values from a form. Form Recognizer utilizes transfer learning to train the underlying model, requiring only a few labeled forms per form type. However, when dealing with thousands of form types, a considerable amount of manual labeling is still necessary.

With programmatic labeling, the extracted ground truth values can be mapped to the existing forms and thus the labeling process can be automated.
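One way to picture this mapping is the Python sketch below. It is a hedged illustration, not Form Recognizer's label format or API: the OCR word structure (`text` plus a `box` tuple), the field names, and the `auto_label` helper are all hypothetical. The idea is to search the words read from a form for each known ground truth value and record where it was found.

```python
def auto_label(ocr_words, ground_truth):
    """Match ground-truth values to OCR words; return (labels, unmatched)."""
    labels, unmatched = [], []
    normalized = [w["text"].strip().lower() for w in ocr_words]
    for field, value in ground_truth.items():
        target = value.strip().lower().split()
        match = None
        if target:
            # Slide a window over the OCR words looking for the value's tokens.
            for start in range(len(normalized) - len(target) + 1):
                if normalized[start:start + len(target)] == target:
                    match = ocr_words[start:start + len(target)]
                    break
        if match is None:
            unmatched.append(field)  # left for a human to label or review
        else:
            labels.append({
                "field": field,
                "text": value,
                "boxes": [w["box"] for w in match],
            })
    return labels, unmatched


# Hypothetical OCR output: each word with a (left, top, right, bottom) box.
ocr_words = [
    {"text": "Invoice", "box": (10, 10, 80, 30)},
    {"text": "No:", "box": (85, 10, 110, 30)},
    {"text": "INV-1001", "box": (115, 10, 190, 30)},
    {"text": "Customer:", "box": (10, 40, 90, 60)},
    {"text": "Contoso", "box": (95, 40, 160, 60)},
    {"text": "Ltd", "box": (165, 40, 190, 60)},
]

# Ground truth values already known for this form, for example from a database.
ground_truth = {
    "invoice_number": "INV-1001",
    "customer": "Contoso Ltd",
    "total": "99.00",
}

labels, unmatched = auto_label(ocr_words, ground_truth)
```

Exact matching is the simplest possible strategy. Values that OCR reads slightly differently end up in `unmatched`, which is why the sketch returns that list for human review rather than dropping the fields silently.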

The image below illustrates the overall process:

[Image: Programmatic labeling]

For assets and guidance, refer to the Auto-Labelling section of the forms knowledge extraction repo.
