Labeling data to enable accurate AI model evaluation
Data labeling is the process of adding metadata and information to existing data. It enriches the existing data so that downstream processes can act on it. Although it is not necessary for every project, data labeling is especially helpful for projects involving unstructured blob data such as images, videos, and documents. However, it can also be useful for structured data such as JSON and CSV files.
Why do you need to label data?
Labeling data is primarily done for the following reasons:
- To train a supervised model that can classify new, unseen data.
- To index the labels in a search solution such as Azure AI Search, so that users can search for meaningful terms.
For more information on how to label images in Azure ML, see Set up an image labeling project.
What options are available for labeling data?
Data labeling can be approached in various ways, including manual labeling within the organization, outsourcing it, or using machine learning to automate it. It can be a time-consuming and expensive exercise, and if not done well, it will lead to poor model performance and accuracy.
Approach | Pros | Cons |
---|---|---|
Within the organization | Users know the data and terminology. Full control. Data does not leave the organization. | Team needs to be dedicated to labeling. Time consuming. Expensive. Requires training. |
External | Trained professionals. Can be done quickly. | Data needs to leave the organization, or external access must be granted. Loss of control. |
Crowdsourced | Fast. Cost efficient. | Data needs to leave the organization, or external access must be granted. Loss of control. Quality not guaranteed. |
Automated via ML | Can be cost efficient. Full control. Data does not need to leave the organization. | Requires human validation. Requires a Data Scientist for best results. |
Synthetic data | Protects private data. Full control. Data does not need to leave the organization. | Requires human validation. Requires a Data Scientist for best results. Can be time consuming to generate realistic and representative data. |
Programmatic labeling | Full control. Data does not need to leave the organization. | Requires a Data Scientist for best results. Requires development effort. |
Within the organization
This approach allows an organization to have full control of the labeling process. Sensitive data does not need to leave the organization, but it requires resources to be trained and dedicated to the task.
Some data labeling tools
Tool/Service | Modality | Options |
---|---|---|
Azure Machine Learning Data Labeling | Images, Text | AutoML and automatic labeling |
Label Studio | Images, Text, Audio, Time-Series | Custom UI |
Sloth | Images, Video | |
Doccano | Text | Tactical Radio Audio Processing (MORSE) |
Label Box | Text, Images, Audio, Medical Imagery, Geospatial, Video | Industry-specific solutions |
Playment | Text, Video, 3D Sensor, Audio, Geospatial | Synthetic data |
Light Tag | Text | |
SuperAnnotate | Text, Video, Images | Full end-to-end pipelines |
CVAT | Video, Images | |
DataTurks | Text, Video, Images | |
spaCy Explosion | Text | |
v7 | Images | |
Supervisely | Images | |
Universal Data Tool | Images, Text, Audio, Video | No installation |
DataLoop | Images, Video, 3D Sensor | |
Yolo Mark | Images | |
Pixel Annotation Tool | Images | |
Open Labeling | Images, Video | |
Med Tagger | Images, Medical Imagery | |
Semi auto-image annotation tool | Images | |
External
With this option, labeling is outsourced to an external company. It can be cost-effective and quick, but external access to the data must be granted.
Some External and Crowdsourced labeling companies
Service | Modality |
---|---|
Azure Machine Learning Vendor Labeling | Images, Text |
v7 | Images |
Amazon Mechanical Turk | Images, Text |
ClickWorker | Images, Video, Audio |
Appen | Images, Video, Audio |
Automated via ML
This approach uses unsupervised ML techniques to cluster the data and then requires human input to assess the clusters. Once the data is organized into meaningful clusters, labeling each cluster with relevant information becomes easy, and the labels can be extended to all the records within the cluster.
Enriching the data with meaningful labels makes it possible to quickly index the information with Azure AI Search or to train a supervised model. For more detail, refer to the Exploratory Data Analysis phase.
The Data Discovery solution uses unsupervised ML techniques to cluster large amounts of unstructured data automatically so that a domain expert can quickly label each cluster and its underlying records.
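To make the cluster-then-label workflow concrete, here is a minimal sketch (not the Data Discovery solution itself) that uses scikit-learn to cluster a handful of placeholder documents, lets a domain expert name each cluster, and then propagates that name to every record in the cluster. The document texts and cluster names are assumptions for illustration only.

```python
# Minimal sketch: cluster unlabeled text, have a human name each cluster,
# then propagate the cluster label to every record it contains.
# Assumes scikit-learn is installed; the documents below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "invoice total amount due payment",
    "purchase order item quantity supplier",
    "invoice billing address tax total",
    "purchase order delivery date supplier",
]

# Vectorize the documents and group them into k clusters.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(documents)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(features)

# A domain expert inspects a few samples per cluster and names each one.
# The mapping below is a hypothetical result of that inspection.
cluster_labels = {0: "invoice", 1: "purchase_order"}

# Extend the human-provided cluster label to all records in that cluster.
labeled_records = [
    {"text": doc, "label": cluster_labels[cluster_id]}
    for doc, cluster_id in zip(documents, cluster_ids)
]
print(labeled_records)
```

In practice the expert would review representative samples per cluster before assigning names, since cluster IDs themselves carry no meaning.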
Synthetic data generation
Synthetic data is artificially generated data that is representative of real data. It can be beneficial for many reasons; for example, PII and other sensitive data can be removed or obfuscated so that the data can be shared more broadly. Synthetic data generation requires development effort but offers a great deal of flexibility.
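As a simple illustration (a minimal sketch, not a full synthetic data pipeline), the snippet below uses the Faker library, an assumption for this example, to generate synthetic customer records that mirror the shape of real PII-bearing data without exposing any real values.

```python
# Minimal sketch: generate synthetic customer records that mirror the schema
# of real data without containing real PII. Assumes the Faker package is installed.
import csv
from faker import Faker

fake = Faker()
Faker.seed(42)  # make the sketch reproducible

def synthetic_customer() -> dict:
    """Return one synthetic record with the same fields as the real data."""
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address().replace("\n", ", "),
        "signup_date": fake.date_between(start_date="-2y").isoformat(),
    }

# Write a small synthetic dataset that can be shared more broadly.
with open("synthetic_customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "email", "address", "signup_date"])
    writer.writeheader()
    for _ in range(100):
        writer.writerow(synthetic_customer())
```

Generating realistic, representative synthetic data for a specific domain usually takes more effort than this, which is why human validation is still required.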
Programmatic labeling
This approach entails taking ground truth values and programmatically labeling the data rather than labeling it manually, so it can be a way to label a large amount of data quickly.
An example scenario would be if a customer wants to train a Form Recognizer model to extract values from a form. Form Recognizer utilizes transfer learning to train the underlying model, requiring only a few labeled forms per form type. However, when dealing with thousands of form types, a considerable amount of manual labeling is still necessary.
With programmatic labeling, the extracted ground truth values can be mapped to the existing forms and thus the labeling process can be automated.
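The sketch below illustrates the idea under simplifying assumptions: given the OCR words and bounding boxes for a form (hard-coded here for brevity) and the known ground truth values, it searches the recognized text for each value and emits a label entry mapping the field name to the matching words. The field names, document name, and output structure are hypothetical and are not the exact label format that Form Recognizer expects.

```python
# Minimal sketch of programmatic labeling: map known ground truth values
# to the words recognized on a form, producing label entries automatically.
# The OCR output and ground truth below are hypothetical placeholders.
import json

# Words and bounding boxes as they might come back from an OCR step.
ocr_words = [
    {"text": "Invoice", "bounding_box": [0.10, 0.10, 0.30, 0.10, 0.30, 0.15, 0.10, 0.15]},
    {"text": "INV-1001", "bounding_box": [0.35, 0.10, 0.55, 0.10, 0.55, 0.15, 0.35, 0.15]},
    {"text": "Total", "bounding_box": [0.10, 0.80, 0.20, 0.80, 0.20, 0.85, 0.10, 0.85]},
    {"text": "118.50", "bounding_box": [0.25, 0.80, 0.40, 0.80, 0.40, 0.85, 0.25, 0.85]},
]

# Ground truth values already known from a backend system.
ground_truth = {"invoice_id": "INV-1001", "total": "118.50"}

def build_labels(words, truth):
    """Match each ground truth value against the OCR words and emit label entries."""
    labels = []
    for field, value in truth.items():
        matches = [w for w in words if w["text"] == value]
        if matches:
            labels.append({"label": field, "value": matches})
    return labels

label_file = {"document": "invoice_001.pdf", "labels": build_labels(ocr_words, ground_truth)}
print(json.dumps(label_file, indent=2))
```

Real forms need fuzzier matching (normalizing whitespace, dates, and currency formats), but the principle is the same: the ground truth drives the labels instead of a human annotator.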
The image below illustrates the overall process:
For assets and guidance, refer to the auto-labeling section of the forms knowledge extraction repo.