Set up a text labeling project and export labels
In Azure Machine Learning, learn how to create and run data labeling projects to label text data. Specify either a single label or multiple labels to apply to each text item.
You can also use the data labeling tool in Azure Machine Learning to create an image labeling project.
Text labeling capabilities
Azure Machine Learning data labeling is a tool you can use to create, manage, and monitor data labeling projects. Use it to:
- Coordinate data, labels, and team members to efficiently manage labeling tasks.
- Track progress and maintain the queue of incomplete labeling tasks.
- Start and stop the project, and control the labeling progress.
- Review and export the labeled data as an Azure Machine Learning dataset.
Important
The text data you work with in the Azure Machine Learning data labeling tool must be available in an Azure Blob Storage datastore. If you don't have an existing datastore, you can upload your data files to a new datastore when you create a project.
These data formats are available for text data:
- .txt: Each file represents one item to be labeled.
- .csv or .tsv: Each row represents one item that's presented to the labeler. You decide which columns the labeler can see when they label the row.
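For illustration, here's a minimal Python sketch that writes sample data in both accepted formats; the file names and column names are hypothetical:

```python
import csv
from pathlib import Path

data_dir = Path("labeling_data")
data_dir.mkdir(exist_ok=True)

# File input: one .txt file per item to be labeled.
(data_dir / "review_001.txt").write_text("The product arrived quickly and works well.")
(data_dir / "review_002.txt").write_text("Packaging was damaged and support never replied.")

# Tabular input: one row per item; you decide which columns labelers see.
with open(data_dir / "reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "review_text"])  # hypothetical column names
    writer.writerow(["1", "The product arrived quickly and works well."])
    writer.writerow(["2", "Packaging was damaged and support never replied."])
```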
Prerequisites
You use these items to set up text labeling in Azure Machine Learning:
- The data that you want to label, either in local files or in Azure Blob Storage.
- The set of labels that you want to apply.
- The instructions for labeling.
- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
- An Azure Machine Learning workspace. See Create an Azure Machine Learning workspace.
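If you prefer to script the data-preparation steps that follow with the Azure Machine Learning Python SDK (v2), first connect to your workspace. A minimal sketch, assuming the azure-ai-ml and azure-identity packages are installed and the placeholders are replaced with your own values:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Replace the placeholders with your subscription, resource group, and workspace.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
```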
Create a text labeling project
Labeling projects are administered in Azure Machine Learning. Use the Data Labeling page in Machine Learning to manage your projects.
If your data is already in Azure Blob Storage, make sure that it's available as a datastore before you create the labeling project.
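If you need to register that Blob Storage container as a datastore, you can do it in the studio or with the SDK. A sketch that reuses the ml_client connection from the earlier example; the datastore, account, and container names are placeholders:

```python
from azure.ai.ml.entities import AzureBlobDatastore

blob_datastore = AzureBlobDatastore(
    name="labeling_blob_store",  # hypothetical datastore name
    description="Text data for the labeling project",
    account_name="<storage-account-name>",
    container_name="<container-name>",
)

ml_client.create_or_update(blob_datastore)
```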
To create a project, select Add project.
For Project name, enter a name for the project.
You can't reuse the project name, even if you delete the project.
To create a text labeling project, for Media type, select Text.
For Labeling task type, select an option for your scenario:
- To apply only a single label to each piece of text from a set of labels, select Text Classification Multi-class.
- To apply one or more labels to each piece of text from a set of labels, select Text Classification Multi-label.
- To apply labels to individual text words or to multiple text words in each entry, select Text Named Entity Recognition.
Select Next to continue.
Add workforce (optional)
Select Use a vendor labeling company from Azure Marketplace only if you've engaged a data labeling company from Azure Marketplace. Then select the vendor. If your vendor doesn't appear in the list, clear this option.
Make sure that you first contact the vendor and sign a contract. For more information, see Work with a data labeling vendor company (preview).
Select Next to continue.
Select or create a dataset
If you already created a dataset that contains your data, select it in the Select an existing dataset dropdown. You can also select Create a dataset to use an existing Azure datastore or to upload local files.
Note
A project can't contain more than 500,000 files. If your dataset exceeds this file count, only the first 500,000 files are loaded.
Create a dataset from an Azure datastore
In many cases, you can upload local files. However, Azure Storage Explorer provides a faster and more robust way to transfer a large amount of data. We recommend Storage Explorer as the default way to move files.
To create a dataset from data that's already stored in Blob Storage:
- Select Create.
- For Name, enter a name for your dataset. Optionally, enter a description.
- Choose the Dataset type:
- If you're using a .csv or .tsv file and each row contains a response, select Tabular.
- If you're using separate .txt files for each response, select File.
- Select Next.
- Select From Azure storage, and then select Next.
- Select the datastore, and then select Next.
- If your data is in a subfolder within Blob Storage, choose Browse to select the path.
  - To include all the files in the subfolders of the selected path, append `/**` to the path.
  - To include all the data in the current container and its subfolders, append `**/*.*` to the path.
- Select Create.
- Select the data asset you created.
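As an alternative to the studio wizard, you can register an equivalent data asset with the SDK. A sketch that assumes the hypothetical datastore registered earlier and a reviews folder of per-item .txt files:

```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

text_data = Data(
    name="labeling-text-files",  # hypothetical asset name
    description="Per-item .txt files for the labeling project",
    type=AssetTypes.URI_FOLDER,  # folder of files; one .txt file per item
    path="azureml://datastores/labeling_blob_store/paths/reviews/",
)

ml_client.data.create_or_update(text_data)
```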
Create a dataset from uploaded data
To directly upload your data:
- Select Create.
- For Name, enter a name for your dataset. Optionally, enter a description.
- Choose the Dataset type:
- If you're using a .csv or .tsv file and each row contains a response, select Tabular.
- If you're using separate .txt files for each response, select File.
- Select Next.
- Select From local files, and then select Next.
- (Optional) Select a datastore. The default uploads to the default blob store (workspaceblobstore) for your Machine Learning workspace.
- Select Next.
- Select Upload > Upload files or Upload > Upload folder to select the local files or folders to upload.
- Find your files or folder in the browser window, and then select Open.
- Continue to select Upload until you specify all of your files and folders.
- Optionally select the Overwrite if already exists checkbox. Verify the list of files and folders.
- Select Next.
- Confirm the details. Select Back to modify the settings, or select Create to create the dataset.
- Finally, select the data asset you created.
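The upload flow can also be scripted: when a data asset's path points at a local folder, the SDK uploads the folder's contents to the workspace's default blob datastore when the asset is created. A sketch with a hypothetical local folder and asset name:

```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

local_data = Data(
    name="labeling-text-upload",  # hypothetical asset name
    description="Locally uploaded text files for labeling",
    type=AssetTypes.URI_FOLDER,
    path="./labeling_data",  # local folder; contents are uploaded on create
)

ml_client.data.create_or_update(local_data)
```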
Configure incremental refresh
If you plan to add new data files to your dataset, use incremental refresh to add the files to your project.
When Enable incremental refresh at regular intervals is set, the dataset is checked periodically for new files to be added to a project based on the labeling completion rate. The check for new data stops when the project contains the maximum 500,000 files.
Select Enable incremental refresh at regular intervals when you want your project to continually monitor for new data in the datastore.
Clear the selection if you don't want new files in the datastore to automatically be added to your project.
Important
Don't create a new version for the dataset you want to update. If you do, the updates won't be seen because the data labeling project is pinned to the initial version. Instead, use Azure Storage Explorer to modify your data in the appropriate folder in Blob Storage.
Also, don't remove data. Removing data from the dataset your project uses causes an error in the project.
After the project is created, use the Details tab to change incremental refresh, view the time stamp for the last refresh, and request an immediate refresh of data.
Note
Projects that use tabular (.csv or .tsv) dataset input can use incremental refresh. But incremental refresh only adds new tabular files. The refresh doesn't recognize changes to existing tabular files.
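Azure Storage Explorer is the recommended way to add files, but you can also script additions with the azure-storage-blob package. A sketch with placeholder account, container, and folder names; incremental refresh picks up the new blob on its next check:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account-name>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("<container-name>")

# Upload a new item into the folder the labeling dataset points at.
with open("review_003.txt", "rb") as f:
    container.upload_blob(name="reviews/review_003.txt", data=f)
```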
Specify label categories
On the Label categories page, specify a set of classes to categorize your data.
Your labelers' accuracy and speed are affected by their ability to choose among classes. For instance, instead of spelling out the full genus and species for plants or animals, use a field code or abbreviate the genus.
You can use either a flat list or create groups of labels.
To create a flat list, select Add label category to create each label.
To create labels in different groups, select Add label category to create the top-level labels. Then select the plus sign (+) under each top level to create the next level of labels for that category. You can create up to six levels for any grouping.
You can select labels at any level during the tagging process. For example, the labels Animal, Animal/Cat, Animal/Dog, Color, Color/Black, Color/White, and Color/Silver are all available choices for a label. In a multi-label project, there's no requirement to pick one label from each category. If that is your intent, make sure to include this information in your instructions.
Describe the text labeling task
It's important to clearly explain the labeling task. On the Labeling instructions page, you can add a link to an external site that has labeling instructions, or you can provide instructions in the edit box on the page. Keep the instructions task-oriented and appropriate to the audience. Consider these questions:
- What are the labels labelers will see, and how will they choose among them? Is there a reference text to refer to?
- What should they do if no label seems appropriate?
- What should they do if multiple labels seem appropriate?
- What confidence threshold should they apply to a label? Do you want the labeler's best guess if they aren't certain?
- What should they do if they think they made a mistake after they submit a label?
- What should they do if multiple reviewers have different opinions about applying a label?
Note
Labelers can select the first nine labels by using number keys 1 through 9.
Quality control (preview)
To get more accurate labels, use the Quality control page to send each item to multiple labelers.
Important
Consensus labeling is currently in public preview.
The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
To have each item sent to multiple labelers, select Enable consensus labeling (preview). Then set values for Minimum labelers and Maximum labelers to specify how many labelers to use. Make sure that you have as many labelers available as your maximum number. You can't change these settings after the project has started.
If the minimum number of labelers reach a consensus, the item is labeled. If a consensus isn't reached, the item is sent to more labelers. If there's no consensus after the item goes to the maximum number of labelers, its status is Needs Review, and the project owner is responsible for labeling the item.
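The flow can be pictured with a short illustrative sketch. This isn't the service's implementation, and the exact consensus rule isn't documented here; a strict majority is assumed for illustration:

```python
from collections import Counter

def consensus_status(labels, min_labelers, max_labelers):
    """Illustrative only: assumes consensus means a strict majority."""
    top_label, votes = Counter(labels).most_common(1)[0]
    if len(labels) >= min_labelers and votes > len(labels) / 2:
        return ("labeled", top_label)
    if len(labels) < max_labelers:
        return ("send to more labelers", None)
    return ("needs review", None)  # the project owner labels the item

print(consensus_status(["cat", "cat", "dog"], 3, 5))                 # ('labeled', 'cat')
print(consensus_status(["cat", "dog", "bird"], 3, 5))                # ('send to more labelers', None)
print(consensus_status(["cat", "dog", "cat", "dog", "bird"], 3, 5))  # ('needs review', None)
```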
Use ML-assisted data labeling
To accelerate labeling tasks, use the ML assisted labeling page to trigger automatic machine learning model training. Machine learning (ML)-assisted labeling can handle both file (.txt) and tabular (.csv) text data inputs.
To use ML-assisted labeling:
- Select Enable ML assisted labeling.
- Select the Dataset language for the project. This list shows all languages that the TextDNNLanguages Class supports.
- Specify a compute target to use. If you don't have a compute target in your workspace, this step creates a compute cluster and adds it to your workspace. The cluster is created with a minimum of zero nodes, and it costs nothing when not in use.
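If you'd rather create the compute cluster ahead of time with the SDK, a sketch with a hypothetical cluster name and VM size, reusing the ml_client connection from earlier:

```python
from azure.ai.ml.entities import AmlCompute

cluster = AmlCompute(
    name="labeling-cluster",  # hypothetical name
    size="STANDARD_DS3_V2",   # pick a VM size available in your region
    min_instances=0,          # scales to zero, so it costs nothing when idle
    max_instances=4,
)

ml_client.compute.begin_create_or_update(cluster).result()
```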
More information about ML-assisted labeling
At the start of your labeling project, the items are shuffled into a random order to reduce potential bias. However, the trained model reflects any biases present in the dataset. For example, if 80 percent of your items are of a single class, then approximately 80 percent of the data that's used to train the model lands in that class.
To train the text DNN model that ML-assisted labeling uses, the input text per training example is limited to approximately the first 128 words in the document. For tabular input, all text columns are concatenated before this limit is applied. This practical limit allows the model training to complete in a reasonable amount of time. The actual text in a document (for file input) or set of text columns (for tabular input) can exceed 128 words. The limit pertains only to what the model internally uses during the training process.
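The effect of this limit can be pictured with a small sketch; it only mimics the truncation described above and isn't the service's code:

```python
def training_text(row, text_columns, word_limit=128):
    """Concatenate the text columns, then keep roughly the first 128 words."""
    combined = " ".join(str(row[col]) for col in text_columns)
    return " ".join(combined.split()[:word_limit])

row = {"title": "Fast delivery", "body": "Arrived a day early and works great."}
print(training_text(row, ["title", "body"]))
```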
The number of labeled items that's required to start assisted labeling isn't a fixed number. This number can vary significantly from one labeling project to another. The variance depends on many factors, including the number of label classes and the label distribution.
When you use consensus labeling, the consensus label is used for training.
Because the final labels still rely on input from the labeler, this technology is sometimes called human-in-the-loop labeling.
Note
ML-assisted data labeling doesn't support default storage accounts that are secured behind a virtual network. You must use a non-default storage account for ML-assisted data labeling. The non-default storage account can be secured behind the virtual network.
Pre-labeling
After you submit enough labels for training, the trained model is used to predict tags. The labeler now sees pages that show predicted labels already present on each item. The task then involves reviewing these predictions and correcting any mislabeled items before page submission.
After you train the machine learning model on your manually labeled data, the model is evaluated on a test set of manually labeled items. The evaluation helps determine the model's accuracy at different confidence thresholds. The evaluation process sets a confidence threshold beyond which the model is accurate enough to show pre-labels. The model is then evaluated against unlabeled data. Items that have predictions that are more confident than the threshold are used for pre-labeling.
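Conceptually, pre-label selection works like the following sketch; the threshold value and prediction format are illustrative, not the service's internals:

```python
def select_prelabels(predictions, threshold):
    """Keep only predictions confident enough to show as pre-labels."""
    return [(item_id, label)
            for item_id, label, confidence in predictions
            if confidence >= threshold]

predictions = [("doc-1", "positive", 0.93), ("doc-2", "negative", 0.41)]
print(select_prelabels(predictions, threshold=0.80))  # [('doc-1', 'positive')]
```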
Initialize the text labeling project
After the labeling project is initialized, some aspects of the project are immutable. You can't change the task type or dataset. You can modify labels and the URL for the task description. Carefully review the settings before you create the project. After you submit the project, you return to the Data Labeling overview page, which shows the project as Initializing.
Note
This page might not automatically refresh. After a pause, manually refresh the page to see the project's status as Created.
Run and monitor the project
After you initialize the project, Azure begins to run it. To see the project details, select the project on the main Data Labeling page.
To pause or restart the project, on the project command bar, toggle the Running status. You can label data only when the project is running.
Dashboard
The Dashboard tab shows the labeling task progress.
The progress charts show how many items are labeled, skipped, in need of review, or not yet complete. Hover over the chart to see the number of items in each section.
A distribution of the labels for completed tasks is shown below the chart. In some project types, an item can have multiple labels. The total number of labels can exceed the total number of items.
A distribution of labelers, and how many items each labeler has labeled, also appears.
The middle section shows a table that has a queue of unassigned tasks. When ML-assisted labeling is off, this section shows the number of manual tasks that are awaiting assignment.
When ML-assisted labeling is on, this section also shows:
- Tasks that contain clustered items in the queue.
- Tasks that contain pre-labeled items in the queue.
Additionally, when ML-assisted labeling is enabled, you can scroll down to see the ML-assisted labeling status. The Jobs sections provide links to each machine learning run.
Data
On the Data tab, you can see your dataset and review labeled data. Scroll through the labeled data to see the labels. If you see data that's incorrectly labeled, select it and choose Reject to remove the labels and return the data to the unlabeled queue.
If your project uses consensus labeling, review items that have no consensus:
- Select the Data tab.
- On the left menu, select Review labels.
- On the command bar above Review labels, select All filters.
- Under Labeled datapoints, select Consensus labels in need of review to show only items for which the labelers didn't come to a consensus.
- For each item to review, select the Consensus label dropdown to view the conflicting labels.
Although you can select an individual labeler to see their labels, to update or reject the labels, you must use the top choice, Consensus label (preview).
Details tab
View and change details of your project. On this tab, you can:
- View project details and input datasets.
- Set or clear the Enable incremental refresh at regular intervals option, or request an immediate refresh.
- View details of the storage container that's used to store labeled outputs in your project.
- Add labels to your project.
- Edit the instructions you give to your labelers.
- Change settings for ML-assisted labeling and kick off a labeling task.
Language Studio tab
If your project was created from Language Studio, you'll also see a Language Studio tab.
- If labeling is active in Language Studio, you can't also label in Azure Machine Learning. In that case, Language Studio is the only tab available. Select View in Language Studio to go to the active labeling project in Language Studio. From there, you can switch to labeling in Azure Machine Learning if you wish.
- If labeling is active in Azure Machine Learning, you have two choices:
  - Select Switch to Language Studio to switch your labeling activity back to Language Studio. When you switch, all your currently labeled data is imported into Language Studio. Your ability to label data in Azure Machine Learning is disabled, and you can label data in Language Studio. You can switch back to labeling in Azure Machine Learning at any time through Language Studio.
Note
Only users with the correct roles in Azure Machine Learning have the ability to switch labeling.
  - Select Disconnect from Language Studio to sever the relationship with Language Studio. After you disconnect, the project loses its association with Language Studio and no longer has the Language Studio tab. Disconnecting a project from Language Studio is permanent and can't be undone. You can no longer access your labels for this project in Language Studio; from that point onward, the labels are available only in Azure Machine Learning.
Access for labelers
Anyone who has Contributor or Owner access to your workspace can label data in your project.
You can also add users and customize the permissions so that they can access labeling but not other parts of the workspace or your labeling project. For more information, see Add users to your data labeling project.
Add new labels to a project
During the data labeling process, you might want to add more labels to classify your items. For example, you might want to add an Unknown or Other label to indicate confusion.
To add one or more labels to a project:
- On the main Data Labeling page, select the project.
- On the project command bar, toggle the status from Running to Paused to stop labeling activity.
- Select the Details tab.
- In the list on the left, select Label categories.
- Modify your labels.
- In the form, add your new label. Because you've changed the available labels, choose how to treat data that's already labeled:
  - Start over, and remove all existing labels. Choose this option if you want to start labeling from the beginning by using the new full set of labels.
  - Start over, and keep all existing labels. Choose this option to mark all data as unlabeled, but keep the existing labels as a default tag for items that were previously labeled.
  - Continue, and keep all existing labels. Choose this option to keep all data already labeled as it is, and start using the new label for data that's not yet labeled.
- Modify your instructions page as necessary for the new labels.
- After you've added all new labels, toggle Paused to Running to restart the project.
Start an ML-assisted labeling task
ML-assisted labeling starts automatically after some items have been labeled. This automatic threshold varies by project. You can manually start an ML-assisted training run if your project contains at least some labeled data.
Note
On-demand training isn't available for projects created before December 2022. To use this feature, create a new project.
To start a new ML-assisted training run:
- At the top of your project, select Details.
- On the left menu, select ML assisted labeling.
- Near the bottom of the page, for On-demand training, select Start.
Export the labels
To export the labels, on the Project details page of your labeling project, select the Export button. You can export the label data for Machine Learning experimentation at any time.
For all project types except Text Named Entity Recognition, you can export label data as:
- A CSV file. Azure Machine Learning creates the CSV file in a folder inside Labeling/export/csv.
- An Azure Machine Learning dataset with labels.
- An Azure MLTable data asset.
For Text Named Entity Recognition projects, you can export label data as:
- An Azure Machine Learning dataset (v1) with labels.
- A CoNLL file. For this export, you'll also have to assign a compute resource. The export process runs offline and generates the file as part of an experiment run. Azure Machine Learning creates the CoNLL file in a folder inside Labeling/export/conll.
- An Azure MLTable data asset.
When you export a CSV or CoNLL file, a notification appears briefly when the file is ready to download. Select the Download file link to download your results. You'll also find the notification in the Notification section on the top bar:
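After you download a CoNLL file, you can read it with a few lines of Python. This is an illustrative reader that assumes a common token-per-line layout, with whitespace-separated token and label and blank lines between documents; check your exported file's exact format:

```python
def read_conll(path):
    """Parse 'token label' lines into per-document lists of (token, label)."""
    documents, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line ends the current document
                if current:
                    documents.append(current)
                    current = []
                continue
            parts = line.split()
            current.append((parts[0], parts[-1]))
    if current:
        documents.append(current)
    return documents
```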
Access exported Azure Machine Learning datasets and data assets in the Data section of Machine Learning. The data details page also provides sample code you can use to access your labels by using Python.
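As one example of that access pattern, a sketch that loads an exported MLTable data asset into a pandas DataFrame; the asset name and version are placeholders, and it assumes the mltable package and the ml_client connection from earlier:

```python
import mltable

data_asset = ml_client.data.get(name="<exported-asset-name>", version="1")
tbl = mltable.load(data_asset.path)  # path to the exported MLTable folder
labels_df = tbl.to_pandas_dataframe()
print(labels_df.head())
```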
Troubleshoot issues
Use these tips if you see any of the following issues:
| Issue | Resolution |
| --- | --- |
| Only datasets created on blob datastores can be used. | This issue is a known limitation of the current release. |
| Removing data from the dataset your project uses causes an error in the project. | Don't remove data from the version of the dataset you used in a labeling project. Create a new version of the dataset to use to remove data. |
| After a project is created, the project status is Initializing for an extended time. | Manually refresh the page. Initialization should complete at roughly 20 data points per second. The lack of automatic refresh is a known issue. |
| Newly labeled items aren't visible in data review. | To load all labeled items, select the First button. The First button takes you back to the front of the list, and it loads all labeled data. |
| You can't assign a set of tasks to a specific labeler. | This issue is a known limitation of the current release. |