Create a text labeling project and export labels

Learn how to create and run data labeling projects to label text data in Azure Machine Learning. Specify either a single label or multiple labels to be applied to each text item.

You can also use the data labeling tool to create an image labeling project.

Text labeling capabilities

Azure Machine Learning data labeling is a central place to create, manage, and monitor data labeling projects:

  • Coordinate data, labels, and team members to efficiently manage labeling tasks.
  • Track progress and maintain the queue of incomplete labeling tasks.
  • Start and stop the project and control the labeling progress.
  • Review the labeled data and export it as an Azure Machine Learning dataset.


Text data must be available in an Azure blob datastore. (If you do not have an existing datastore, you may upload files during project creation.)

Data formats available for text data:

  • .txt: each file represents one item to be labeled.
  • .csv or .tsv: each row represents one item presented to the labeler. You decide which columns the labeler can see in order to label the row.
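To make the two input formats concrete, here's a small sketch that writes hypothetical sample files in each layout (the file names, ids, and review texts are invented for illustration):

```python
import csv
from pathlib import Path

out = Path("labeling_input")
out.mkdir(exist_ok=True)

# .txt format: one file per item to be labeled.
(out / "item_001.txt").write_text("The battery life on this laptop is excellent.")

# .csv format: one row per item; you decide which columns the labeler sees.
rows = [
    {"id": "1", "text": "Great service, would order again."},
    {"id": "2", "text": "Package arrived damaged."},
]
with open(out / "items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)
```

With the tabular layout, each row becomes one labeling task; with the file layout, each .txt file becomes one task.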


Prerequisites

  • The data that you want to label, either in local files or in Azure blob storage.
  • The set of labels that you want to apply.
  • The instructions for labeling.
  • An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
  • A Machine Learning workspace. See Create an Azure Machine Learning workspace.

Create a text labeling project

Labeling projects are administered from Azure Machine Learning. You use the Data Labeling page to manage your projects.

If your data is already in Azure Blob storage, you should make it available as a datastore before you create the labeling project.

  1. To create a project, select Add project. Give the project an appropriate name. The project name can't be reused, even if the project is deleted in the future.

  2. Select Text to create a text labeling project.

    Screenshot: Labeling project creation for text labeling.

    • Choose Text Classification Multi-class for projects when you want to apply only a single label from a set of labels to each piece of text.
    • Choose Text Classification Multi-label for projects when you want to apply one or more labels from a set of labels to each piece of text.
    • Choose Text Named Entity Recognition for projects when you want to apply labels to individual or multiple words of text in each entry.
  3. Select Next when you're ready to continue.

Add workforce (optional)

Select Use a vendor labeling company from Azure Marketplace only if you've engaged a data labeling company from Azure Marketplace. Then select the vendor. If your vendor doesn't appear in the list, unselect this option.

Make sure you first contact the vendor and sign a contract. For more information, see Work with a data labeling vendor company (preview).

Select Next to continue.

Select or create a dataset

If you already created a dataset that contains your data, select it from the Select an existing dataset drop-down list. Or, select Create a dataset to use an existing Azure datastore or to upload local files.


A project cannot contain more than 500,000 files. If your dataset has more, only the first 500,000 files will be loaded.

Create a dataset from an Azure datastore

In many cases, it's fine to just upload local files. But Azure Storage Explorer provides a faster and more robust way to transfer a large amount of data. We recommend Storage Explorer as the default way to move files.

To create a dataset from data that you've already stored in Azure Blob storage:

  1. Select + Create.
  2. Assign a Name to your dataset, and optionally a description.
  3. Choose the Dataset type:
    • Select Tabular if you're using a .csv or .tsv file, where each row contains a response.
    • Select File if you're using separate .txt files for each response.
  4. Select Next.
  5. Select From Azure storage, then Next.
  6. Select the datastore, then select Next.
  7. If your data is in a subfolder within your blob storage, choose Browse to select the path.
    • Append "/**" to the path to include all the files in subfolders of the selected path.
    • Append "**/*.*" to include all the data in the current container and its subfolders.
  8. Select Create.
  9. Now select the data asset you just created.
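The "/**" path pattern in step 7 behaves like a recursive glob. As a local illustration only (this sketch uses Python's pathlib on an invented folder tree, not the Azure storage API), you can see which files such a pattern selects:

```python
from pathlib import Path

# Build a small hypothetical folder tree to illustrate the pattern.
root = Path("demo_container")
(root / "reviews" / "2023").mkdir(parents=True, exist_ok=True)
(root / "reviews" / "a.txt").write_text("item a")
(root / "reviews" / "2023" / "b.txt").write_text("item b")

# "reviews/**" selects all files under the chosen path and its subfolders,
# much like this recursive glob does locally:
matched = sorted(p.name for p in root.glob("reviews/**/*") if p.is_file())
print(matched)  # ['a.txt', 'b.txt']
```

Both the file directly under `reviews` and the one in its `2023` subfolder match, which is the behavior the "/**" suffix gives you in the datastore path.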

Create a dataset from uploaded data

To directly upload your data:

  1. Select + Create.
  2. Assign a Name to your dataset, and optionally a description.
  3. Choose the Dataset type:
    • Select Tabular if you're using a .csv or .tsv file, where each row contains a response.
    • Select File if you're using separate .txt files for each response.
  4. Select Next.
  5. Select From local files, then select Next.
  6. (Optional) Select a datastore. Or keep the default to upload to the default blob store ("workspaceblobstore") of your Machine Learning workspace.
  7. Select Next.
  8. Select Upload > Upload files or Upload > Upload folder to select the local files or folder(s) to upload.
  9. In the browser window, find your files or folder, then select Open.
  10. Continue using Upload until you have specified all your files/folders.
  11. Check the box Overwrite if already exists if you wish. Verify the list of files/folders.
  12. Select Next.
  13. Confirm the details. Select Back to modify the settings or Create to create the dataset.
  14. Now select the data asset you just created.

Configure incremental refresh

If you plan to add new files to your dataset, use incremental refresh to add these new files to your project.

When incremental refresh at regular intervals is enabled, the dataset is checked periodically for new files to be added to a project, based on the labeling completion rate. The check for new data stops when the project contains the maximum 500,000 files.

Select Enable incremental refresh at regular intervals when you want your project to continually monitor for new data in the datastore.

Unselect if you don't want new files in the datastore to automatically be added to your project.


Don't create a new version for the dataset you want to update. If you do, the updates will not be seen, as the data labeling project is pinned to the initial version. Instead, use Azure Storage Explorer to modify your data in the appropriate folder in the blob storage. Also, don't remove data. Removing data from the dataset your project uses will cause an error in the project.

After the project is created, use the Details tab to change incremental refresh, view the timestamp for the last refresh, and request an immediate refresh of data.


Incremental refresh is available for projects that use tabular (.csv or .tsv) dataset input. However, only new tabular files are added. Changes to existing tabular files will not be recognized from the refresh.

Specify label categories

On the Label categories page, specify the set of classes to categorize your data. Your labelers' accuracy and speed are affected by their ability to choose among the classes. For instance, instead of spelling out the full genus and species for plants or animals, use a field code or abbreviate the genus.

You can use either a flat list or create groups of labels.

  • To create a flat list, select + Add label category to create each label.

    Screenshot: Add flat structure for labels.

  • To create labels in different groups, select + Add label category to create the top-level labels. Then select the + under each top-level label to create the next level of labels for that category. You can create up to six levels for any grouping.

    Screenshot: Add groups of labels.

Labels at any level may be selected during the tagging process. For example, the labels Animal, Animal/Cat, Animal/Dog, Color, Color/Black, Color/White, and Color/Silver are all available choices for a label. In a multi-label project, there is no requirement to pick one of each category. If that is your intent, make sure to add this information in your instructions.
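A label hierarchy like the one above can be modeled as a nested mapping. This sketch (using the example Animal/Color taxonomy from the text) flattens the groups into the full set of selectable label paths, including the intermediate levels:

```python
# Example label taxonomy from the text: two top-level categories with children.
taxonomy = {
    "Animal": {"Cat": {}, "Dog": {}},
    "Color": {"Black": {}, "White": {}, "Silver": {}},
}

def flatten(tree, prefix=""):
    """Yield every selectable label path, including intermediate levels."""
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        yield path
        yield from flatten(children, path)

labels = list(flatten(taxonomy))
print(labels)
# ['Animal', 'Animal/Cat', 'Animal/Dog', 'Color', 'Color/Black', 'Color/White', 'Color/Silver']
```

Note that the parent labels ("Animal", "Color") are themselves valid choices, matching the behavior described above.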

Describe the text labeling task

It's important to clearly explain the labeling task. On the Labeling instructions page, you can add a link to an external site for labeling instructions, or provide instructions in the edit box on the page. Keep the instructions task-oriented and appropriate to the audience. Consider these questions:

  • What are the labels they'll see, and how will they choose among them? Is there a reference text to refer to?
  • What should they do if no label seems appropriate?
  • What should they do if multiple labels seem appropriate?
  • What confidence threshold should they apply to a label? Do you want their "best guess" if they aren't certain?
  • What should they do after they submit a label if they think they made a mistake?
  • What should they do if there are multiple reviewers who have different opinions on the labels?


Be sure to note that the labelers will be able to select the first 9 labels by using number keys 1-9.

Quality control (preview)

To get more accurate labels, use the Quality control page to send each item to multiple labelers.


Consensus labeling is currently in public preview. The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Select Enable consensus labeling (preview) to have each item sent to multiple labelers. Then set the Minimum labelers and Maximum labelers to specify how many labelers to use. Make sure you have as many labelers available as your maximum number. You can't change these settings after the project has started.

If a consensus is reached from the minimum number of labelers, the item is labeled. If a consensus isn't reached, the item will be sent to more labelers. If there's no consensus after the item goes to the maximum number of labelers, its status will be Needs Review, and the project owner will be responsible for labeling the item.
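The decision flow just described can be sketched as a small function. One loud assumption: "consensus" is modeled here as all collected labels agreeing, because the service's actual agreement rule isn't documented in this article.

```python
def consensus_status(votes, min_labelers, max_labelers):
    """Sketch of the consensus flow described above.

    Assumption: "consensus" is modeled as unanimity among the labels
    collected so far; the service's actual rule may differ.
    """
    n = len(votes)
    if n >= min_labelers and len(set(votes)) == 1:
        return "labeled"                # consensus reached
    if n < max_labelers:
        return "send to more labelers"  # no consensus yet, labelers remain
    return "needs review"               # max labelers used, no consensus

print(consensus_status(["pos", "pos"], 2, 5))                       # labeled
print(consensus_status(["pos", "neg"], 2, 5))                       # send to more labelers
print(consensus_status(["pos", "neg", "pos", "neg", "pos"], 2, 5))  # needs review
```

The third case mirrors the Needs Review outcome, where the project owner becomes responsible for the final label.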

Use ML-assisted data labeling

The ML-assisted labeling page lets you trigger automatic machine learning models to accelerate labeling tasks. ML-assisted labeling is available for both file (.txt) and tabular (.csv) text data inputs. To use ML-assisted labeling:

  • Select Enable ML assisted labeling.
  • Select the Dataset language for the project. All languages supported by the TextDNNLanguages Class are present in this list.
  • Specify a compute target to use. If you don't have one in your workspace, a compute cluster will be created for you and added to your workspace. The cluster is created with a minimum of 0 nodes, which means it doesn't cost anything when it's not in use.

How does ML-assisted labeling work?

At the beginning of your labeling project, the items are shuffled into a random order to reduce potential bias. However, any biases that are present in the dataset will be reflected in the trained model. For example, if 80% of your items are of a single class, then approximately 80% of the data used to train the model will be of that class.

For training the text DNN model used by ML-assist, the input text per training example will be limited to approximately the first 128 words in the document. For tabular input, all text columns are first concatenated before applying this limit. This is a practical limit imposed to allow for the model training to complete in a timely manner. The actual text in a document (for file input) or set of text columns (for tabular input) can exceed 128 words. The limit only pertains to what is internally used by the model during the training process.
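The truncation behavior described above can be approximated like this. The function name, the word-based splitting, and the sample row are illustrative assumptions; the service's exact tokenization isn't specified here.

```python
def training_text(example, text_columns=None, word_limit=128):
    """Approximate the preprocessing described above: for tabular input,
    concatenate the text columns, then keep only the first `word_limit`
    words. Illustration only; the actual tokenization may differ.
    """
    if text_columns:  # tabular row: dict of column -> text
        text = " ".join(example[c] for c in text_columns)
    else:             # file input: the document text itself
        text = example
    return " ".join(text.split()[:word_limit])

# Hypothetical tabular row with a long body column.
row = {"title": "Late delivery", "body": "The package " + "very " * 200 + "late."}
truncated = training_text(row, text_columns=["title", "body"])
print(len(truncated.split()))  # 128
```

The full row is far longer than 128 words, but only the leading 128 words (title first, since columns are concatenated in order) would reach the model during training.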

The exact number of labeled items necessary to start assisted labeling isn't a fixed number. This can vary significantly from one labeling project to another, depending on many factors, including the number of label classes and the label distribution.

When you're using consensus labeling, the consensus label is used for training.

Since the final labels still rely on input from the labeler, this technology is sometimes called "human in the loop" labeling.


ML assisted data labeling does not support default storage accounts secured behind a virtual network. You must use a non-default storage account for ML assisted data labeling. The non-default storage account can be secured behind the virtual network.


After enough labels are submitted for training, the trained model is used to predict tags. The labeler now sees pages that contain predicted labels already present on each item. The task is then to review these predictions and correct any mislabeled items before submitting the page.

Once a machine learning model has been trained on your manually labeled data, the model is evaluated on a test set of manually labeled items to determine its accuracy at different confidence thresholds. This evaluation process is used to determine a confidence threshold above which the model is accurate enough to show pre-labels. The model is then evaluated against unlabeled data. Items with predictions more confident than this threshold are used for pre-labeling.
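The threshold-selection step can be sketched as follows. The `target_accuracy` knob and the `(confidence, was_correct)` validation pairs are hypothetical; the article doesn't document the service's actual criterion, only that a cutoff is chosen from evaluation on held-out labeled data.

```python
def pick_threshold(val_predictions, target_accuracy=0.9):
    """Sketch of the evaluation step described above: find the lowest
    confidence cutoff at which predictions on a labeled validation set
    are accurate enough to show as pre-labels.

    val_predictions: list of (confidence, was_correct) pairs.
    """
    for cutoff in sorted({conf for conf, _ in val_predictions}):
        kept = [ok for conf, ok in val_predictions if conf >= cutoff]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return cutoff
    return None  # no cutoff is accurate enough; don't pre-label

# Hypothetical validation results: one wrong low-confidence prediction.
preds = [(0.55, False), (0.70, True), (0.80, True), (0.95, True)]
print(pick_threshold(preds))  # 0.7
```

At cutoff 0.55 the accuracy is 75%, below target; at 0.70 the remaining predictions are all correct, so unlabeled items predicted with confidence at or above 0.70 would receive pre-labels.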

Initialize the text labeling project

Carefully review the settings before you create the project. After the project is initialized, some aspects of it are immutable: you can't change the task type or dataset. You can modify the labels and the URL for the task description. After you submit the project, you're returned to the Data Labeling homepage, which shows the project as Initializing.


This page may not automatically refresh. So, after a pause, manually refresh the page to see the project's status as Created.

Run and monitor the project

After you initialize the project, Azure will begin running it. Select the project on the main Data Labeling page to see details of the project.

To pause or restart the project, toggle the Running status on the top right. You can only label data when the project is running.


Dashboard tab

The Dashboard tab shows the progress of the labeling task.

Screenshot: Text data labeling dashboard.

The progress charts show how many items have been labeled, skipped, are in need of review, or are not yet done. Hover over the chart to see the number of items in each section.

Below the charts is a distribution of the labels for those tasks that are complete. Remember that in some project types, an item can have multiple labels, in which case the total number of labels can be greater than the total number of items.
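That counting behavior is easy to see with a small sketch. The completed items below are invented examples from a hypothetical multi-label project:

```python
from collections import Counter

# Hypothetical completed tasks: each item can carry more than one label,
# so the label count can exceed the item count.
completed = [
    {"id": 1, "labels": ["Animal/Dog", "Color/Black"]},
    {"id": 2, "labels": ["Animal/Cat"]},
    {"id": 3, "labels": ["Animal/Cat", "Color/White"]},
]

distribution = Counter(label for item in completed for label in item["labels"])
total_labels = sum(distribution.values())
print(distribution.most_common())
print(f"{total_labels} labels across {len(completed)} items")
```

Here three items produce five labels, which is why the dashboard's label totals can exceed the item count in multi-label projects.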

You also see a distribution of labelers and how many items they've labeled.

Finally, in the middle section, there is a table showing a queue of tasks yet to be assigned. When ML assisted labeling is off, this section shows the number of manual tasks to be assigned.

Additionally, when ML assisted labeling is enabled, scroll down to see the ML assisted labeling status. The Jobs section gives links for each of the machine learning runs.


Data tab

On the Data tab, you can see your dataset and review labeled data. Scroll through the labeled data to see the labels. If you see incorrectly labeled data, select it and choose Reject, which removes the labels and puts the data back into the unlabeled queue.

If your project uses consensus labeling, you'll also want to review those items without a consensus. To do so:

  1. Select the Data tab.

  2. On the left, select Review labels.

  3. On the top right, select All filters.

    Screenshot: select filters to review consensus label problems.

  4. Under Labeled datapoints, select Consensus labels in need of review. This shows only those items where a consensus wasn't achieved among the labelers.

    Screenshot: Select labels in need of review.

  5. For each item in need of review, select the Consensus label dropdown to view the conflicting labels.

    Screenshot: Select Consensus label dropdown to review conflicting labels.

  6. While you can select an individual to see just their label(s), you can only update or reject the labels from the top choice, Consensus label (preview).

Details tab

View and change details of your project. In this tab you can:

  • View project details and input datasets
  • Enable or disable incremental refresh at regular intervals, or request an immediate refresh.
  • View details of the storage container used to store labeled outputs in your project
  • Add labels to your project
  • Edit the instructions you give to your labelers
  • Change settings for ML assisted labeling, and kick off a labeling task

Access for labelers

Anyone who has Contributor or Owner access to your workspace can label data in your project.

You can also add users and customize the permissions so that they can access labeling but not other parts of the workspace or your labeling project. For more information, see Add users to your data labeling project.

Add new labels to a project

During the data labeling process, you may want to add more labels to classify your items. For example, you may want to add an "Unknown" or "Other" label to indicate confusion.

Use these steps to add one or more labels to a project:

  1. Select the project on the main Data Labeling page.
  2. At the top right of the page, toggle Running to Paused to stop labelers from their activity.
  3. Select the Details tab.
  4. In the list on the left, select Label categories.
  5. Modify your labels.

    Screenshot: Add a label.
  6. In the form, add your new label. Then choose how to continue the project. Since you've changed the available labels, you choose how to treat the already labeled data:
    • Start over, removing all existing labels. Choose this option if you want to start labeling from the beginning with the new full set of labels.
    • Start over, keeping all existing labels. Choose this option to mark all data as unlabeled, but keep the existing labels as a default tag for items that were previously labeled.
    • Continue, keeping all existing labels. Choose this option to keep all data already labeled as is, and start using the new label for data not yet labeled.
  7. Modify your instructions page as necessary for the new label(s).
  8. Once you've added all new labels, at the top right of the page toggle Paused to Running to restart the project.

Start an ML assisted labeling task

ML assisted labeling starts automatically after some items have been labeled. This automatic threshold varies by project. However, you can manually start an ML assisted training run, as long as your project contains at least some labeled data.


On-demand training is not available for projects created before December 2022. Create a new project to use this feature.

Use the Details section to start a new ML assisted training run.

  1. At the top of your project, select Details.
  2. On the side navigation for Details, select ML assisted labeling.
  3. Scroll to the bottom if necessary and select Start for On-demand training.

Export the labels

Use the Export button on the Project details page of your labeling project. You can export the label data for Machine Learning experimentation at any time.

For all project types other than Text Named Entity Recognition, you can export:

  • A CSV file. The CSV file is created in the default blob store of the Azure Machine Learning workspace.
  • An Azure Machine Learning dataset (v1) with labels.

For Text Named Entity Recognition projects, you can export:

  • An Azure Machine Learning dataset (v1) with labels.

  • A CoNLL file. For this export, you'll also have to assign a compute resource. The export process runs offline and generates the file as part of an experiment run. When the file is ready to download, you'll see a notification on the top right. Select this to open the notification, which includes the link to the file.

    Screenshot: Notification for file download.
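CoNLL-style NER files commonly put one "token label" pair per line, with blank lines separating sentences. Under that assumption (the exact columns in the exported file may differ), here's a sketch of a parser for such output; the sample tokens and entity tags are invented:

```python
def parse_conll(text):
    """Parse a CoNLL-style token/label file into sentences.

    Assumption: one "token label" pair per line, blank lines separating
    sentences, as in common CoNLL NER layouts.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:             # blank line closes the current sentence
                sentences.append(current)
                current = []
            continue
        token, label = line.rsplit(maxsplit=1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

sample = "Contoso B-ORG\nshipped O\nto O\nSeattle B-LOC\n\nThanks O\n"
print(parse_conll(sample))
```

Each sentence comes back as a list of `(token, label)` pairs, ready to feed into downstream NER training or evaluation code.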

Access exported Azure Machine Learning datasets in the Datasets section of Machine Learning. The dataset details page also provides sample code to access your labels from Python.

Screenshot: Exported dataset.


Troubleshooting

Use these tips if you see any of these issues.

Issue: Only datasets created on blob datastores can be used.
Resolution: This is a known limitation of the current release.

Issue: Removing data from the dataset your project uses causes an error in the project.
Resolution: Don't remove data from the version of the dataset you used in a labeling project. Create a new version of the dataset to use for removing data.

Issue: After creation, the project shows "Initializing" for a long time.
Resolution: Manually refresh the page. Initialization should complete at roughly 20 datapoints per second. The lack of autorefresh is a known issue.

Issue: Newly labeled items aren't visible in data review.
Resolution: To load all labeled items, choose the First button. The First button takes you back to the front of the list, and loads all labeled data.

Issue: Unable to assign a set of tasks to a specific labeler.
Resolution: This is a known limitation of the current release.

Next steps