Tips for building labeled datasets
This article applies to: Document Intelligence v3.1.
Document Intelligence v3.0.
Document Intelligence v2.1.
This article highlights the best methods for labeling custom model datasets in the Document Intelligence Studio. Labeling documents can be time consuming when you have a large number of labels, long documents, or documents with varying structure. These tips should help you label documents more efficiently.
Video: Custom labels best practices
The following video is the second of two presentations intended to help you build custom models with higher accuracy (the first presentation explores How to create a balanced data set).
Here, we examine best practices for labeling your selected documents. With semantically relevant and consistent labeling, you should see an improvement in model performance.
Search
The Studio now includes a search box for instances when you know you need to find specific words to label, but just don't know where they're located in the document. Simply search for the word or phrase and navigate to the specific section in the document to label the occurrence.
Auto label tables
Tables can be challenging to label, when they have many rows or dense text. If the layout table extracts the result you need, you should just use that result and skip the labeling process. In instances where the layout table isn't exactly what you need, you can start with generating the table field from the values layout extracts. Start by selecting the table icon on the page and select on the auto label button. You can then edit the values as needed. Auto label currently only supports single page tables.
Shift select
When labeling a large span of text, rather than mark each word in the span, hold down the shift key as you're selecting the words to speed up labeling and ensure you don't miss any words in the span of text.
Region labeling
A second option for labeling larger spans of text is to use region labeling. When region labeling is used, the OCR results are populated in the value at training time. The difference between the shift select and region labeling is only in the visual feedback the shift labeling approach provides.
Field subtypes
When creating a field, select the right subtype to minimize post processing, for instance select the dmy
option for dates to extract the values in a dd-mm-yyyy
format.
Batch layout
When creating a project, select the batch layout option to prepare all documents in your dataset for labeling. This feature ensures that you no longer have to select on each document and wait for the layout results before you can start labeling.
Next steps
Learn more about custom labeling:
Learn more about custom template models:
Learn more about custom neural models:
Feedback
Submit and view feedback for