Handling Closely Spaced Text in Azure AI Document Intelligence Custom Extraction

Syed Umair Hasan 90 Reputation points
2024-04-04T23:07:52.3033333+00:00

Hello, I’m currently developing a custom extraction model using Azure AI Document Intelligence and I've encountered a challenge with the tokenization of closely spaced text in PDF documents. When I run the layout analysis on my PDFs and start labeling text for fields and tables, some text elements that are close to each other are being recognized as a single token.

This poses a problem for my model as I need to extract these elements into separate fields.

For example, text within parentheses like "(123-132)(PRD)" needs to be split into two distinct tokens, but the current model recognizes it as one.

If I try manual labeling by drawing regions, will the model be able to accurately process documents with similar closely spaced text?

Is there a recommended approach to ensure the model learns to correctly tokenize such closely spaced text when it is trained? Will the model be able to generalize this from the training examples and accurately separate closely spaced text in new, unseen documents?

Thank you!

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. santoshkc 4,435 Reputation points Microsoft Vendor
    2024-04-05T05:27:43.0466667+00:00

    Hi @Syed Umair Hasan,

    Thank you for reaching out to Microsoft Q&A forum!

    When dealing with closely spaced text in PDF documents, the layout analysis model in Azure AI Document Intelligence may not always accurately separate text elements into distinct tokens.

    Here are answers to each of your questions:

    Manual labeling by drawing regions can handle documents with similar closely spaced text. By drawing a separate region around each text element that needs to be extracted, you show the model that the elements belong to different fields, even when the layout analysis returned them as a single token. This should let you extract the elements into separate fields.

    The recommended approach is to provide the model with enough training examples that include closely spaced text labeled as separate fields. With such examples, the model can learn to treat that text as separate tokens.

    Yes, the model should be able to generalize from the training examples and separate closely spaced text in new, unseen documents, provided it has been trained on a diverse set of examples that are representative of the documents it will process. Including training documents that contain closely spaced text is crucial. The more such examples you provide, the better the model will be at recognizing closely spaced text as separate tokens.

    Hope this helps. If you have any further queries, do let us know.


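
For values that still come back fused after training, such as the "(123-132)(PRD)" example from the question, one possible fallback is to split the extracted string in post-processing. The sketch below is not part of the answer above and does not call any Document Intelligence API. It assumes the fused value is already available as a plain string, and the helper names `split_fused_token` and `to_fields` and the field names `range` and `code` are hypothetical, chosen only for illustration.

```python
import re
from typing import Dict, List, Optional, Sequence

# Matches one parenthesized group without nested parentheses,
# e.g. "(123-132)" or "(PRD)", and captures the text inside it.
PAREN_GROUP = re.compile(r"\(([^()]*)\)")


def split_fused_token(token: str) -> List[str]:
    """Return the contents of each parenthesized group in a fused token.

    If the token contains no parenthesized group, it is returned
    unchanged as a single-item list.
    """
    groups = PAREN_GROUP.findall(token)
    return groups if groups else [token]


def to_fields(
    token: str,
    field_names: Sequence[str] = ("range", "code"),  # hypothetical field names
) -> Optional[Dict[str, str]]:
    """Map the split parts onto field names.

    Returns None when the number of parts does not match the number of
    field names, so unexpected formats can be flagged for review instead
    of being silently mis-assigned.
    """
    parts = split_fused_token(token)
    if len(parts) != len(field_names):
        return None
    return dict(zip(field_names, parts))
```

With this sketch, `to_fields("(123-132)(PRD)")` yields one value per field, while a string with a different number of groups returns `None`. A rule like this only works when the fused values follow a predictable pattern, so it complements labeled training data rather than replacing it.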