Thank you for reaching out to the Microsoft Q&A forum!
When dealing with closely spaced text in PDF documents, the layout analysis model in Azure AI Document Intelligence may not always accurately separate text elements into distinct tokens.
Here are possible approaches for each of your queries:
Manual labeling by drawing regions can help you process documents with similar closely spaced text accurately. By drawing regions around the text elements that need to be extracted, you help the model learn to recognize them as separate tokens. This should let you extract the text elements into separate fields effectively.
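If it helps to decide where regions are worth drawing, here is a minimal sketch that flags neighbouring words on the same line whose horizontal gap is small relative to their height. It assumes you have already exported the words from your layout results into a simple record with text and an axis-aligned box; the `Word` type, the `find_tight_pairs` helper, and the 0.25 ratio are all hypothetical and only for illustration, not part of the Document Intelligence SDK or its response format.

```python
from typing import NamedTuple


class Word(NamedTuple):
    # Hypothetical, simplified word record: text plus an axis-aligned box.
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def find_tight_pairs(words, max_gap_ratio=0.25):
    """Return (left_text, right_text, gap) for tightly spaced same-line neighbours.

    The gap is compared with the average height of the two words. The default
    ratio is an arbitrary starting point (an assumption), not a recommended value.
    """
    ordered = sorted(words, key=lambda w: w.x_min)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            # Words are treated as being on the same line if their boxes overlap vertically.
            overlap = min(left.y_max, right.y_max) - max(left.y_min, right.y_min)
            if overlap <= 0:
                continue
            gap = right.x_min - left.x_max
            height = ((left.y_max - left.y_min) + (right.y_max - right.y_min)) / 2
            if gap <= max_gap_ratio * height:
                pairs.append((left.text, right.text, gap))
            break  # only the nearest same-line neighbour to the right is checked
    return pairs


# Made-up sample data, purely to show the call.
words = [
    Word("Name:", 10, 10, 50, 22),
    Word("John", 52, 10, 80, 22),
    Word("Date", 120, 10, 150, 22),
    Word("Total", 10, 40, 45, 52),
]
tight = find_tight_pairs(words)
print(tight)
```

Pairs reported this way are only candidates for review. Whether a region actually needs to be drawn still depends on how the fields should be split in your documents.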
The recommended approach is to provide the model with enough training examples that include closely spaced text that needs to be extracted into separate fields. This lets the model learn to tokenize such text and recognize the elements as separate tokens.
Yes, the model should be able to generalize from the training examples and separate closely spaced text in new, unseen documents, provided it has been trained on a diverse set of examples that are representative of the documents it will process. Providing more training documents that include closely spaced text is crucial here: the more such examples you provide, the better the model should become at recognizing closely spaced text as separate tokens.
Hope this helps. If you have any further queries, do let us know.
If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful".