Azure Form Recognizer duplicating text extracted from PDF

Jakub Lubowicki 1 Reputation point
2021-04-09T06:28:50.92+00:00

While extracting values using Azure Form Recognizer, many values are shown duplicated.

I have trained a custom model, labelling the appropriate key values. I find that the OCR duplicates the boxes, so when I label documents in the sample labeling tool I often get one box inside another. I have to pick one and deselect the other to avoid the value being shown twice.

When I run the model on a new PDF, the predicted values for many keys are also duplicated.

Furthermore, on inspecting the result JSON I can see that many Lines have nested or overlapping bounding boxes. Typically, you would expect a Line with a bounding box and associated text, which in turn contains "Words" whose bounding boxes sit inside the bounding box of the Line.

Just to clarify: in the JSON I am seeing separate Lines whose bounding boxes overlap or are nested inside one another, and whose text is therefore duplicated.
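
For anyone who wants to check for the same pattern in their own output, here is a minimal Python sketch that lists pairs of Lines on the same page whose bounding boxes overlap heavily. It assumes the result JSON follows the v2.1 layout, with lines under `analyzeResult.readResults[].lines[]` and each `boundingBox` given as eight numbers (four x, y corner points); adjust the keys if your response differs. The helper names, the 0.5 threshold, and the sample data are made up for illustration and are not part of any SDK.

```python
from itertools import combinations


def to_rect(bounding_box):
    """Collapse an 8-number polygon [x1, y1, ..., x4, y4] into (left, top, right, bottom)."""
    xs, ys = bounding_box[0::2], bounding_box[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller rectangle (0.0 to 1.0)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (width * height) / smaller if smaller > 0 else 0.0


def find_overlapping_lines(result, threshold=0.5):
    """Return (page, text_a, text_b, ratio) for line pairs that overlap at or above threshold."""
    hits = []
    # Assumed layout: analyzeResult.readResults[].lines[] (v2.1-style response).
    for page in result.get("analyzeResult", {}).get("readResults", []):
        for first, second in combinations(page.get("lines", []), 2):
            ratio = overlap_ratio(to_rect(first["boundingBox"]), to_rect(second["boundingBox"]))
            if ratio >= threshold:
                hits.append((page.get("page"), first["text"], second["text"], ratio))
    return hits


# Hypothetical sample: the second line sits entirely inside the first.
sample = {
    "analyzeResult": {
        "readResults": [
            {
                "page": 1,
                "lines": [
                    {"text": "Invoice No: 12345",
                     "boundingBox": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]},
                    {"text": "12345",
                     "boundingBox": [2.8, 1.05, 3.9, 1.05, 3.9, 1.45, 2.8, 1.45]},
                    {"text": "Total",
                     "boundingBox": [1.0, 3.0, 2.0, 3.0, 2.0, 3.4, 1.0, 3.4]},
                ],
            }
        ]
    }
}

for page_number, text_a, text_b, ratio in find_overlapping_lines(sample):
    print(f"page {page_number}: {text_a!r} overlaps {text_b!r} (ratio {ratio:.2f})")
```

The ratio is measured against the smaller box, so a line fully nested inside a larger one scores close to 1.0 even when the outer line is much bigger. Rotated text is only approximated, because each polygon is reduced to its axis-aligned bounds.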

Any clues as to why this can be?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Jakub Lubowicki 1 Reputation point
    2021-04-09T12:16:58.48+00:00