Azure Form Recognizer duplicating text extracted from PDF

Jakub Lubowicki 1 Reputation point

When extracting values with Azure Form Recognizer, many of the values appear duplicated.

I have trained a custom model, labelling the appropriate key values. The OCR appears to duplicate the boxes: when labelling with the sample labeling tool, I often get one box inside another. I have to pick one and deselect the other to avoid the value being shown twice.

When I run the model on a new PDF, the values for many keys are also duplicated.

Furthermore, on inspecting the result JSON, I can see that many Lines have nested or overlapping bounding boxes. Typically you would expect a Line with a bounding box and associated text, which in turn contains "Words" whose bounding boxes lie inside the bounding box of the Line.

Just to clarify: in the JSON I am seeing Lines whose bounding boxes, and therefore text, overlap or are nested within one another.
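To make the nesting easier to demonstrate, the Lines on a page can be compared pairwise. The following is only a minimal sketch in Python: it assumes each Line is a dict with a `text` field and a `boundingBox` given as eight numbers (x1, y1, ..., x4, y4), which is how I understand the v2.1 result layout. The helper names, the sample data, and the 0.8 threshold are hypothetical and not taken from the actual response.

```python
from itertools import combinations


def to_rect(bounding_box):
    """Convert an 8-number polygon [x1, y1, ..., x4, y4] to (left, top, right, bottom)."""
    xs = bounding_box[0::2]
    ys = bounding_box[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def area(rect):
    """Area of an axis-aligned rectangle (left, top, right, bottom)."""
    return max(0.0, rect[2] - rect[0]) * max(0.0, rect[3] - rect[1])


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller rectangle.

    A value of 1.0 means the smaller box lies entirely inside the larger one.
    """
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if right <= left or bottom <= top:
        return 0.0
    smaller = min(area(a), area(b))
    if smaller == 0:
        return 0.0
    return (right - left) * (bottom - top) / smaller


def find_overlapping_lines(lines, threshold=0.8):
    """Return (text_a, text_b, ratio) for every pair of lines that mostly overlap.

    The threshold is an arbitrary illustrative value, not a service default.
    """
    hits = []
    for first, second in combinations(lines, 2):
        ratio = overlap_ratio(to_rect(first["boundingBox"]),
                              to_rect(second["boundingBox"]))
        if ratio >= threshold:
            hits.append((first["text"], second["text"], ratio))
    return hits


# Made-up sample data imitating the lines of one page (assumed field names).
sample_lines = [
    {"text": "Invoice No 12345",
     "boundingBox": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]},
    {"text": "12345",
     "boundingBox": [2.8, 1.05, 3.9, 1.05, 3.9, 1.45, 2.8, 1.45]},
    {"text": "Total",
     "boundingBox": [1.0, 2.0, 2.0, 2.0, 2.0, 2.5, 1.0, 2.5]},
]

pairs = find_overlapping_lines(sample_lines)
for text_a, text_b, ratio in pairs:
    print(f"{text_a!r} overlaps {text_b!r} (ratio {ratio:.2f})")
```

With a real response, the list of Lines for each page would be passed in instead of `sample_lines`. Using the axis-aligned rectangle ignores any page rotation, which seems acceptable for a quick check but may misjudge skewed scans.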

Any clues as to why this might be happening?

Ramr-msft 17,736 Reputation points

2021-04-09T11:50:39.987+00:00

@Jakub Lubowicki Thanks for the question. Could you please share the sample input document so that we can check this? Please also share a screenshot and the JSON response that you are getting.
Please follow the document to train a custom model using the sample labeling tool.

1 answer

Jakub Lubowicki 1 Reputation point

2021-04-09T12:16:58.48+00:00

86236-data-00000004.pdf