Why does the Azure Document Intelligence read model place bboxes (word polygons) in the wrong position when it creates a searchable PDF from a PDF whose pages have a cropbox? The model should correct for the cropbox offset.

Question

When I submit a multipage PDF whose pages have a cropbox to the "2024-07-31-preview" API version, the resulting searchable PDF has its bboxes (word polygons) in the wrong position.
It looks like the word polygons are not corrected for the cropbox x and y offsets.
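
To illustrate the correction I would expect, here is a minimal Python sketch of the coordinate math. It rests on several assumptions that the service has not confirmed: that word polygons are flat `[x1, y1, x2, y2, ...]` lists in inches measured from the top-left corner of the visible (cropped) page, that the page is not rotated, and that the cropbox is given as `(left, bottom, right, top)` in PDF points. The function name `polygon_inches_to_pdf_points` is hypothetical and not part of any SDK.

```python
from typing import List, Sequence, Tuple

POINTS_PER_INCH = 72.0  # PDF default user space: 1 point = 1/72 inch


def polygon_inches_to_pdf_points(
    polygon_in: Sequence[float],
    crop_box: Tuple[float, float, float, float],
) -> List[Tuple[float, float]]:
    """Map a flat polygon in inches (origin: top-left of the visible page)
    to PDF user-space points (origin: bottom-left, y pointing up).

    Assumption: crop_box is (left, bottom, right, top) in points and the
    page has no rotation or UserUnit scaling.
    """
    if len(polygon_in) % 2 != 0:
        raise ValueError("polygon must hold an even number of coordinates")

    left, _bottom, _right, top = crop_box
    points = []
    for i in range(0, len(polygon_in), 2):
        # Shift x by the cropbox left edge (the x offset).
        x = left + polygon_in[i] * POINTS_PER_INCH
        # Flip y and measure down from the cropbox top edge (the y offset).
        y = top - polygon_in[i + 1] * POINTS_PER_INCH
        points.append((x, y))
    return points


# Example: a cropbox starting 36 pt from the left edge, with its top at 576 pt.
crop_box = (36.0, 72.0, 396.0, 576.0)
word_polygon = [1.0, 0.5, 2.0, 0.5, 2.0, 0.75, 1.0, 0.75]
corrected = polygon_inches_to_pdf_points(word_polygon, crop_box)
```

If the cropbox offsets are left out (that is, `left` is treated as 0 and `top` as the mediabox height), every polygon shifts by a constant amount, which would match what I see in the output PDF.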

However, when I first correct the original PDF for the cropbox and provide that file to the Azure Document Intelligence read model, the downloaded PDF has the bboxes in the correct position.

Correcting the original PDF is what I would like to avoid, since I would always have to check whether the original PDF contains a cropbox.
I would rather provide the original PDF and get the word polygons in the correct position.

I have attached C# code snippets showing how I adjust the original PDF and how I send the adjusted PDF to the Form Recognizer endpoint.

Code snippets.txt

Answer

Thanks for your quick response. These are the files.

The original file:
Multi Page.pdf

The searchable PDF created by Document Intelligence Studio (incorrect polygons):
MultiPage_IncorrectPolygons.pdf

The original file corrected for the cropbox:
Multipage_NoCropBox.pdf

The searchable PDF created by Document Intelligence Studio (correct polygons):
MultiPage_OCR.pdf
