Why does the Azure Document Intelligence read model place bboxes (word polygons) at the wrong position when creating a searchable PDF from a PDF whose pages have a cropbox? The model should correct for the cropbox offset.

John van Lennep 0 Reputation points
2024-08-29T12:51:50.0533333+00:00

When I submit a multipage PDF with a cropbox on its pages to the "2024-07-31-preview" API version, the resulting searchable PDF has its bboxes (word polygons) at the wrong position.
It looks like the word polygons are not corrected for the cropbox offset-x and offset-y.
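
To make the suspected offset concrete, here is a minimal Python sketch of the arithmetic (my actual snippets are in C#, this is only an illustration). It assumes the polygons are reported as a flat list of inches measured from the top-left of the visible (cropped) page, and that the text layer is then placed relative to the MediaBox instead of the CropBox. That second assumption is my guess and is not confirmed. The function name, the box values and the sample polygon are hypothetical, and page rotation is ignored.

```python
POINTS_PER_INCH = 72.0  # default PDF user-space unit: 1/72 inch


def polygon_to_pdf_points(polygon_inches, box):
    """Map a flat [x1, y1, x2, y2, ...] polygon to PDF user-space points.

    polygon_inches: inches, origin at the top-left of the page, y pointing down.
    box: (left, bottom, right, top) in points; the page box the polygon
         is assumed to be measured against.
    """
    left, _bottom, _right, top = box
    points = []
    for i in range(0, len(polygon_inches), 2):
        x_in, y_in = polygon_inches[i], polygon_inches[i + 1]
        # PDF user space has its origin at the bottom-left with y pointing up,
        # so x is added to the left edge and y is subtracted from the top edge.
        points.append((left + x_in * POINTS_PER_INCH, top - y_in * POINTS_PER_INCH))
    return points


# Hypothetical page: US Letter MediaBox with a smaller CropBox inside it.
media_box = (0.0, 0.0, 612.0, 792.0)
crop_box = (36.0, 72.0, 576.0, 720.0)

# Hypothetical word polygon, one inch from the top-left of the visible page.
word = [1.0, 1.0, 2.0, 1.0, 2.0, 1.25, 1.0, 1.25]

placed_against_media = polygon_to_pdf_points(word, media_box)  # suspected behaviour
placed_against_crop = polygon_to_pdf_points(word, crop_box)    # expected behaviour

# The difference between the two placements is the cropbox offset.
offset_x = placed_against_crop[0][0] - placed_against_media[0][0]
offset_y = placed_against_crop[0][1] - placed_against_media[0][1]
```

With these sample boxes the two placements differ by 36 points horizontally and 72 points vertically, which is the kind of constant shift I see between the word polygons and the visible text.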

However, when I first correct the original PDF for the cropbox and submit that corrected file to the Azure Document Intelligence read model, the downloaded PDF has the bboxes at the correct position.

Correcting the original PDF is what I would like to avoid, since I would always have to check if the original PDF contains a cropbox.
I would rather provide the original PDF and have the correct position of the word polygons.
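
For reference, the check I would like to avoid can be sketched roughly like this in Python. This is not my C# code: it assumes the third-party `pypdf` package, and the helper names and the tolerance value are hypothetical. As far as I know, `pypdf` reports the MediaBox as the CropBox when a page has no CropBox entry, so comparing the two boxes covers both cases.

```python
def has_effective_cropbox(mediabox, cropbox, tolerance=0.01):
    """Return True when the crop box differs from the media box.

    Both boxes are (left, bottom, right, top) tuples in PDF points.
    """
    return any(abs(float(m) - float(c)) > tolerance
               for m, c in zip(mediabox, cropbox))


def pages_with_cropbox(path):
    """List the zero-based indexes of pages whose CropBox differs from the MediaBox."""
    from pypdf import PdfReader  # third-party dependency (assumed to be installed)

    reader = PdfReader(path)
    indexes = []
    for index, page in enumerate(reader.pages):
        media = page.mediabox
        crop = page.cropbox
        media_tuple = (media.left, media.bottom, media.right, media.top)
        crop_tuple = (crop.left, crop.bottom, crop.right, crop.top)
        if has_effective_cropbox(media_tuple, crop_tuple):
            indexes.append(index)
    return indexes
```

Only files for which `pages_with_cropbox` returns a non-empty list would need the correction step before being sent to the read model.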

I have attached C# code snippets showing how I adjust the original PDF and how I send the adjusted PDF to the Form Recognizer endpoint.

Code snippets.txt

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. John van Lennep 0 Reputation points
    2024-08-30T06:45:35.2933333+00:00

    Thanks for your quick response. These are the files.

    The original file:
    Multi Page.pdf

    The searchable PDF created by Document Intelligence Studio (incorrect polygons):
    MultiPage_IncorrectPolygons.pdf

    The original file corrected for the cropbox:
    Multipage_NoCropBox.pdf

    The searchable PDF created by Document Intelligence Studio (correct polygons):
    MultiPage_OCR.pdf

