Azure Document Intelligence - prebuild-layout - find original page number in markdown result

Question

Curati Filippo 10

I'm using the Azure Document Intelligence service to analyze different types of documents. I set the output content format to Markdown to get more information about the structure and formatting of the document.

To get the Markdown result in Python, I use the following code:

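import os

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, AnalyzeResult, ContentFormat
from azure.core.credentials import AzureKeyCredential

# Create the client from the endpoint and key stored in environment variables.
document_intelligence_client = DocumentIntelligenceClient(
    endpoint=os.getenv("DOCUMENT_INTELLIGENCE_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("DOCUMENT_INTELLIGENCE_KEY")),
)

# Analyze the file with the layout model and request Markdown output.
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    AnalyzeDocumentRequest(bytes_source=self.__file_bytes.read()),
    output_content_format=ContentFormat.MARKDOWN,
)
result: AnalyzeResult = poller.result()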

This code works without errors and returns correctly formatted Markdown text.

If the original document contains a footer with the page number, a specific tag with the detected page number is reported in the Markdown result.

Document footer: (screenshot)

Markdown result tag: (screenshot)

However, if the original document does not have a footer with the page number, the Markdown result gives no indication of where one page ends and the next begins; it reads as a single continuous document.

Analyzing the JSON structure returned by the Document Intelligence service, I saw that the document is subdivided into pages, and that a list of lines is returned for each page.

(screenshot in python code)

I tried to rebuild the pages using the "content" property of the line elements, but the resulting text does not match the full Markdown text.

Furthermore, if a page ends with a table or an image just before the footer that holds the page number, the page number is not reported as the last line of the page. This makes it difficult to design an algorithm that identifies the end of each page in the Markdown result.

Is there anyone who has encountered the same problems as me? Can anyone recommend an effective method for splitting the result in Markdown format while maintaining the page subdivision of the original document?

santoshkc 15,325 Reputation points Microsoft External Staff Moderator

2024-05-06T07:37:19.55+00:00

Hi @Curati Filippo,

Thank you for reaching out to Microsoft Q&A forum!

To split the Markdown output into pages while maintaining the page subdivision of the original document, you can try using the pageResults nodes in the JSON output returned by the Document Intelligence service. You can also use the bounding boxes of the elements on the page to determine the end of the page. Additionally, you can try using third-party libraries or tools that are specifically designed for splitting Markdown documents into pages.
For more info see: Document Intelligence layout model.

I hope you understand! Thank you.
santoshkc 15,325 Reputation points Microsoft External Staff Moderator

2024-05-07T06:56:56.0733333+00:00

Hi @Curati Filippo,

We haven't heard back from you on the last response and are just checking whether it was helpful. If you have found a resolution, please share it with the community, as it can be helpful to others. Thank you.
Curati Filippo 10 Reputation points

2024-05-07T07:43:27.5833333+00:00

Hi, thanks for the quick reply!

These are the nodes I find in the response to the Document Intelligence service call:

In the pages branch I can't find a textual content property that holds the text of the whole page, which would let me divide the entire Markdown content by page.

The only textual content property is on each individual line of the page. However, that text does not reproduce the line breaks or the formatting of the complete Markdown result. This is why I was unable to divide the pages based on the first and last lines of each page node.

(As you can see, the line nodes do not represent the actual lines of markdown text)

Do you have any examples of using bounding boxes? I didn't find a box for the page itself; the only boxes belong to lines or to elements inside the text, and I don't know how to locate them in the complete Markdown text in order to divide it.

Regarding third-party libraries for splitting Markdown text: which library could split a Markdown string that contains no page references into the pages of the original document (a PDF, for example)?

Thanks so much for any insights or suggestions!
santoshkc 15,325 Reputation points Microsoft External Staff Moderator

2024-05-07T13:35:27.4766667+00:00

Hi @Curati Filippo,

I apologize for the trouble you are facing. You can try using the pages branch in the JSON output returned by the Document Intelligence service and grouping the line nodes based on their page number property. You can also use the bounding regions property of each line node to determine the end of the page.
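
One way to read the "pages branch" suggestion is sketched below. It relies on the `spans` list that each page node carries (an `offset` and `length` into the top-level `content` string) and slices `content` per page. It works on a plain dict shaped like the JSON response; with the Python SDK the corresponding attributes are `result.content`, `page.page_number`, `page.spans`, `span.offset` and `span.length`. Two assumptions are not confirmed anywhere in this thread: that the page spans index into the Markdown `content` when Markdown output is requested, and that the offsets line up with Python string indices (code points). The `sample` response is hand-made for illustration.

```python
from typing import Any


def split_content_by_page(result: dict[str, Any]) -> dict[int, str]:
    """Slice the full `content` string into one chunk per page using page spans."""
    content = result["content"]
    pages: dict[int, str] = {}
    for page in result["pages"]:
        # A page may carry several spans; join them in the order given.
        chunks = [
            content[span["offset"]: span["offset"] + span["length"]]
            for span in page["spans"]
        ]
        pages[page["pageNumber"]] = "".join(chunks)
    return pages


# Hypothetical, hand-made response that only mimics the shape of the real JSON.
sample = {
    "content": "# Title\n\nFirst page text.\n\nSecond page text.\n",
    "pages": [
        {"pageNumber": 1, "spans": [{"offset": 0, "length": 27}]},
        {"pageNumber": 2, "spans": [{"offset": 27, "length": 18}]},
    ],
}

for number, text in split_content_by_page(sample).items():
    print(f"--- page {number} ---")
    print(text)
```

If the slices come out shifted on documents with emoji or other characters outside the Basic Multilingual Plane, the offset unit is the likely cause and is worth checking against the service's string index type setting.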

If you still face errors or are unable to do this, I request that you raise a support case through the Azure portal. This will allow you to get assistance from Azure support in resolving the issue you're facing.
Flo 0 Reputation points

2024-06-05T15:43:46.89+00:00

Hi,

Did you find any solution to get the page number in Markdown? I am facing the same issue and would be very interested.

Thx,

Flo
Curati Filippo 10 Reputation points

2024-06-10T13:59:12.3366667+00:00

Unfortunately, I still haven't found a solution. I also asked for help on StackOverflow, but so far there is no solution for my case there either.
This is the reference to the issue on StackOverflow:
https://stackoverflow.com/questions/78424757/azure-document-intelligence-prebuild-layout-find-original-page-number-in-mar

Answer 1

Curati Filippo 10

Hi!

No, I haven't found a solution to the problem yet. I will open a support request to find out more and look for a solution.

Xi, Jonathan 25 Reputation points

2024-12-16T16:49:52.4033333+00:00

Hello @Curati Filippo

We’re also interested in customizing the page number in the markdown. Just wondering if you’ve had any luck with this since the support request?
Curati Filippo 10 Reputation points

2024-12-17T08:06:04.1633333+00:00

Hi Jonathan,

after my request for support I was contacted by Azure support and we looked at the problem together.

They recognized it as a software problem and reported it to the development department, but they didn't confirm whether or when it will be fixed.

At the moment I have no news unfortunately.

For my application I reconstructed the pages using the lines in the paragraphs, although I know this is not a precise method.
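
The workaround described here could look roughly like the sketch below. This is not the poster's actual code; it is a minimal illustration that groups the `paragraphs` entries of the response by the `pageNumber` of their first bounding region. It uses a plain dict shaped like the JSON response, and `sample_result` is hand-made. As the comment above already says, the method is approximate: paragraph text does not carry the Markdown formatting of tables and headings, and a paragraph that continues onto the next page is assigned to the page where it starts.

```python
from collections import defaultdict
from typing import Any


def paragraphs_by_page(result: dict[str, Any]) -> dict[int, list[str]]:
    """Group paragraph texts by the page of their first bounding region."""
    grouped: defaultdict[int, list[str]] = defaultdict(list)
    for paragraph in result.get("paragraphs", []):
        regions = paragraph.get("boundingRegions", [])
        if not regions:
            continue  # no location information, so the page is unknown
        # A paragraph that spans two pages is filed under the page where it starts.
        grouped[regions[0]["pageNumber"]].append(paragraph["content"])
    return dict(grouped)


# Hypothetical, hand-made response that only mimics the shape of the real JSON.
sample_result = {
    "paragraphs": [
        {"content": "Title", "boundingRegions": [{"pageNumber": 1}]},
        {"content": "First page text.", "boundingRegions": [{"pageNumber": 1}]},
        {"content": "Second page text.", "boundingRegions": [{"pageNumber": 2}]},
    ]
}

for number, texts in paragraphs_by_page(sample_result).items():
    print(number, texts)
```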
Xi, Jonathan 25 Reputation points

2024-12-18T15:32:06.5433333+00:00

Thanks @Curati Filippo for the quick reply! Just curious: if you only use the lines, do you still need the Layout model, or could you just use the Read model? The Read model is much cheaper than the Layout model. :)

Another possible solution: I noticed that in the new v4 API the output contains "<!-- PageBreak -->" for each page, so we could probably replace this flag with the real page number. However, I can't find any documentation about "<!-- PageBreak -->", so I'm not sure whether it's a temporary change. @santoshkc can you please help confirm?
santoshkc 15,325 Reputation points Microsoft External Staff Moderator

2024-12-18T15:47:25.9433333+00:00

Hi @Xi, Jonathan,

I would recommend creating a new thread on the same forum with as much detail about your issue as possible. That will give your issue better visibility in the community.
Curati Filippo 10 Reputation points

2024-12-19T07:55:51.34+00:00

@Xi, Jonathan,

Using the Read model is probably the more economical choice, and in this case it would have been just as effective.

I will now check the v4 API version for the <!-- PageBreak --> tag, because replacing it with the page number could be the solution!
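
The replacement idea can be sketched as follows, assuming the per-page marker is the HTML comment `<!-- PageBreak -->` (the name is inferred from the title of the follow-up thread linked in this discussion, not from official documentation). The `<!-- Page N -->` label written by `number_pages` is a hypothetical format chosen for this example, and page numbers are simply counted from `first_page`, so they will not match printed page numbers that start elsewhere.

```python
import re

# Assumed marker; surrounding whitespace is consumed so pages come out trimmed.
PAGE_BREAK = re.compile(r"\s*<!--\s*PageBreak\s*-->\s*")


def split_on_page_breaks(markdown: str) -> list[str]:
    """Return one Markdown chunk per page, split at the page-break markers."""
    return PAGE_BREAK.split(markdown)


def number_pages(markdown: str, first_page: int = 1) -> str:
    """Rebuild the Markdown with a (hypothetical) page label before each page."""
    pages = split_on_page_breaks(markdown)
    return "\n\n".join(
        f"<!-- Page {number} -->\n{text}"
        for number, text in enumerate(pages, start=first_page)
    )


sample_markdown = (
    "# Intro\n\nPage one.\n<!-- PageBreak -->\n"
    "Page two.\n<!-- PageBreak -->\n\nPage three."
)

print(number_pages(sample_markdown))
```

Empty chunks are kept on purpose, since a blank page in the source would otherwise shift every later page number.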
Xi, Jonathan 25 Reputation points

2024-12-19T08:42:27.1933333+00:00

@santoshkc Sure, opened https://learn.microsoft.com/en-us/answers/questions/2133936/the-pagebreak-in-azure-ai-document-intelligence-v4.
