Azure Document Intelligence - prebuilt-layout - find original page number in markdown result

Curati Filippo 0 Reputation points
2024-05-03T15:08:47.8466667+00:00

I'm using the Azure Document Intelligence service to analyze different types of documents. I set the output content format to Markdown to get more information about the structure and formatting of the document.

To get the result as Markdown in Python, I use the following code:

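import os

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, AnalyzeResult, ContentFormat
from azure.core.credentials import AzureKeyCredential

# Client configured from environment variables
document_intelligence_client = DocumentIntelligenceClient(
    endpoint=os.getenv("DOCUMENT_INTELLIGENCE_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("DOCUMENT_INTELLIGENCE_KEY")),
)

# Excerpt from a class method: self.__file_bytes is the open file object held by the class.
# Analyze with the layout model and request Markdown output.
poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-layout",
    AnalyzeDocumentRequest(bytes_source=self.__file_bytes.read()),
    output_content_format=ContentFormat.MARKDOWN,
)
result: AnalyzeResult = poller.result()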

This code runs without errors and returns correctly formatted Markdown text.

If the original document contains a footer with the page number, a specific tag with the detected page number is reported in the Markdown result.

Document footer:

User's image

Markdown result tag:

User's image
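
For readers who cannot see the screenshot, here is a minimal sketch of how such a tag could be located in the Markdown string. It assumes the tag is an HTML comment of the form `<!-- PageNumber="1" -->`; that shape is an assumption and should be checked against the actual output of the API version in use. The helper name `find_page_number_tags` and the sample string are hypothetical.

```python
import re

# Assumed tag shape (an HTML comment); verify against real service output.
PAGE_NUMBER_TAG = re.compile(r'<!--\s*PageNumber="([^"]*)"\s*-->')


def find_page_number_tags(markdown: str) -> list[tuple[int, str]]:
    """Return (character offset, detected page number text) for each tag found."""
    return [(m.start(), m.group(1)) for m in PAGE_NUMBER_TAG.finditer(markdown)]


# Hand-written sample, not real service output.
sample_markdown = (
    "# Report\n\nSome text.\n\n"
    '<!-- PageNumber="1" -->\n\n'
    "More text.\n\n"
    '<!-- PageNumber="2" -->\n'
)
tags = find_page_number_tags(sample_markdown)
```

Because the detected value is whatever text was printed in the footer, it is kept as a string here rather than converted to an integer.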

However, if the original document has no footer with a page number, the Markdown result contains no indication of where pages begin or end; the content comes back as one continuous document.

Looking at the JSON structure returned by the Document Intelligence service, I saw that the document is subdivided into pages, and that a list of lines is returned for each page.

(screenshot from the Python code)

User's image

I tried to rebuild the pages from the "content" property of each page's line elements, but the resulting text does not match the full Markdown text.
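
One alternative that may be worth trying is to slice the top-level content string by each page's spans instead of joining lines. In the JSON response each page carries a "spans" list of offset/length pairs that index into the top-level "content". The sketch below works on the plain dictionary form of the response. The function name `split_content_by_page` and the sample data are hypothetical, and the sketch assumes the offsets count the same units as Python string indices, which may not hold for every string index setting or for text containing unusual characters.

```python
from typing import Any


def split_content_by_page(analyze_result: dict[str, Any]) -> dict[int, str]:
    """Slice the top-level content string into one chunk per page using page spans."""
    content = analyze_result["content"]
    chunks: dict[int, str] = {}
    for page in analyze_result["pages"]:
        # A page may have several spans; concatenate them in the order given.
        parts = [
            content[span["offset"]: span["offset"] + span["length"]]
            for span in page["spans"]
        ]
        chunks[page["pageNumber"]] = "".join(parts)
    return chunks


# Hand-made stand-in for the service response (not real output).
page_one = "# Title\n\nFirst page text.\n\n"
page_two = "Second page text."
sample_result = {
    "content": page_one + page_two,
    "pages": [
        {"pageNumber": 1, "spans": [{"offset": 0, "length": len(page_one)}]},
        {"pageNumber": 2, "spans": [{"offset": len(page_one), "length": len(page_two)}]},
    ],
}
pages_markdown = split_content_by_page(sample_result)
```

Unlike joining the "content" of individual lines, slicing keeps whatever Markdown markup falls inside each page's span, so the chunks concatenate back to the original string when the spans cover it fully.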

Furthermore, if a page ends with a table or an image that comes before the footer with the page number, the page number is not the last actual line of the page in the result. This makes it difficult to design an algorithm that identifies where pages end in the Markdown result.

User's image

User's image

Has anyone encountered the same problems? Can anyone recommend an effective method for splitting the Markdown result while preserving the page subdivision of the original document?

Azure AI Document Intelligence