DocumentIntelligenceClient returns only 1 page for .docx documents

Thibault Verlinde 120 Reputation points
2024-03-18T15:04:48.7966667+00:00

Hiya

I have some code that analyzes .pdf and .docx documents. When analyzing .pdf documents, I get output with a number of pages equal to the number of pages in the .pdf document. When analyzing a Word document, I get all the output, but it is contained in 1 page. I'm using the newest prerelease NuGet package of AI.DocumentIntelligence.
Is this expected behaviour?

I'm asking because I'm chunking all pages/content for a search index. Every chunk also contains a link to the source file with the original start page of the chunk data. For Word documents this shouldn't always be page 1, but I can't figure out another way to calculate the page number.
Quick example: a pdf with 14 pages will be analyzed and return 14 analyzed pages. A docx with 14 pages will be analyzed and return only 1 analyzed page, but the content of all 14 pages is inside it.
With this output in mind, how do I calculate which content is on which page in the Word document?

Thanks in advance for your help/suggestions!

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. VasaviLankipalle-MSFT 18,676 Reputation points Moderator
    2024-03-18T20:42:00.3333333+00:00

    Hello @Thibault Verlinde, thanks for using the Microsoft Q&A Platform.

    Yes, this is the expected behavior. For Word documents, up to 3,000 characters are counted as one page unit. Additionally, there is no bounding polygon or bounding region information for each detected object, and page range (pages) is not supported as a parameter.
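For the chunking scenario in the question, one possible workaround is to estimate a page number from the character offset of each chunk. The sketch below is in Python for illustration only and calls no SDK. It assumes the 3,000-characters-per-page-unit figure mentioned above, and the names `PAGE_UNIT_CHARS`, `page_unit_for_offset`, and `chunk_with_page_units` are hypothetical helpers, not part of any Azure package. The result is an approximation based on page units and will generally not match the page breaks Word renders.

```python
from typing import Dict, List, Union

# Assumption: one page unit per 3,000 characters, as stated in the answer.
PAGE_UNIT_CHARS = 3000


def page_unit_for_offset(offset: int, chars_per_unit: int = PAGE_UNIT_CHARS) -> int:
    """Return the 1-based page unit that a 0-based character offset falls into."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return offset // chars_per_unit + 1


def chunk_with_page_units(
    content: str, chunk_size: int = 1000
) -> List[Dict[str, Union[int, str]]]:
    """Split the full text into fixed-size chunks and tag each chunk with the
    estimated page unit of its first character."""
    chunks = []
    for start in range(0, len(content), chunk_size):
        chunks.append(
            {
                "start_offset": start,
                "page_unit": page_unit_for_offset(start),
                "text": content[start:start + chunk_size],
            }
        )
    return chunks
```

Here `content` stands for the full text string returned by the analysis. If accurate page numbers are required, converting the .docx to .pdf before analysis is another option to consider, since .pdf input returns one analyzed page per document page, as observed in the question.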

    Versions 2024-02-29-preview, 2023-10-31-preview, and later support Microsoft Office (DOCX, XLSX, PPTX) and HTML files. The following features are not supported for these formats:

    • Page objects have no angle, width/height, or unit.
    • Detected objects have no bounding polygon or bounding region.
    • Page range (pages) is not supported as a parameter.
    • There is no lines object.

    From the documentation: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-read?view=doc-intel-4.0.0#pages

    I hope this helps.

    Regards,

    Vasavi

    - Please accept the answer and vote 'yes' if you found it helpful, to support the community. Thanks.

