PDF tag text spanned across two pages to single field

Rohit D 26 Reputation points
2025-01-21T15:44:52.45+00:00

We have PDFs from which we are trying to extract metadata using Azure Document Intelligence (Form Recognizer). Most metadata fields work great with a custom extraction model. However, we have one field called the summary section. The summary text usually appears at the bottom of a page. Sometimes the summary finishes on the first page, but in many cases it flows onto the second page. When labeling training documents, it seems we can tag a field only on a single page and can't make the selection (the one with green text) flow onto the second page. So whenever we extract the summary, we get data only from the first page. Is there a way to train on summary text that spans two pages? User's image

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

3 answers

  1. Sina Salam 22,031 Reputation points Volunteer Moderator
    2025-01-21T22:23:33.49+00:00

    Hello Rohit D,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that your current custom extraction model only captures the summary from the first page.

    To extract summary text that spans multiple pages of a PDF, consider the following steps:

    • Try advanced PDF extraction tools such as the Adobe Acrobat SDK and Apache PDFBox. They offer extensive capabilities for custom extraction, including handling multi-page fields, and Apache PDFBox is an open-source library that can be customized for complex extraction tasks.
    • You can use tools like PyPDF2 or PDFMiner to merge the pages that the summary spans before extraction.
    • Also, implement a script that checks for incomplete summary text and concatenates text from the subsequent page. Python libraries like pandas can be useful for this.
    • If manual tagging is necessary, consider tools like Labelbox or SuperAnnotate, which offer more efficient tagging workflows.
    • You can also use machine-learning-based text extraction that supports multi-page documents, such as the Google Cloud Vision API, to handle complex document structures and extract text across multiple pages.

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.


  2. Rohit D 26 Reputation points
    2025-01-22T05:31:47.82+00:00

    @Sina Salam thank you for the response. But we are already using Azure Document Intelligence, and the other metadata fields are extracted fine. I am not sure how the Azure SDK is going to help. The PDFs are a mix of actual text and scanned documents, hence we had to use AI. The question is specific to Document Intelligence's training capability for tagging a single field on text that spans two pages.


  3. Saideep Anchuri 9,500 Reputation points Moderator
    2025-01-22T08:15:28.8466667+00:00

    Hi Rohit D

    You cannot use the same field name across multiple pages, as this is a product limitation.

    You have to create two summary fields named "summary1" and "summary2" and tag them on page 1 and page 2 respectively. The final summary will then be summary1 + summary2.

    Kindly refer to the documentation below: tabular-fields

    Reference thread on cross-page labeling: cross-page-labeling-limitation

    Thank You.
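
The summary1 + summary2 workaround above can be sketched in Python as a small post-processing step. This is only a minimal sketch under stated assumptions: the field names "summary1" and "summary2" come from the answer, while the helper name `join_summary_parts` and the plain dictionary of field name to extracted text are hypothetical. With the Python SDK you would likely build that dictionary from the analysis result first (for example from each field's text content under `result.documents[0].fields`, which is an assumption to verify against your SDK version).

```python
from typing import Mapping, Optional, Sequence


def join_summary_parts(
    fields: Mapping[str, Optional[str]],
    part_names: Sequence[str] = ("summary1", "summary2"),
) -> str:
    """Concatenate per-page summary fields into one summary string.

    `fields` maps a field name to its extracted text (hypothetical shape).
    A part that is missing, None, or blank is skipped, so documents whose
    summary ends on the first page still produce a clean result.
    """
    parts = []
    for name in part_names:
        text = fields.get(name)
        if text and text.strip():
            parts.append(text.strip())
    # Join with a single space because the page break falls mid-paragraph.
    return " ".join(parts)


if __name__ == "__main__":
    extracted = {
        "title": "Quarterly report",
        "summary1": "Revenue grew in Q3 and",
        "summary2": "costs stayed flat.",
    }
    print(join_summary_parts(extracted))
```

Skipping empty parts matters here because, per the question, some documents have a summary that fits on one page, so "summary2" may simply not be returned for them.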

