PDF tag text spanned across two pages to single field

Question

Rohit D 26 Reputation points

We have PDFs from which we are trying to extract metadata using Azure Document Intelligence / Form Recognizer. Sample metadata fields work great with a custom extraction model. However, we have one field called the summary section. This summary text usually appears at the bottom of the page. Sometimes the summary text finishes on the first page, but in many cases it flows onto the second page. While labeling training documents, it seems we can tag a field only on a single page and can't make the selection (the one with green text) flow onto the second page. So whenever we extract the summary, we get data only from the first page. Is there a way to train on summary text that spans two pages?

Saideep Anchuri 9,500 Reputation points Moderator

2025-01-22T04:40:04.2+00:00

Hi Rohit D

Just checking in to see if the below answer provided by @Sina Salam helped.

Thank You.

3 answers

Answer 1

Hello Rohit D,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that your current custom extraction model only captures the summary from the first page.

To extract multi-page summary text from PDFs, consider the following steps:

Try advanced PDF extraction tools such as the Adobe Acrobat SDK or Apache PDFBox. Both offer extensive capabilities for custom extraction, including handling multi-page fields, and Apache PDFBox is an open-source library that can be customized for complex extraction tasks.
You can use tools like PyPDF2 or PDFMiner to merge pages where the summary spans multiple pages before extraction.
Also, implement a script that checks for incomplete summary text and concatenates text from subsequent pages. Python libraries like pandas can be useful for this.
If manual tagging is necessary, consider tools like Labelbox or SuperAnnotate, which offer more efficient tagging workflows.
You can also use machine-learning services such as the Google Cloud Vision API to handle complex document structures and extract text across multiple pages.

I hope this is helpful! Do not hesitate to let me know if you have any other questions.

Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.

Answer 2

Rohit D 26 Reputation points

@Sina Salam thank you for the response. But we are already using Azure Document Intelligence, and the other metadata fields are extracted fine. I am not sure how the Azure SDK is going to help. The PDFs are a mix of actual text and scanned documents, hence we had to use AI. The question is specific to the Document Intelligence training capability for tagging a single field on text that spans two pages.

Saideep Anchuri 9,500 Reputation points Moderator

2025-01-22T06:36:32.5833333+00:00

Hi Rohit D

To work around this, you may need to create two summary key fields, one on page 1 and one on page 2, and concatenate the extracted text from both pages into one summary using the Python SDK.

Hope this workaround works for you.

Thank You.

Rohit D 26 Reputation points

2025-01-22T06:38:11.8766667+00:00

Thank you for the response. Unfortunately, we don't have any control over the input PDF documents.

Saideep Anchuri 9,500 Reputation points Moderator

2025-01-22T06:55:23.0733333+00:00

Hi Rohit D

Could you clarify what you mean by control over the input PDFs? Are you not able to label a portion of the data?

Here is a tip on input data: please process text-based and scanned documents as separate PDFs.

Please refer to the document below and follow the input requirements: custom-model-input

Thank You.

Rohit D 26 Reputation points

2025-01-22T07:23:29.4366667+00:00

This is what I get when labeling across two pages.

Rohit D 26 Reputation points

2025-01-22T08:19:12.94+00:00

Thank you Saideep. The problem is that in some cases there will not be a summary 2. Anyway, I get your point that this is a limitation of the product, so we will have to explore other options. Thank you for your time.

Saideep Anchuri 9,500 Reputation points Moderator

2025-01-22T08:23:00.02+00:00

Hi Rohit D

Thanks for your patience. Please tell us if anything in the conversation above was helpful to you, so that we can convert it to an answer. If you could then click Accept Answer and upvote it, it will help others in the community.

Thank you

Answer 3

Saideep Anchuri 9,500 Reputation points Moderator

Hi Rohit D

You cannot use the same field name across multiple pages, as this is a product limitation.

You have to create two summary fields named "summary1" and "summary2" and tag them on page 1 and page 2 respectively. The final summary is then summary1 + summary2.

Kindly refer to the documentation below: tabular-fields

Reference thread on cross-page labeling: cross-page-labeling-limitation
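As a rough sketch of that concatenation step, and not an official sample, something like the following could work. It assumes the two labels are called summary1 and summary2, and that the Python package in use is azure-ai-formrecognizer with its DocumentAnalysisClient. The helper names merge_summary and analyze_summary are made up for illustration. It also covers the case raised above where the summary ends on page 1 and the second field is not returned.

```python
from types import SimpleNamespace
from typing import Any, Mapping, Optional, Sequence


def merge_summary(fields: Mapping[str, Any],
                  names: Sequence[str] = ("summary1", "summary2")) -> Optional[str]:
    """Join the text of the per-page summary fields, skipping absent or empty ones."""
    parts = []
    for name in names:
        field = fields.get(name)
        # A field the model did not find may be missing or carry no content.
        text = getattr(field, "content", None) if field is not None else None
        if text and text.strip():
            parts.append(text.strip())
    return " ".join(parts) if parts else None


def analyze_summary(endpoint: str, key: str, model_id: str, pdf_path: str) -> Optional[str]:
    """Run a custom model on one PDF and return the merged summary (needs the SDK)."""
    # Imported lazily so merge_summary can be used without the SDK installed.
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.formrecognizer import DocumentAnalysisClient

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(pdf_path, "rb") as f:
        poller = client.begin_analyze_document(model_id, document=f)
        result = poller.result()
    if not result.documents:
        return None
    return merge_summary(result.documents[0].fields)


# Offline demonstration with stand-in field objects (no Azure call is made).
both = {"summary1": SimpleNamespace(content="The report covers Q1"),
        "summary2": SimpleNamespace(content="and Q2 results.")}
only_first = {"summary1": SimpleNamespace(content="Short summary.")}

print(merge_summary(both))
print(merge_summary(only_first))
```

Joining with a single space is a simplification. If the page break can fall in the middle of a hyphenated word, the join logic would need adjusting.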

Thank You.

Saideep Anchuri 9,500 Reputation points Moderator

2025-01-23T00:55:02.16+00:00

Hi Rohit D

Following up to see if the above answer was helpful. If it answers your query, please click Accept Answer and Yes for "Was this answer helpful". If you have any further queries, do let us know.

Thank You.
