Hello Rohit D,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that your current custom extraction model only captures the summary from the first page.
To address the problem more effectively and extract summary text that spans multiple pages of a PDF, consider the following steps:
- Try advanced PDF extraction tools such as the Adobe Acrobat SDK or Apache PDFBox. Both offer extensive capabilities for custom extraction, including handling fields that span multiple pages, and Apache PDFBox is an open-source library that can be customized for complex extraction tasks.
- You can use tools like PyPDF2 or PDFMiner to merge pages where the summary spans multiple pages before extraction.
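- Also, implement a script that checks whether the extracted summary text is incomplete and, if so, concatenates the text from the subsequent pages. Plain Python string handling is usually enough for this, and libraries like pandas can be useful if you need to track results across many documents.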
- If manual tagging is necessary, consider using tools like Labelbox or SuperAnnotate which offer more efficient tagging workflows.
- You can also use machine-learning-based text extraction services, such as the Google Cloud Vision API, which can handle complex document structures and extract text across multiple pages.
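As a rough illustration of the concatenation step above, here is a minimal sketch in Python. It assumes the summary begins at a heading such as "Summary" and ends at the next known heading; those marker strings, the function names, and the file name are hypothetical placeholders that you would adapt to your own documents. The PyPDF2 helper is only one possible way to get the per-page text.

```python
import re


def page_texts_from_pdf(path):
    """Return the text of each page as a list of strings (requires PyPDF2 2.x or later)."""
    from PyPDF2 import PdfReader  # imported lazily so the rest works without PyPDF2

    reader = PdfReader(path)
    # extract_text() can return an empty result for image-only pages
    return [page.extract_text() or "" for page in reader.pages]


def extract_multipage_summary(page_texts,
                              start_marker="Summary",
                              end_markers=("Details", "Conclusion")):
    """Join all pages, then return the text between the start marker
    and the first end marker that follows it.

    The marker strings are assumptions for illustration only.
    """
    # Joining the pages first means a page break inside the summary no longer matters
    full_text = "\n".join(page_texts)

    start = full_text.find(start_marker)
    if start == -1:
        return ""  # no summary heading found
    start += len(start_marker)

    # Stop at the earliest end marker after the start; otherwise take the rest
    end = len(full_text)
    for marker in end_markers:
        pos = full_text.find(marker, start)
        if pos != -1:
            end = min(end, pos)

    # Collapse line breaks and repeated spaces left over from the page layout
    return re.sub(r"\s+", " ", full_text[start:end]).strip()


# Example with in-memory page text: the summary starts on page 1 and ends on page 2
pages = [
    "Title\nSummary\nThis report covers",
    "the second quarter results.\nDetails\nMore text",
]
summary = extract_multipage_summary(pages)
```

With a real file you could call `extract_multipage_summary(page_texts_from_pdf("report.pdf"))`. If your documents have no reliable headings, you would need a different rule for deciding where the summary ends, for example a fixed layout region or a keyword check.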
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting this reply and accepting it as an answer if it is helpful.