Identify and split pdf with multiple invoices

Question

Kelvin 6

I have a PDF file containing multiple invoices, where each invoice may span multiple pages. Supporting documents may also be included in the PDF. I would like to split the PDF into one file per invoice.

Are there any suggestions for identifying which pages belong to which invoice and splitting them into separate invoice files?

Thank you.

2 answers

Answer 1

Ramr-msft 17,826

@Kelvin Thanks for the question. Microsoft OCR has added the capability to process a multi-page document and extract text only for selected pages or a page range, which you should be able to use directly instead of splitting the PDF and sending individual images.

Please see the doc: What's new in Computer Vision? - Azure Cognitive Services | Microsoft Learn

Alternatively, you could do image segmentation to identify the contours of the receipts. OpenCV can be used for this preprocessing step.
You can take a look at this notebook with a sample implementation:
https://www.kaggle.com/dmitryyemelyanov/receipt-ocr-part-1-image-segmentation-by-opencv/notebook

Kelvin 6 Reputation points

2021-04-27T02:24:38.567+00:00

Thanks for your suggestions. Computer Vision could be a great solution if the page range were known. However, since the PDF is a combination of supporting docs and invoices with an unknown number of pages, the page range of each invoice is dynamic and unpredictable.
The reason for splitting the PDF is to separate it into multiple files. Each split invoice file would be stored in SharePoint, and I would capture text from it.

The PDF is either digital or scanned, so identifying contours is not necessary in my case. But thanks for your input all the same.

Ramr-msft 17,826 Reputation points

2021-04-27T03:56:39.47+00:00

@Kelvin Thanks for the details. For preprocessing, have a look at the Knowledge Extraction Recipe: https://github.com/microsoft/knowledge-extraction-recipes-forms. You can also raise a feature request in the GitHub repo. For the split itself, Python libraries such as pdfsplit or PyPDF2 can be used.
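
For illustration only, here is a minimal sketch of how the PyPDF2 route might look. It assumes the first page of every invoice carries a recognizable marker such as "Invoice No", "Invoice Number" or "Invoice #" in its text layer. That marker, the `INVOICE_START` pattern, and the `group_pages` / `split_pdf` helper names are all assumptions made for this example, not something taken from your files. Pages that follow a marker (including supporting docs) are kept with that invoice until the next marker, and pages before the first marker are skipped. Scanned pages have no text layer, so they would need to go through OCR first to get the page text.

```python
import re
from pathlib import Path

# Assumed marker for the first page of an invoice (hypothetical; adjust to your layout).
INVOICE_START = re.compile(r"invoice\s*(no\.?|number|#)", re.IGNORECASE)


def group_pages(page_texts, start_pattern=INVOICE_START):
    """Group page indexes into invoices.

    A page whose text matches start_pattern opens a new group; every
    following page is appended to that group until the next match.
    Pages before the first match are skipped.
    """
    groups = []
    for index, text in enumerate(page_texts):
        if start_pattern.search(text or ""):
            groups.append([index])
        elif groups:
            groups[-1].append(index)
    return groups


def split_pdf(pdf_path, out_dir):
    """Write one PDF per detected invoice and return the output paths."""
    # Third-party dependency (pip install PyPDF2), imported here so the
    # grouping logic above can be used and tested without it.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(str(pdf_path))
    # extract_text() yields nothing useful for image-only (scanned) pages.
    texts = [page.extract_text() or "" for page in reader.pages]

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    outputs = []
    for number, pages in enumerate(group_pages(texts), start=1):
        writer = PdfWriter()
        for page_index in pages:
            writer.add_page(reader.pages[page_index])
        target = out_dir / f"invoice_{number:03d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        outputs.append(target)
    return outputs
```

Calling `split_pdf("combined.pdf", "out")` (both names are placeholders) would write `invoice_001.pdf`, `invoice_002.pdf` and so on into `out`. Keeping the grouping separate from the PDF handling means the page texts could just as well come from an OCR service for scanned files.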

Answer 2

Uridah Sami 1

I am trying to do exactly the same thing. @Kelvin could you find a solution for this?
