Identify and split a PDF with multiple invoices

Kelvin 6 Reputation points
2021-04-26T08:32:28.093+00:00

I have a PDF file containing multiple invoices, where each invoice may span multiple pages. Supporting documents may also be included in the PDF. I would like to split the PDF into separate files, one per invoice.

Are there any suggestions for identifying which pages belong to which invoice and splitting them into separate invoice files?

Thank you.

Azure AI services

2 answers

  1. Ramr-msft 17,826 Reputation points
    2021-04-26T11:48:21.627+00:00

    @Kelvin Thanks for the question. Microsoft OCR can now process a multi-page document and extract text only from selected pages or a page range. You should be able to use this directly instead of splitting the PDF and sending individual images.

    Please see the documentation: What's new in Computer Vision? - Azure Cognitive Services | Microsoft Learn
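As a rough illustration of the page-selection idea, here is a minimal sketch that builds a request URL restricted to a page range. The path `/vision/v3.2/read/analyze` and the `pages` query parameter are assumptions based on the Read 3.2 REST API, and `build_read_url` is a hypothetical helper, so please check the current documentation before relying on either. The actual request (subscription key header, PDF body, polling for the result) is not shown.

```python
from urllib.parse import urlencode


def build_read_url(endpoint: str, pages: str) -> str:
    """Build an analyze URL limited to a page selection such as '1-3,5'.

    Assumption: the service accepts a `pages` query parameter on the
    Read 3.2 analyze route. Verify against the current API reference.
    """
    base = endpoint.rstrip("/") + "/vision/v3.2/read/analyze"
    # Keep ',' and '-' unescaped so the page selection stays readable.
    return base + "?" + urlencode({"pages": pages}, safe=",-")


# Hypothetical endpoint, for illustration only.
url = build_read_url("https://example.cognitiveservices.azure.com/", "1-3,5")
print(url)
```

With this approach the original PDF is uploaded as-is and only the selected pages are processed.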

    Alternatively, you can use image segmentation to identify the contours of the receipts. OpenCV can be used for this preprocessing step.
    This notebook has a sample implementation:
    https://www.kaggle.com/dmitryyemelyanov/receipt-ocr-part-1-image-segmentation-by-opencv/notebook
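Once per-page text is available from OCR, one possible way to decide where each invoice starts is to look for an invoice number on every page. The sketch below is a simple text heuristic, not the contour-based method from the notebook. The regular expression, `find_invoice_number`, and `group_pages` are all hypothetical and would need adapting to the actual invoice layout. Pages without a recognizable invoice number (continuation pages, supporting documents) are attached to the preceding invoice.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: "Invoice No", "Invoice Number" or "Invoice #",
# followed by an identifier that contains at least one digit.
INVOICE_NO = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]*\d[A-Z0-9\-]*)",
    re.IGNORECASE,
)


def find_invoice_number(text: str) -> Optional[str]:
    """Return the first invoice number found in the page text, if any."""
    match = INVOICE_NO.search(text)
    return match.group(1).upper() if match else None


def group_pages(page_texts: List[str]) -> List[Tuple[Optional[str], List[int]]]:
    """Group 1-based page numbers into (invoice_number, pages) tuples."""
    groups: List[Tuple[Optional[str], List[int]]] = []
    for page_no, text in enumerate(page_texts, start=1):
        number = find_invoice_number(text)
        # Same number, or no number at all: the page continues the current group.
        if groups and (number is None or number == groups[-1][0]):
            groups[-1][1].append(page_no)
        else:
            groups.append((number, [page_no]))
    return groups


# Made-up page texts, for illustration only.
pages = [
    "Invoice No: INV-001\nItem A",
    "Page 2 of 2 continued",
    "Invoice No: INV-002",
    "Delivery receipt (supporting document)",
]
for number, page_list in group_pages(pages):
    print(number, page_list)
```

Each resulting page list could then be passed to a PDF library to write one file per invoice. That step is omitted here.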


  2. Uridah Sami 1 Reputation point
    2022-01-03T04:24:39.463+00:00

    I am trying to do exactly the same thing. @Kelvin did you find a solution for this?
