Azure Databricks: read text and tables from PDF files (Python)

CzarR 316 Reputation points
2022-08-22T22:02:02.87+00:00

We have PDF documents that contain scanned images. These images contain free text and table data. I would like to read both the table data and the free text using Python. The end goal is to parse the entire PDF file, convert it to JSON, and store it in a table. So far I have tried the following:

Libraries (1.) through (4.) are free, but they are very inconsistent in reading our PDF files, mostly because the files are scanned images and the tables have no borders.

1.) pip install camelot-py (free)

2.) pip install tabula-py (free)

3.) pip install PyPDF2 (free)

4.) fitz (PyMuPDF) - PDF to JSON (free)

5.) Form Recognizer (licensed)

6.) John Snow Labs (licensed)

Form Recognizer and John Snow Labs worked fine, but because of the image brightness they are not able to parse the headers and certain column data.

Is there any other OCR tool I can try that integrates well with Azure Databricks? A licensed version is fine too.
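
For the end goal described above (tables and free text converted to JSON), the following is a minimal sketch of the conversion step only, assuming an OCR tool has already returned each table as a flat list of cells. The cell shape (`row_index`, `column_index`, `content`) is modeled on the fields Form Recognizer reports for table cells, but the plain dicts and the helper names `cells_to_records` and `document_to_json` are hypothetical. Treating row 0 as the header row is also an assumption.

```python
import json


def cells_to_records(cells):
    """Turn a flat list of table cells into one dict per data row.

    Each cell is a dict with 'row_index', 'column_index' and 'content'
    (hypothetical shape; adapt it to whatever the OCR tool returns).
    Row 0 is assumed to be the header row.
    """
    rows = {}
    for cell in cells:
        row = rows.setdefault(cell["row_index"], {})
        row[cell["column_index"]] = cell["content"].strip()

    # Collect every column index seen, so data is kept even when the
    # OCR step missed the header cell for that column.
    columns = sorted({col for row in rows.values() for col in row})
    header = rows.pop(0, {})

    records = []
    for row_index in sorted(rows):
        row = rows[row_index]
        record = {}
        for col in columns:
            # Fall back to a positional name for unreadable headers.
            key = header.get(col) or f"column_{col}"
            record[key] = row.get(col, "")
        records.append(record)
    return records


def document_to_json(free_text, tables):
    """Bundle the free text and all tables into a single JSON string."""
    payload = {
        "text": free_text,
        "tables": [cells_to_records(cells) for cells in tables],
    }
    return json.dumps(payload, ensure_ascii=False)
```

Columns whose header could not be read get a positional name such as `column_1`, so their values are not silently dropped. The resulting JSON string can then be written to a table column in Databricks.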

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. romungi-MSFT 48,906 Reputation points Microsoft Employee Moderator
    2022-08-23T06:50:34.843+00:00

    @CzarR If you only want to read text, the Computer Vision Read API is another option, but it is not suited to documents with tables.
    If Form Recognizer works for you, you could try the Form Recognizer SDK. I am not sure whether you used Form Recognizer's prebuilt APIs or a custom model.
    A custom model can be trained to address the discrepancies you have seen with the prebuilt models. If the issue with the prebuilt APIs is document quality, you can pre-process the documents with third-party tools to enhance their quality and check whether that helps.

    If an answer is helpful, please accept it or upvote it, which might help other community members reading this thread.

    2 people found this answer helpful.
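
The pre-processing idea in the answer above can be illustrated with a simple contrast stretch, one common way to compensate for washed-out or overly bright scans before OCR. This is only a sketch under stated assumptions: the page is already available as rows of 8-bit grayscale values, the function name `stretch_contrast` and its percentile defaults are hypothetical, and whether this actually improves header recognition on these documents is untested. In practice an imaging library such as Pillow or OpenCV would do the same job on real image files.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly rescale 8-bit grayscale values.

    `pixels` is a list of rows of ints in 0..255. The values found at
    the low and high percentiles are mapped to 0 and 255, and
    everything outside that range is clipped.
    """
    flat = sorted(value for row in pixels for value in row)
    if not flat:
        return [list(row) for row in pixels]

    def percentile(pct):
        # Nearest-rank lookup into the sorted pixel values.
        index = round((len(flat) - 1) * pct / 100.0)
        return flat[index]

    low, high = percentile(low_pct), percentile(high_pct)
    if high <= low:
        # Flat image: nothing to stretch, return an unchanged copy.
        return [list(row) for row in pixels]

    scale = 255.0 / (high - low)
    return [
        [min(255, max(0, round((value - low) * scale))) for value in row]
        for row in pixels
    ]
```

A page whose values all sit in a narrow bright band (for example 200 to 230) is spread across the full 0 to 255 range, which tends to make faint header text stand out more clearly from the background.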
