Azure Databricks Read Text and tables from PDF files-python

Question


CzarR 316

We have PDF documents that contain scanned images. These images contain free text and table data. I would like to read both the table data and the free text using Python. The end goal is to parse the entire PDF file, convert it to JSON, and store it in a table. So far I have tried the following:

Libraries (1) through (4) are free, but they are very inconsistent at reading our PDF files, mostly because the files are scanned images and the tables have no borders.

1.) pip install camelot-py (free)

2.) pip install tabula-py (free)

3.) pip install PyPDF2 (free)

4.) fitz - PDF to JSON (free)

5.) FormRecognizer (licensed)

6.) Johnsnowlabs (licensed)

FormRecognizer and Johnsnowlabs worked fine, but because of the image brightness they are not able to parse the headers and some of the column data.

Is there any other OCR tool I can try that integrates well with Azure Databricks? A licensed product is fine too.

1 answer


Answer 1

romungi-MSFT 48,906 Microsoft Employee Moderator

@CzarR If you only want to read text, the Computer Vision Read API is another option, but it is not suited to documents with tables.
If Form Recognizer works for you, you could try the Form Recognizer SDK. I am not sure whether you have used the prebuilt APIs of Form Recognizer or a custom model.
A custom model can be trained to address the discrepancies you have seen with the prebuilt models. If the issue with the prebuilt APIs is document quality, you can pre-process the documents with third-party tools to enhance their quality and check whether that helps.
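
As a rough illustration of the SDK route, here is a minimal sketch that assumes the `azure-ai-formrecognizer` Python package (version 3.2 or later) and the prebuilt layout model. The function names `analyze_pdf` and `table_to_rows`, and the shape of the returned dictionary, are hypothetical choices made for this example. The endpoint and key must come from your own resource. For a custom model, the `"prebuilt-layout"` model ID would be replaced by the custom model's ID.

```python
from typing import Any, List


def table_to_rows(table: Any) -> List[List[str]]:
    """Arrange a table's cells into a row-major grid of strings.

    Expects an object with row_count, column_count and cells, where each
    cell has row_index, column_index and content (the attributes exposed by
    the SDK's table objects).
    """
    rows = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        rows[cell.row_index][cell.column_index] = cell.content
    return rows


def analyze_pdf(path: str, endpoint: str, key: str) -> dict:
    """Send one PDF to the layout model; return its text and tables.

    Hypothetical helper: endpoint and key are placeholders for the values
    of your own Form Recognizer resource.
    """
    # Imported here so table_to_rows can be used without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", document=f)
        result = poller.result()  # blocks until the analysis has finished

    return {
        "text": result.content,
        "tables": [table_to_rows(t) for t in (result.tables or [])],
    }
```

The returned dictionary holds only strings and lists, so `json.dumps(analyze_pdf(path, endpoint, key))` should give a JSON document that can be stored in a table column.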


CzarR 316 Reputation points

2022-08-23T13:42:17.723+00:00

Hi, what kind of tools can I use to improve the document quality? Can they be integrated with Azure Databricks (Python)?

romungi-MSFT 48,906 Reputation points Microsoft Employee Moderator

2022-08-24T07:03:05.37+00:00

@CzarR There is no integrated package for improving image quality. I think you could use OpenCV to improve the image quality. For example, this article on Medium explains how this can be achieved.

Ref: https://medium.com/@joschuck
