Spark pool - tabula.read_pdf error - No such file or directory

Nagesh CL 696 Reputation points
2023-10-25T14:23:08.7466667+00:00

Hi Team,

I am trying to extract tabular data from a PDF using tabula in a Spark Pool. The PDF file is stored in a data lake, which I have mounted, and the mounted path is passed as input to the read_pdf function. However, I am getting the error "No such file or directory". The strange thing is that it fails only in the Azure Synapse Spark Pool; the same code works fine in an Azure Databricks workspace. I have also tried the abfss URL (data lake URL) instead of the mounted path and still get the same error. (Screenshot of the error attached.)
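
For context, a minimal sketch of the failing call; the mount path and file name below are placeholders, not the actual values from the screenshot:

import tabula

# Hypothetical mounted path; the real mount point and file name differ.
file_path = "/synfs/<jobId>/<mount-name>/sample.pdf"

# This call fails in the Synapse Spark Pool with "No such file or directory".
tables = tabula.read_pdf(file_path, pages="all")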

The path itself is not the problem: the same path works fine in other functions, for example when reading the same PDF as a binary file, as sketched below. (Screenshot attached.)
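
A minimal sketch of the binary read that succeeds; the abfss URL is a placeholder, and spark is the session predefined in the notebook:

# Spark's binaryFile source reads the same file without any path error.
df = spark.read.format("binaryFile") \
    .load("abfss://<container>@<account>.dfs.core.windows.net/sample.pdf")
df.select("path", "length", "content").show()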

To sum it up, I have a couple of questions:

What is the issue with tabula.read_pdf?

Is there a way to convert the binary data (In second screenshot) to human readable tabular data?

Thanks in advance.

Regards,

Nagesh CL

Tags: Azure Synapse Analytics, Azure Databricks

1 answer

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2023-10-25T16:14:04.4366667+00:00

    If you're on a Unix-like system, you can use the ls -l command to check file permissions.

    You can use this code to verify the path:

    import os

    # file_path is the mounted path you pass to tabula (placeholder below).
    file_path = "/synfs/<jobId>/<mount-name>/sample.pdf"

    if os.path.exists(file_path):
        # Open and read the file to confirm it is accessible.
        with open(file_path, "rb") as f:
            content = f.read()
    else:
        print(f"{file_path} does not exist.")

    The binary content you have represents the raw content of the PDF file. To convert it into a human-readable tabular format directly in Spark without using tabula, you'd need a library or method that can parse PDF content in a distributed manner, which can be challenging.
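
    One caveat, though: the parsing itself does not have to be distributed. If the tabula import works on the Synapse driver and only the path resolution fails, a possible workaround is to let Spark (which understands abfss URLs) fetch the raw bytes to the driver and hand them to tabula as a file-like object instead of a path; tabula-py's read_pdf accepts file-like input. A sketch under those assumptions, with a placeholder URL:

    import io
    import tabula

    # Placeholder URL; substitute your actual data lake path.
    pdf_url = "abfss://<container>@<account>.dfs.core.windows.net/sample.pdf"

    # Let Spark fetch the raw bytes of the PDF to the driver...
    content = (spark.read.format("binaryFile")
                    .load(pdf_url)
                    .select("content")
                    .first()["content"])

    # ...then parse them with tabula via a file-like object, bypassing
    # the local-filesystem path handling that is failing.
    tables = tabula.read_pdf(io.BytesIO(content), pages="all")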

    However, you can still work around this. One way is to use Azure Databricks (since you mentioned it works there): read the PDF using tabula in Databricks, convert the resulting table to a format like Parquet or CSV, and save it to the data lake. Afterward, you can easily read that Parquet/CSV file in the Azure Synapse Spark Pool, as sketched below.
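
    A minimal sketch of that hand-off; the mount point and output location are placeholders:

    # --- In Azure Databricks, where tabula works ---
    import tabula

    # Placeholder /dbfs path; substitute your actual mount and file name.
    tables = tabula.read_pdf("/dbfs/mnt/<mount-name>/sample.pdf", pages="all")

    # tabula returns a list of pandas DataFrames; persist the first one
    # to the data lake as Parquet so Synapse can read it.
    (spark.createDataFrame(tables[0])
         .write.mode("overwrite")
         .parquet("abfss://<container>@<account>.dfs.core.windows.net/output/pdf_table"))

    # --- In the Azure Synapse Spark Pool ---
    df = spark.read.parquet(
        "abfss://<container>@<account>.dfs.core.windows.net/output/pdf_table")
    df.show()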

    1 person found this answer helpful.
