Spark pool - tabula.read_pdf error - No such file or directory

Nagesh CL 696 Reputation points
2023-10-25T14:23:08.7466667+00:00

Hi Team,

I am trying to extract tabular data from a PDF using tabula in a Spark Pool. The PDF file is stored in a data lake, which I have mounted, and the mounted path is passed as input to the read_pdf function. However, I am getting the error "No such file or directory". The strange thing is that it fails only in the Azure Synapse Spark Pool; the same code works fine in an Azure Databricks workspace. I have also tried the abfss URL (data lake URL) instead of the mounted path and still get the same error. (Screenshot of the error attached.)
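
For context, a minimal sketch of the failing call; the mount path and file name below are placeholders, not the actual values from the screenshot:

import tabula

# Hypothetical mounted path; the real mount point and file name differ.
file_path = "/synfs/<jobId>/<mount-name>/sample.pdf"

# This call fails in the Synapse Spark Pool with "No such file or directory".
tables = tabula.read_pdf(file_path, pages="all")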

The path itself is not the problem: the same path works fine in other functions, for example when reading the same PDF as a binary file, as sketched below. (Screenshot attached.)
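
A minimal sketch of the binary read that succeeds; the abfss URL is a placeholder, and spark is the session predefined in the notebook:

# Spark's binaryFile source reads the same file without any path error.
df = spark.read.format("binaryFile") \
    .load("abfss://<container>@<account>.dfs.core.windows.net/sample.pdf")
df.select("path", "length", "content").show()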

To sum it up, I have a couple of questions:

What is the issue with tabula.read_pdf?

Is there a way to convert the binary data (In second screenshot) to human readable tabular data?

Thanks in advance.

Regards,

Nagesh CL

Tags: Azure Synapse Analytics, Azure Databricks

1 answer

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2023-10-25T16:14:04.4366667+00:00

    If you're on a Unix-like system, you can use the ls -l command to check file permissions.

    You can use this code to verify the path:

    import os

    # file_path is the mounted path you pass to tabula (placeholder below).
    file_path = "/synfs/<jobId>/<mount-name>/sample.pdf"

    if os.path.exists(file_path):
        # Open and read the file to confirm it is accessible.
        with open(file_path, "rb") as f:
            content = f.read()
    else:
        print(f"{file_path} does not exist.")

    The binary content you have represents the raw content of the PDF file. To convert it into a human-readable tabular format directly in Spark without using tabula, you'd need a library or method that can parse PDF content in a distributed manner, which can be challenging.
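
    One caveat, though: the parsing itself does not have to be distributed. If the tabula import works on the Synapse driver and only the path resolution fails, a possible workaround is to let Spark (which understands abfss URLs) fetch the raw bytes to the driver and hand them to tabula as a file-like object instead of a path; tabula-py's read_pdf accepts file-like input. A sketch under those assumptions, with a placeholder URL:

    import io
    import tabula

    # Placeholder URL; substitute your actual data lake path.
    pdf_url = "abfss://<container>@<account>.dfs.core.windows.net/sample.pdf"

    # Let Spark fetch the raw bytes of the PDF to the driver...
    content = (spark.read.format("binaryFile")
                    .load(pdf_url)
                    .select("content")
                    .first()["content"])

    # ...then parse them with tabula via a file-like object, bypassing
    # the local-filesystem path handling that is failing.
    tables = tabula.read_pdf(io.BytesIO(content), pages="all")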

    However, you can still work around this. One way is to use Azure Databricks (since you mentioned it works there): read the PDF using tabula in Databricks, convert the resulting table to a format like Parquet or CSV, and save it to the data lake. Afterward, you can easily read that Parquet/CSV file in the Azure Synapse Spark Pool, as sketched below.
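
    A minimal sketch of that hand-off; the mount point and output location are placeholders:

    # --- In Azure Databricks, where tabula works ---
    import tabula

    # Placeholder /dbfs path; substitute your actual mount and file name.
    tables = tabula.read_pdf("/dbfs/mnt/<mount-name>/sample.pdf", pages="all")

    # tabula returns a list of pandas DataFrames; persist the first one
    # to the data lake as Parquet so Synapse can read it.
    (spark.createDataFrame(tables[0])
         .write.mode("overwrite")
         .parquet("abfss://<container>@<account>.dfs.core.windows.net/output/pdf_table"))

    # --- In the Azure Synapse Spark Pool ---
    df = spark.read.parquet(
        "abfss://<container>@<account>.dfs.core.windows.net/output/pdf_table")
    df.show()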

    1 person found this answer helpful.
