If you're on a Unix-like system, you can check file permissions with the ls -l command.
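For example, a minimal check (the file name example.txt is just an illustration):

```shell
# Create a sample file, then inspect its permissions
touch example.txt
ls -l example.txt
# The first column (e.g. -rw-r--r--) shows read/write/execute
# permissions for the owner, the group, and everyone else
```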
You can use this code to verify the path first:

import os

if os.path.exists(file_path):
    # Open and read the file
    with open(file_path) as f:
        contents = f.read()
else:
    print(f"{file_path} does not exist.")
The binary content you have is the raw bytes of the PDF file. To convert it into a human-readable tabular format directly in Spark without using tabula, you'd need a library or method that can parse PDF content in a distributed manner, which is challenging.
However, you can still work around this. Since you mentioned it works in Azure Databricks, read the PDF there using tabula, convert the resulting table to a format like Parquet or CSV, and save it to the data lake. Afterward, you can easily read that Parquet/CSV file in the Azure Synapse Spark Pool.
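A minimal sketch of that workflow in a Databricks notebook might look like the following. The data lake mount paths and the choice of the first extracted table are hypothetical, and tabula-py (plus a Java runtime) must be installed on the cluster; spark is the session Databricks provides in the notebook:

```python
import tabula  # tabula-py; hypothetical: assumes it is installed on the cluster

# Hypothetical paths on a mounted data lake
pdf_path = "/dbfs/mnt/datalake/raw/report.pdf"
out_path = "/mnt/datalake/curated/report_table"

# Extract tables from the PDF as a list of pandas DataFrames.
# This runs on the driver, not distributed across the cluster.
tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)

# Convert the first extracted table to a Spark DataFrame and save as Parquet
sdf = spark.createDataFrame(tables[0])
sdf.write.mode("overwrite").parquet(out_path)
```

In the Synapse Spark Pool you would then read the result with spark.read.parquet(...), pointing at the same data lake location.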