Read pdf file from storage account (Azure Data lake) without downloading it using python

Rachit Agarwal 0 Reputation points
2023-01-18T16:40:09.2633333+00:00

I am trying to read a PDF file that I have uploaded to an Azure storage account, using Python. I have tried passing the file's SAS token/URL to PDFMiner, but I cannot produce a file path that PDFMiner will accept. I am using something like the code below:



from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import generate_file_sas
import os
storage_account_name = "mystorageaccount"
storage_account_key = "mystoragekey"
container_name = "mycontainer"
directory_name = 'mydirectory'

service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
        "https", storage_account_name), credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
directory_client   = file_system_client.get_directory_client(directory_name)
file_client = directory_client.get_file_client('XXX.pdf')
download = file_client.download_file()
downloaded_bytes = download.readall()

file_sas = generate_file_sas(account_name= storage_account_name,file_system_name= container_name,directory_name= directory_name,file_name= dir_name,credential= storage_account_key)

from pdfminer.pdfpage import PDFPage
with open(downloaded_bytes, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)

from pdfminer.pdfpage import PDFPage
with open(file_sas, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Storage Accounts
Globally unique resources that provide access to data management services and serve as the parent namespace for the services.

1 answer

  1. SaiKishor-MSFT 17,201 Reputation points
    2023-01-27T07:39:19.0333333+00:00

    @Rachit Agarwal Thanks for reaching out to Microsoft Q&A. I understand that you want to read a PDF file from a storage account using Python. Is that right?

    I think these links will help you understand how to do that:

    https://stackoverflow.com/questions/62523166/how-can-i-generate-an-azure-blob-sas-url-in-python

    Sample code for generating a SAS token (note that this sample comes from the Azure Files share SDK, not the ADLS Gen2 SDK): https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-file-share/samples/file_samples_authentication.py#L59

    Sample code for reading data from a file: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-file-datalake/samples/datalake_samples_upload_download.py#L67
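
    As a minimal sketch of the in-memory approach (not taken from the linked samples): `open()` expects a path on disk, so it cannot take the downloaded bytes or a SAS token. Wrapping the bytes in `io.BytesIO` should give PDFMiner the seekable file-like object it needs. The function names `pdf_bytes_to_stream` and `read_pdf_text_from_adls` are hypothetical, and the sketch assumes the `azure-storage-file-datalake` and `pdfminer.six` packages are installed.

```python
import io


def pdf_bytes_to_stream(pdf_bytes: bytes) -> io.BytesIO:
    """Wrap raw PDF bytes in a seekable, file-like object.

    open() only accepts a filesystem path, so it fails on bytes or a SAS
    token. BytesIO provides read() and seek() without touching the disk.
    """
    return io.BytesIO(pdf_bytes)


def read_pdf_text_from_adls(account_name, account_key, container_name, file_path):
    """Hypothetical helper: load a PDF from ADLS Gen2 into memory, return its text.

    The third-party imports are local so that pdf_bytes_to_stream can be
    used on its own without the Azure SDK or pdfminer.six installed.
    """
    from azure.storage.filedatalake import DataLakeServiceClient
    from pdfminer.high_level import extract_text

    service_client = DataLakeServiceClient(
        account_url=f"https://{account_name}.dfs.core.windows.net",
        credential=account_key,
    )
    file_system_client = service_client.get_file_system_client(
        file_system=container_name
    )
    # file_path is relative to the container, e.g. "mydirectory/XXX.pdf"
    file_client = file_system_client.get_file_client(file_path)

    # The content is fetched over the network but never written to local disk.
    pdf_bytes = file_client.download_file().readall()

    # extract_text accepts either a path or a binary file-like object.
    return extract_text(pdf_bytes_to_stream(pdf_bytes))
```

    With the placeholder values from the question, a call might look like `read_pdf_text_from_adls(storage_account_name, storage_account_key, container_name, "mydirectory/XXX.pdf")`. The same stream can also be passed to `PDFPage.get_pages` in place of the `infile` object returned by `open()`.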

    I hope these sample files will help you with your code. If it still does not work, please do share the error message that you are receiving. Thank you!