Read pdf file from storage account (Azure Data lake) without downloading it using python

Rachit Agarwal 0 Reputation points

I am trying to read a PDF file that I have uploaded to an Azure storage account, using Python. I have tried passing the file's SAS token/URL to PDFMiner, but I cannot get a path to the file that PDFMiner will accept. I am using something like the code below:

from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.filedatalake import generate_file_sas
import os
storage_account_name = "mystorageaccount"
storage_account_key = "mystoragekey"
container_name = "mycontainer"
directory_name = 'mydirectory'

service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
        "https", storage_account_name), credential=storage_account_key)
file_system_client = service_client.get_file_system_client(file_system=container_name)
directory_client   = file_system_client.get_directory_client(directory_name)
file_client = directory_client.get_file_client('XXX.pdf')
download = file_client.download_file()
downloaded_bytes = download.readall()

file_sas = generate_file_sas(account_name=storage_account_name,
                             file_system_name=container_name,
                             directory_name=directory_name,
                             file_name='XXX.pdf',
                             credential=storage_account_key)

from pdfminer.pdfpage import PDFPage
with open(downloaded_bytes, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)

from pdfminer.pdfpage import PDFPage
with open(file_sas, 'rb') as infile:
    PDFPage.get_pages(infile, check_extractable=False)

Azure Data Lake Storage
Azure Storage Accounts

1 answer

  1. SaiKishor-MSFT 17,141 Reputation points

    @Rachit Agarwal Thanks for reaching out to Microsoft Q&A. I understand that you want to read a PDF file from a storage account using Python, is that right?

    I think these links will help you do this:

    Sample code for ADLS Gen2 - Generate SAS Token-

    Read Data from a File-

    I hope these sample files will help you with your code. If it still does not work, please do share the error message that you are receiving. Thank you!
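
For reference, here is a minimal sketch of one way to hand the file to PDFMiner without writing it to local disk. It is not taken from the linked samples. It assumes `file_client` is the `DataLakeFileClient` built in the question and that pdfminer.six is installed; the helper names `download_to_stream` and `count_pdf_pages` are hypothetical. The built-in `open()` expects a filesystem path, so it accepts neither the downloaded bytes nor the string returned by `generate_file_sas` (which is only a SAS token, not a path). Wrapping the bytes in `io.BytesIO` gives PDFMiner the seekable file-like object it needs. The bytes are still fetched over the network, but they stay in memory.

```python
import io


def download_to_stream(file_client):
    """Read a remote file fully into memory and return a seekable stream.

    `file_client` is assumed to behave like
    azure.storage.filedatalake.DataLakeFileClient: download_file() returns
    a downloader object whose readall() returns the file content as bytes.
    """
    data = file_client.download_file().readall()
    return io.BytesIO(data)


def count_pdf_pages(stream):
    """Count the pages of a PDF held in a binary file-like object."""
    # Imported lazily so download_to_stream works without pdfminer installed.
    from pdfminer.pdfpage import PDFPage

    return sum(1 for _ in PDFPage.get_pages(stream, check_extractable=False))


# Usage, with the file_client from the question:
#     stream = download_to_stream(file_client)
#     print(count_pdf_pages(stream))
```

The same stream can be passed to other PDFMiner entry points that accept a file object. Call `stream.seek(0)` before reusing it.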
