How to work with a large PDF stored in Azure Blob Storage

Shaheda Ansari 0 Reputation points
2025-03-27T01:13:47.7233333+00:00

We upload large PDF files, each with 6,000 to 10,000 pages, to Azure Blob Storage. After uploading, we need to work on a PDF and then re-upload the same file. Currently we download the whole file first because we need to process it, but downloading and processing it causes memory issues. Is there a way to upload a PDF page by page so that the complete PDF is assembled at the end, and to download it the same way?

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.

2 answers

  1. Nandamuri Pranay Teja 3,700 Reputation points Microsoft External Staff Moderator
    2025-03-27T02:07:26.4766667+00:00

    Hello Shaheda,

    I understand that you're looking for a way to upload and download large PDF files page by page to avoid memory issues.

    Block blobs allow large files to be uploaded in smaller segments called blocks. In your situation, you can upload the PDF in pieces, staging each piece as a separate block, and then commit the block list to create the complete PDF. Note that blocks are byte ranges of the file rather than PDF pages, so a block only lines up with a page if your processing writes its output that way. This approach lets you upload the file without holding all of it in memory at once.
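
As a rough sketch of the block approach described above (not code from the original answer), the following stages a local file in fixed-size chunks and then commits them in order. It assumes the `azure-storage-blob` v12 SDK and an existing `BlobClient`; the helper names `make_block_id`, `iter_chunks`, and `upload_in_blocks`, and the 4 MiB chunk size, are illustrative choices.

```python
def make_block_id(index, width=6):
    """Return a zero-padded block ID; all block IDs in one blob must have equal length."""
    return str(index).zfill(width)


def iter_chunks(path, chunk_size):
    """Yield successive chunks of at most chunk_size bytes from a local file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def upload_in_blocks(blob_client, path, chunk_size=4 * 1024 * 1024):
    """Stage a local file block by block, then commit the blocks in order.

    blob_client is assumed to be an azure.storage.blob.BlobClient.
    Only one chunk is held in memory at a time.
    """
    # Imported here so the helpers above can be used without the SDK installed.
    from azure.storage.blob import BlobBlock

    staged = []
    for index, chunk in enumerate(iter_chunks(path, chunk_size)):
        block_id = make_block_id(index)
        blob_client.stage_block(block_id=block_id, data=chunk)
        staged.append(BlobBlock(block_id=block_id))

    # The new content only becomes visible once the block list is committed.
    blob_client.commit_block_list(staged)
    return len(staged)
```

A block blob can hold at most 50,000 blocks, so the chunk size should be chosen with the expected file size in mind.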

    Memory management:

    Similarly, when downloading, you can retrieve the blob in smaller byte ranges and write each one to disk (or process it) as it arrives, rather than loading the whole PDF into memory. This keeps memory use low when handling large PDF files.
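
A minimal sketch of a ranged download along those lines, assuming the `azure-storage-blob` v12 SDK and an existing `BlobClient`. The helper names `byte_ranges` and `download_in_ranges` and the 4 MiB range size are hypothetical, chosen only for illustration.

```python
def byte_ranges(total_size, chunk_size):
    """Yield (offset, length) pairs that cover total_size bytes in order."""
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        yield offset, length
        offset += length


def download_in_ranges(blob_client, local_path, chunk_size=4 * 1024 * 1024):
    """Download a blob range by range into a local file.

    blob_client is assumed to be an azure.storage.blob.BlobClient.
    Returns the number of bytes written.
    """
    size = blob_client.get_blob_properties().size
    written = 0
    with open(local_path, "wb") as out:
        for offset, length in byte_ranges(size, chunk_size):
            # Each call fetches only the requested byte range.
            data = blob_client.download_blob(offset=offset, length=length).readall()
            out.write(data)
            written += len(data)
    return written
```

The ranges are plain byte offsets, so the local file is only a valid PDF once every range has been written.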

    References:

    https://stackoverflow.com/questions/49281802/pdfs-in-azure-blob-storage-better-block-or-page-blobs

    Hope the above answer helps! Please let us know if you have any further queries.


    Please do not forget to "Accept the answer” and “up-vote” wherever the information provided helps you, this can be beneficial to other community members. 


  2. Venkatesan S 2,820 Reputation points Microsoft External Staff Moderator
    2025-03-28T11:48:48.3766667+00:00

    Hi @shaheda Ansari

    You asked whether there is a way to upload a PDF page by page so that the complete PDF is assembled at the end, and to download it the same way.

    You can use the Python code below to upload an entire PDF file to Azure Blob Storage. The page-wise step is only simulated: the output prints a line for each page, and the file itself is uploaded as a whole at the end.

    Code (upload):

    import fitz  # PyMuPDF
    from azure.storage.blob import BlobServiceClient

    connection_string = "xxxxx"
    container_name = "sxxx"
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    def process_pdf_and_upload(pdf_path, blob_name):
        """Simulates page-wise processing and uploads only the full PDF at the end."""
        # Open the PDF only to count its pages, then release it.
        doc = fitz.open(pdf_path)
        total_pages = len(doc)
        doc.close()

        # Simulating page uploads (without actually uploading)
        for i in range(total_pages):
            print(f"Uploading page {i+1} of {total_pages}...")

        blob_client = blob_service_client.get_blob_client(container_name, blob_name)

        # Pass the open file object so the SDK streams it in chunks
        # instead of reading the whole PDF into memory first.
        with open(pdf_path, "rb") as pdf_file:
            blob_client.upload_blob(pdf_file, overwrite=True)

        print(f"Uploaded entire PDF ({total_pages} pages): {blob_name}")

    process_pdf_and_upload("<name of pdf>.pdf", "final_large_pdf1.pdf")
    

    Output:

    Uploading page 164 of 168...
    Uploading page 165 of 168...
    Uploading page 166 of 168...
    Uploading page 167 of 168...
    Uploading page 168 of 168...
    Uploaded entire PDF (168 pages): final_large_pdf1.pdf
    

    In my environment, I have a 168-page PDF file that was successfully uploaded to my Azure Blob Storage.


    Code (download):

    import fitz  # PyMuPDF
    from azure.storage.blob import BlobServiceClient

    connection_string = "xxxxx"
    container_name = "xxxxx"
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    def download_and_process_pdf(blob_name, local_pdf_path):
        """Downloads a PDF from Azure Blob Storage and simulates page-wise processing."""
        blob_client = blob_service_client.get_blob_client(container_name, blob_name)

        # Stream the blob straight into a local file rather than
        # holding the whole PDF in memory with readall().
        with open(local_pdf_path, "wb") as file:
            blob_client.download_blob().readinto(file)

        print(f"Downloaded PDF: {blob_name}")

        doc = fitz.open(local_pdf_path)
        total_pages = len(doc)

        # Simulated per-page work; replace the print with real processing.
        for i in range(total_pages):
            print(f"Processing page {i+1} of {total_pages}...")

        doc.close()

    download_and_process_pdf("final_large_pdf1.pdf", "<pdf name>.pdf")
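
The sample above only simulates page-wise processing. If real page-wise handling is needed, one possible direction (a sketch, not part of the original answer) is to split the PDF into smaller part files with PyMuPDF and upload or process each part separately. The names `page_batches` and `split_pdf`, the output naming scheme, and the batch size of 500 pages are assumptions made for illustration.

```python
def page_batches(total_pages, batch_size):
    """Yield inclusive, zero-based (first_page, last_page) pairs covering all pages."""
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages) - 1


def split_pdf(pdf_path, out_prefix, batch_size=500):
    """Write the PDF as several smaller part files and return their paths."""
    # Imported here so page_batches can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    paths = []
    src = fitz.open(pdf_path)
    try:
        for first, last in page_batches(len(src), batch_size):
            part = fitz.open()  # new, empty PDF
            part.insert_pdf(src, from_page=first, to_page=last)
            out_path = f"{out_prefix}_{first + 1:05d}-{last + 1:05d}.pdf"
            part.save(out_path)
            part.close()
            paths.append(out_path)
    finally:
        src.close()
    return paths
```

Each part file can then be uploaded as its own blob, so that later work touches only the parts it needs instead of the full document.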
    

    Hope this answer helps! Please let us know if you have any further queries. I’m happy to assist you further.


