Creating a Zip File from Blob Storage Using Python in Azure Databricks

Chahat Malik 0 Reputation points
2024-07-24T06:12:46.7266667+00:00

Hello,

I am working on a task where I need to create a zip file for multiple files stored in blob storage, without having to read the files again or using local storage. I am using Python in Azure Databricks and would like to leverage its capabilities for this task. Can someone please provide some guidance or resources on how I can achieve this?

Thank you.

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. Nehruji R 4,766 Reputation points Microsoft Vendor
    2024-07-25T10:40:53.89+00:00

    Hello Chahat Malik,

    Greetings! Welcome to Microsoft Q&A Platform.

    You can build the zip archive entirely in memory with the azure-storage-blob library, so nothing is written to the cluster's local disk. Each blob still has to be downloaded once, because its bytes must pass through the zip writer, but the data stays in memory.

    • Connect to the storage account with the Azure Storage SDK for Python (azure-storage-blob). Mounting the container to Databricks is optional; the SDK approach in the sample does not rely on a mount.
    • Create a zip file in memory: use io.BytesIO together with the zipfile module to build the archive in a memory buffer.
    • Upload the zip file: finally, upload the buffer back to blob storage.

    Note: Because io.BytesIO holds the whole archive in memory, this approach suits sets of files that fit comfortably in the driver's memory.
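
    The in-memory step can be tried on its own, without any Azure connection. The following is a minimal sketch using only the standard library; the helper name build_zip_in_memory and the sample file contents are hypothetical, chosen for illustration.

```python
import io
import zipfile


def build_zip_in_memory(files):
    """Return the bytes of a zip archive built from a {name: bytes} mapping.

    Hypothetical helper for illustration; nothing is written to disk.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    # The archive is complete only after the ZipFile context has closed.
    return buffer.getvalue()


# Made-up sample contents standing in for downloaded blob bytes.
sample = {"file1.txt": b"hello", "file2.txt": b"world"}
zip_bytes = build_zip_in_memory(sample)

# Read the archive back from memory to confirm its contents.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
```

    Returning the bytes only after the with block has exited matters: the zip central directory is written when the ZipFile is closed, so a buffer read earlier would be an incomplete archive.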

    Sample code for reference:

    from azure.storage.blob import BlobServiceClient
    import io
    import zipfile

    # Initialize the BlobServiceClient (replace the placeholder with your connection string)
    blob_service_client = BlobServiceClient.from_connection_string("your_connection_string")

    # Specify the container and the list of blobs to be zipped
    container_name = "your_container_name"
    blob_names = ["file1.txt", "file2.txt", "file3.txt"]

    # Create a BytesIO object to hold the zip file in memory
    zip_buffer = io.BytesIO()

    # Create a ZipFile object; ZIP_DEFLATED compresses entries (the default only stores them)
    with zipfile.ZipFile(zip_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip_file:
        for blob_name in blob_names:
            # Get the blob client
            blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

            # Download the blob content as bytes (held in memory, not on disk)
            blob_data = blob_client.download_blob().readall()

            # Write the blob content to the zip file under the blob's name
            zip_file.writestr(blob_name, blob_data)

    # Rewind the buffer so the upload reads from the start
    zip_buffer.seek(0)

    # Upload the zip file to blob storage
    zip_blob_client = blob_service_client.get_blob_client(container=container_name, blob="zipped_files.zip")
    zip_blob_client.upload_blob(zip_buffer, overwrite=True)

    print("Zip file created and uploaded successfully.")
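
    In the sample, readall() loads each blob fully into memory before it is written to the archive. For larger blobs, one possible variation is to copy each blob into the archive chunk by chunk. The sketch below assumes a hypothetical helper, add_stream_to_zip, that accepts any iterable of bytes; with the SDK, blob_client.download_blob().chunks() would be one such iterable. The sample chunks here are made up so the snippet runs without a storage account.

```python
import io
import zipfile


def add_stream_to_zip(archive, name, chunks):
    """Write an iterable of byte chunks into the archive as one entry.

    Hypothetical helper; `chunks` could be blob_client.download_blob().chunks().
    """
    # ZipFile.open(name, "w") returns a writable file-like object for the entry.
    with archive.open(name, "w") as entry:
        for chunk in chunks:
            entry.write(chunk)


zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    # Made-up chunks standing in for a streamed blob download.
    add_stream_to_zip(archive, "file1.txt", [b"first ", b"second ", b"third"])

# Read the entry back from memory to confirm the chunks were joined in order.
with zipfile.ZipFile(io.BytesIO(zip_buffer.getvalue())) as archive:
    content = archive.read("file1.txt")
```

    This lowers the peak memory needed per blob, but the finished archive still lives in the BytesIO buffer, so the total archive size must still fit in memory.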
    
    

    Similar threads for reference: https://stackoverflow.com/questions/18852389/generate-a-zip-file-from-azure-blob-storage-files and https://stackoverflow.com/questions/59713184/how-to-zip-files-on-azure-blob-storage-with-shutil-in-databricks

    Hope this information helps! Please let us know if you have any further queries. I’m happy to assist you further.


    Please “Accept the answer” and “up-vote” wherever the information provided helps you; this can be beneficial to other community members.
