Uploading to a blob in parallel with Python SDK

Anonymous
2022-09-16T13:50:37.613+00:00

Hi there,

It seems that the older Python SDK had an option to manage parallel uploads:
https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blockblobservice.blockblobservice?view=azure-python-previous&viewFallbackFrom=azure-python#azure-storage-blob-blockblobservice-blockblobservice-create-blob-from-path

It seems that this option has since been removed.
The Blob Storage documentation says that for a large upload it is definitely a good idea to upload chunks in parallel:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#upload-one-large-blob-quickly

I haven't been able to find a good answer on whether you can speed up the upload with the current Python SDK. I was wondering whether parallel upload is simply hidden and done automatically in the background, or whether it's just not implemented yet.
I've been using the upload_blob method
https://learn.microsoft.com/en-us/azure/developer/python/sdk/storage/azure-storage-blob/azure.storage.blob.blobclient?view=storage-py-v12#upload-blob-data--blob-type--blobtype-blockblob---blockblob----length-none--metadata-none----kwargs-

but it definitely takes a while, even with a good fibre connection, and I don't believe it should take that long (17 minutes for 4 GB). I'm sure I've missed something. Is there a better way to upload files to a blob more quickly?
Thanks

Azure Storage
Globally unique resources that provide access to data management services and serve as the parent namespace for the services.
Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.

1 answer

  1. Sumarigo-MSFT 47,466 Reputation points Microsoft Employee Moderator
    2022-09-30T17:58:09.353+00:00

    @Anonymous Start by checking the end-to-end (E2E) latency metric on your storage account. It measures the interval from when the Azure Storage server receives the first packet of the request until it receives a client acknowledgment on the last packet of the response. In simpler terms, it is the round trip of an operation: it starts at the client application, includes the time taken to process the request at the storage server, and ends when the response is back at the client application.

    The common things to check to see whether there is a performance issue with the client application/VM are:

    1. High CPU on the client: While the application is running, check whether there is a spike in the CPU of the client machine. If CPU usage is higher than expected, it will cause performance issues for the application and in turn increase the latency value.
    2. Low available memory on the client: Check if you are running low on memory on the client machine.
    3. Running out of network bandwidth on the client.
    4. Misconfigured client application.
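To judge item 3, it can help to turn the figures from the question into a sustained throughput and compare that with the uplink speed of the client. The following is only a rough sketch: it assumes that "4GB" means 4 GiB and that the whole 17 minutes was spent transferring data, and the function name `throughput` is made up for illustration.

```python
def throughput(size_bytes: int, seconds: float) -> tuple[float, float]:
    """Return the average transfer rate as (MiB per second, megabits per second)."""
    mib_per_s = size_bytes / seconds / 2**20
    mbit_per_s = size_bytes * 8 / seconds / 1e6
    return mib_per_s, mbit_per_s


# Figures from the question: 4 GB (assumed to be 4 GiB) uploaded in 17 minutes.
mib_s, mbit_s = throughput(4 * 2**30, 17 * 60)
print(f"{mib_s:.1f} MiB/s, {mbit_s:.0f} Mbit/s")
```

That works out to roughly 4 MiB/s, or about 34 Mbit/s. If the connection can sustain much more than that, the network is probably not the limiting factor and the client configuration is worth a closer look.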

    This article should help you isolate the issue: https://techcommunity.microsoft.com/t5/azure-paas-blog/how-to-isolate-latency-issue-for-azure-storage-account/ba-p/1430656

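    In the Azure portal, go to Storage Account -> Metrics -> Add metric and add both E2E latency and Server latency. There you can compare the two: if E2E latency is much higher than Server latency, most of the time is being spent outside the storage service, that is, on the client or on the network.

    (Screenshot: E2E latency and Server latency metrics for the storage account)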

    While parallelism can be great for performance, be careful about using unbounded parallelism, meaning that there is no limit enforced on the number of threads or parallel requests. Be sure to limit parallel requests to upload or download data, to access multiple partitions in the same storage account, or to access multiple items in the same partition. If parallelism is unbounded, your application can exceed the client device's capabilities or the storage account's scalability targets, resulting in longer latencies and throttling.
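As an illustration of bounded parallelism, here is a minimal sketch that caps the number of transfers in flight with a fixed-size thread pool. It does not call Azure at all: `transfer_all` and `FakeTransfer` are hypothetical names, the stand-in transfer only sleeps, and the limit of 4 is an arbitrary value for the example rather than a recommendation.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def transfer_all(items, transfer, max_parallel=4):
    """Call transfer(item) for every item with at most max_parallel running at once."""
    # The pool size is the bound: extra items wait in the queue instead of
    # opening more connections.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(transfer, items))  # results keep the input order


class FakeTransfer:
    """Stand-in for an upload or download call; records the peak concurrency seen."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, item):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # pretend to move some data
        with self._lock:
            self.active -= 1
        return item


fake = FakeTransfer()
results = transfer_all(range(20), fake, max_parallel=4)
print(len(results), "transfers done, peak concurrency:", fake.peak)
```

In a real client the callable would wrap the actual upload or download, and the limit would be tuned against the client machine's CPU, memory, and bandwidth.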

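    Parallel upload settings in the Python SDK:

    max_block_size (int): The maximum chunk size for uploading a block blob in chunks. Defaults to 4*1024*1024, or 4MB. This is a keyword argument of the client constructor.

    max_concurrency (int): Maximum number of parallel connections to use when the blob size exceeds 64MB. This is a keyword argument of upload_blob.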

    https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-upload-blob
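Putting the two settings together, a single large upload might look like the sketch below. It assumes the azure-storage-blob v12 package is installed. `upload_large_file` and `block_count` are hypothetical helper names, the connection string, container, blob name, and path are supplied by the caller, and the 8 MiB block size and 8 connections are example values to experiment with, not recommendations.

```python
import math


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks a chunked upload would need for a file of this size."""
    return math.ceil(file_size / block_size)


def upload_large_file(conn_str, container, blob_name, path,
                      block_size=8 * 1024 * 1024, max_concurrency=8):
    """Upload one file as a block blob using several parallel connections."""
    # Imported here so that block_count() can be used without the SDK installed.
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        conn_str,
        container_name=container,
        blob_name=blob_name,
        max_block_size=block_size,  # chunk size, set on the client
    )
    with open(path, "rb") as data:
        # max_concurrency is passed per upload call, not on the client.
        blob.upload_blob(data, overwrite=True, max_concurrency=max_concurrency)


# A 4 GiB file at the default 4 MiB block size is split into this many blocks:
print(block_count(4 * 1024**3, 4 * 1024**2))
```

Raising max_concurrency is usually the first thing to try; a larger block size reduces the number of requests but uses more memory per connection.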

    The "Performance and scalability checklist for Blob storage" article identifies key practices that developers can follow to optimize performance. Keep these practices in mind while you are designing your application and throughout the process.

    Please let us know if you have any further queries. I’m happy to assist you further.


    Please do not forget to "Accept Answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.
