md5 hash calculation for large files

Ajin S 1 Reputation point
2021-02-22T07:50:10.503+00:00

In Azure Blob Storage I want to upload a file larger than 5 GB as a block blob. From the docs I understood that a file larger than the maximum block size (100 MB) is split into multiple blocks, and that Azure cannot calculate the hash for the whole file. However, when we uploaded a large file with the Azure SDK, it was stored as a single blob with an MD5 hash value that covers the complete file. I compared the MD5 hash reported by Azure with one generated locally for the same file, and the two match. So I am a bit confused about how large files are stored in block blobs and how the MD5 hash is generated for them.

Can somebody help me with an answer? Thanks in advance.

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.

1 answer

  1. Sumarigo-MSFT 44,096 Reputation points Microsoft Employee
    2021-02-23T19:49:59.083+00:00

    @Ajin S, welcome to Microsoft Q&A, and thank you for posting your query! Apologies for the delay in responding and for any inconvenience this issue may have caused.

    The best practice is to always calculate the hash of a downloaded blob yourself and keep it as a baseline to compare against next time.

    Blobs uploaded with PutBlob have Content-MD5 calculated by the Storage service. Blobs uploaded with PutBlock/PutBlockList do not; the client needs to calculate the MD5 locally and set it explicitly through the x-ms-blob-content-md5 blob property. If the client doesn't do this, the property is empty. The recommendation above covers both cases. Reference: https://technet2.github.io/Wiki/blogs/windowsazurestorage/windows-azure-blob-md5-overview.html
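
To illustrate how a client can still produce a whole-file MD5 for a blob that is uploaded block by block: MD5 is a streaming hash, so each block can be fed into one running hasher as it is read. The following is a minimal stdlib-only Python sketch, not the SDK's actual implementation; `iter_blocks`, `hash_while_splitting`, and the 4 MiB block size are illustrative assumptions, and the upload calls themselves are left out.

```python
import base64
import hashlib
import io
from typing import BinaryIO, Iterator, List, Tuple

BLOCK_SIZE = 4 * 1024 * 1024  # assumed block size, for illustration only


def iter_blocks(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield successive blocks of at most block_size bytes from stream."""
    while True:
        block = stream.read(block_size)
        if not block:
            return
        yield block


def hash_while_splitting(
    stream: BinaryIO, block_size: int = BLOCK_SIZE
) -> Tuple[List[str], str]:
    """Return (Base64 MD5 of each block, Base64 MD5 of the whole stream)."""
    whole = hashlib.md5()
    per_block: List[str] = []
    for block in iter_blocks(stream, block_size):
        # A real uploader would send this block to the service here (omitted).
        per_block.append(base64.b64encode(hashlib.md5(block).digest()).decode("ascii"))
        # Feeding every block to the same hasher yields the whole-file digest.
        whole.update(block)
    return per_block, base64.b64encode(whole.digest()).decode("ascii")


if __name__ == "__main__":
    blocks, full = hash_while_splitting(io.BytesIO(b"example data" * 1000), block_size=4096)
    print(len(blocks), "blocks, whole-file MD5 (Base64):", full)
```

The whole-stream value is the one a client would place in x-ms-blob-content-md5 when committing the block list; the per-block values only describe individual blocks and cannot be combined into the whole-file hash afterwards.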

    There is a similar discussion on Stack Overflow; please refer to the suggestion there, which gives some idea on your query.

    For larger files, Storage does not calculate the MD5 hash of the full blob because each block is written separately. You can work around this by calculating the MD5 hash yourself and setting it when you upload the file. Note that the MD5 hash is Base64-encoded. The example below uploads a blob while calculating and setting the Content-MD5 property:

    az storage blob upload -c test -n md5test -f ./test.img --content-md5 "$(cat test.img | openssl dgst -md5 -binary | base64)"
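
The same Base64 MD5 can also be computed without shell tools. The Python sketch below reads the file in chunks, so a multi-gigabyte file never has to fit in memory. `md5_base64` and `matches_stored_md5` are hypothetical helper names, and the stored value is assumed to be the Base64 string exposed in the blob's Content-MD5 property.

```python
import base64
import hashlib


def md5_base64(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the Base64-encoded MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_stored_md5(path: str, stored_b64: str) -> bool:
    """Compare a local file's MD5 with a stored Base64 Content-MD5 value."""
    return md5_base64(path) == stored_b64.strip()
```

For example, `matches_stored_md5("./test.img", stored_value)` would return `True` only when the local file hashes to the value recorded on the blob; an empty stored value (the property was never set) never matches.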

    • MD5 hash checks on Azure Blob Storage files
    • Smaller files are not an issue, since all files smaller than 64 MB have Content-MD5 populated by the platform. For larger files, you can either have an Azure Function that reacts to blob events, or run a batch operation that spins up some VMs and calculates the MD5.

    Additional information on the error “The MD5 hash calculated from the downloaded data does not match the MD5 hash stored in the property of source”: for more information see here.

    If you still face any difficulties, please let me know. I would be happy to work more closely with you on this issue.

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------

    Hope this helps!

    Kindly let us know if the above helps or you need further assistance on this issue.

    Please don’t forget to “Accept the answer” and “up-vote” wherever the information provided helps you; this can be beneficial to other community members.