Sample C# code to compare checksums of files migrated from Data Lake Gen1 to Gen2

Amie Barkes 41 Reputation points
2021-03-12T16:09:37.08+00:00

I am in the process of migrating some non-production data from ADLS Gen1 to ADLS Gen2.
I want to use Azure Batch to run a C# program that compares the files by name, by size, and then by checksum (SHA or MD5).
Does anyone have sample C# code that reads the file contents from Gen1 and Gen2 using a file stream and then creates the checksums using a hashing algorithm?
I have attempted this, but the checksums come out different, which makes me think that my method is incorrect.
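
A minimal sketch of the comparison described above, assuming both files are already available as readable `Stream` objects. How those streams are obtained from the Gen1 and Gen2 SDKs is left out, and the class and method names (`ChecksumCompare`, `ComputeSha256Hex`, `SameContent`) are hypothetical. Two common reasons for mismatched checksums on identical data are hashing decoded text instead of raw bytes, and hashing a stream that has already been partly read; the sketch hashes raw bytes from the current position to the end.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class ChecksumCompare
{
    // Hashes the raw bytes from the stream's current position to its end
    // and returns the digest as a lowercase hex string.
    public static string ComputeSha256Hex(Stream stream)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Cheap size check first, then the full content hash.
    public static bool SameContent(Stream gen1, long gen1Length, Stream gen2, long gen2Length)
    {
        if (gen1Length != gen2Length)
        {
            return false;
        }
        return ComputeSha256Hex(gen1) == ComputeSha256Hex(gen2);
    }

    public static void Main()
    {
        // In-memory stand-ins for the Gen1 and Gen2 file streams (assumption).
        byte[] payload = Encoding.UTF8.GetBytes("sample file contents");
        using (var a = new MemoryStream(payload))
        using (var b = new MemoryStream(payload))
        {
            Console.WriteLine(SameContent(a, a.Length, b, b.Length));
        }
    }
}
```

If a stream is seekable and has been read before, resetting `Position` to 0 before hashing avoids hashing only the tail of the file.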

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Batch
An Azure service that provides cloud-scale job scheduling and compute management.

3 answers

  1. MartinJaffer-MSFT 26,061 Reputation points
    2021-03-15T17:09:34.667+00:00

    Hello @Amie Barkes and welcome to Microsoft Q&A.

    Is the primary goal to move the data, or to compare it? Is it important that you use C# code to do this?

    I am asking because Data Factory has a feature for verifying checksums after copying data. However, if you are doing checksums and comparisons before moving the data, Data Factory is not a good tool.


  2. Amie Barkes 61 Reputation points
    2021-03-16T17:38:22.43+00:00

    Having enabled the logging feature, it appears that the MD5 checksum is not created for the data read from Azure Data Lake Gen1, although the file size and last modified date are available in the log files. The documentation has the following note, which is a little vague:

    When copying binary files from, or to Azure Blob or Azure Data Lake Storage Gen2, ADF does block level MD5 checksum verification leveraging Azure Blob API and Azure Data Lake Storage Gen2 API. If ContentMD5 on files exist on Azure Blob or Azure Data Lake Storage Gen2 as data sources, ADF does file level MD5 checksum verification after reading the files as well. After copying files to Azure Blob or Azure Data Lake Storage Gen2 as data destination, ADF writes ContentMD5 to Azure Blob or Azure Data Lake Storage Gen2 which can be further consumed by downstream applications for data consistency verification.

    ADF does file size verification when copying binary files between any storage stores.
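
    For a downstream check against the `ContentMD5` value mentioned in the note, a sketch like the following may help. It assumes the stored value is available as the Base64 string of the 16-byte MD5 digest (the form used by the Content-MD5 header); the names `ContentMd5Check`, `ComputeMd5Base64`, and `MatchesStoredMd5` are hypothetical, and fetching the stored value from the storage account is not shown.

    ```csharp
    using System;
    using System.IO;
    using System.Security.Cryptography;
    using System.Text;

    public static class ContentMd5Check
    {
        // MD5 of the raw bytes, Base64-encoded like a Content-MD5 value.
        public static string ComputeMd5Base64(Stream stream)
        {
            using (var md5 = MD5.Create())
            {
                return Convert.ToBase64String(md5.ComputeHash(stream));
            }
        }

        // Returns false when no MD5 was stored, so a missing value is
        // never mistaken for a successful verification.
        public static bool MatchesStoredMd5(Stream content, string storedMd5Base64)
        {
            if (string.IsNullOrEmpty(storedMd5Base64))
            {
                return false;
            }
            return ComputeMd5Base64(content) == storedMd5Base64;
        }

        public static void Main()
        {
            // In-memory stand-in for a downloaded file (assumption).
            byte[] payload = Encoding.UTF8.GetBytes("sample file contents");
            string stored = ComputeMd5Base64(new MemoryStream(payload));

            Console.WriteLine(MatchesStoredMd5(new MemoryStream(payload), stored)); // True
            Console.WriteLine(MatchesStoredMd5(new MemoryStream(payload), null));   // False
        }
    }
    ```

    Treating a missing stored value as "not verified" rather than "equal" keeps files without a recorded MD5 from passing silently.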


  3. Amie Barkes 61 Reputation points
    2021-03-17T08:53:39.787+00:00

    Hello @MartinJaffer-MSFT
    I found a way to cause a failure by accident. The Copy Data activity has a Preserve feature that lets you keep various file permissions. Selecting all the options (ACL, Owner, Group) caused a write failure during the copy process, which was captured in the logs.

    43:58.1 Warning FileSkip 20190404/12/33 File is skipped after read 0 bytes: ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Forbidden'. Account: 'zukien1dvdaladlsg203'. FileSystem: 'ihp-picp'. Path: 'data/z-uki-en1-dv-dala-hub03/ihp-quotedata/20190404/12/33'. ErrorCode: 'AuthorizationPermissionMismatch'.

    My only remaining issue is to find out why the MD5 checksum is not created when the file is read from ADLS Gen1 storage.
