Hash calculation strategy for data type mismatches

Janice Chi 140 Reputation points
2025-06-14T13:08:58.0566667+00:00


In our current project, we are migrating data from an on-premises IBM DB2 system to Azure SQL Hyperscale, using Azure Databricks for transformation and reconciliation. This includes both batch and CDC-based pipelines.

Our project requirement is not just to ingest, but also to perform strict reconciliation between source (DB2) and target (Hyperscale). We have decided to:

1. Compute row counts and row-level hash values inside both DB2 and Hyperscale.

2. Bring these hashes and counts into Databricks for comparison and reconciliation.
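
For illustration, the comparison step we have in mind in Databricks would look roughly like this (the paths, key column, and hash column names are placeholders, not our actual schema):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hashes and counts exported from DB2 and Hyperscale (placeholder paths and columns).
src = spark.read.parquet("/mnt/recon/db2_hashes")         # columns: key_col, row_hash
tgt = spark.read.parquet("/mnt/recon/hyperscale_hashes")  # columns: key_col, row_hash

# Row-count reconciliation.
print("source rows:", src.count(), "target rows:", tgt.count())

# Row-level reconciliation: full outer join on the business key; a missing row on
# either side or a differing hash is flagged as a mismatch.
mismatches = (
    src.alias("s")
    .join(tgt.alias("t"), on="key_col", how="full_outer")
    .where(
        F.col("s.row_hash").isNull()
        | F.col("t.row_hash").isNull()
        | (F.col("s.row_hash") != F.col("t.row_hash"))
    )
)
print("mismatched rows:", mismatches.count())
```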

Questions:

1. Is this a recommended and reliable approach for reconciliation: calculating row-level hash values inside both source (DB2) and target (Hyperscale), and comparing them in Databricks?

2. For the data type transformation pairs in the table below, what is the recommended step-by-step process to ensure that hash values match despite the data type differences? Specifically:

   • How is the hash likely calculated in DB2 vs. Hyperscale for each type?
   • What adjustments or normalization should we do in Databricks before comparison?
   • Should we exclude any of these columns from the hash if lossless normalization is not possible?
     
Source Data Type    Target Data Type
timestamp           datetime2
timestamp           datetime2
integer             int
char(15)            int
varchar(16)         char(16)
varchar(50)         nvarchar(50)
char(25)            nchar(25)
varchar(30)         nvarchar(30)
varchar(60)         nvarchar(60)
decimal(12,2)       numeric(12,2)
decimal(10,0)       bigint
decimal(24,12)      numeric(24,12)
xml ccsid           xml
varchar(16)         nvarchar(16)

We would appreciate Microsoft’s guidance on whether this reconciliation method is production-grade, and how to best handle data type mismatches to ensure hash consistency.


1 answer

  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2025-06-16T00:57:47.9833333+00:00

    Hi @Janice Chi

    Yes, the approach of computing row-level hashes on both DB2 and Azure SQL Hyperscale and then reconciling them in Azure Databricks is valid and production-grade, especially for large-scale, auditable migrations.
    To ensure reliable hash comparison across data type mismatches, here is the recommended normalization strategy in Databricks:

    Source Type      Target Type             Normalization in Databricks
    timestamp        datetime2               Use date_format(col, 'yyyy-MM-dd HH:mm:ss.SSSSSS')
    timestamp        datetime2               Use date_format(col, 'yyyy-MM-dd HH:mm:ss.SSSSSS')
    integer          int                     Direct compare (cast to int if needed)
    char(15)         int                     Cast to string, then trim leading zeros
    varchar(n)       char(n) / nvarchar(n)   Trim, lower() (if case-insensitive), pad right to n if needed
    decimal(p,s)     numeric(p,s)            Use format_number(col, s)
    decimal(10,0)    bigint                  Cast to long
    xml ccsid        xml                     If structure differs, recommend excluding from hash
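
    As a rough illustration only, the normalizations above could be expressed in PySpark as follows (the column names and sample row are made up; also note that Spark's format_number inserts thousands separators, so this sketch casts decimals to a fixed scale and then to string instead):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny made-up frame standing in for a DB2 or Hyperscale extract (placeholder columns).
df = spark.createDataFrame(
    [("2024-01-31 10:15:30.123456", "000123        ", "ACME   ", "1234.5", "42")],
    ["ts_col", "code_col", "name_col", "amt_col", "id_col"],
)

normalized = df.select(
    # timestamp -> datetime2: render both sides with a fixed 6-digit fraction.
    F.date_format(F.to_timestamp("ts_col"), "yyyy-MM-dd HH:mm:ss.SSSSSS").alias("ts_norm"),
    # char(15) -> int: trim the CHAR padding and strip leading zeros so '000123' matches 123.
    F.regexp_replace(F.trim("code_col"), r"^0+(?=\d)", "").alias("code_norm"),
    # varchar/char -> nvarchar/nchar: trim trailing padding; add lower() only if the
    # target collation is case-insensitive.
    F.trim("name_col").alias("name_norm"),
    # decimal(p,s) -> numeric(p,s): cast to a fixed scale, then to string, so that
    # '1234.5' and '1234.50' produce the same text and no grouping separators appear.
    F.col("amt_col").cast("decimal(12,2)").cast("string").alias("amt_norm"),
    # decimal(10,0) -> bigint: cast to long on both sides, then to string for concatenation.
    F.col("id_col").cast("long").cast("string").alias("id_norm"),
)

normalized.show(truncate=False)
```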

    Hashing Tips:

    • Use the same algorithm across systems (e.g., MD5 or SHA-256).
    • Normalize all string data (trim, case, padding) before hashing.
    • Concatenate multi-column data with a delimiter like '||' in consistent order.
    • Exclude columns where lossless normalization isn’t feasible (e.g., complex XML).
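
    Continuing the normalization sketch above, here is a minimal example of assembling the row hash (the column list, the '||' delimiter, and the NULL placeholder are assumptions that must be mirrored exactly wherever else the hash is computed):

```python
from pyspark.sql import functions as F

# 'normalized' is the DataFrame from the sketch above, with one string column per field.
hash_cols = ["ts_norm", "code_norm", "name_norm", "amt_norm", "id_norm"]

hashed = normalized.withColumn(
    "row_hash",
    F.sha2(
        # Fixed column order, '||' delimiter, and an explicit NULL placeholder,
        # because concat_ws would otherwise silently skip NULL values.
        F.concat_ws("||", *[F.coalesce(F.col(c), F.lit("^NULL^")) for c in hash_cols]),
        256,
    ),
)

hashed.select("row_hash").show(truncate=False)
```

    On the Hyperscale side the same concatenated string would typically be fed to HASHBYTES('SHA2_256', ...), and on the DB2 side to whatever hash routine is available there; the input string must be byte-for-byte identical in all three places or the comparison will always report mismatches.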

    I hope this information helps. Please do let us know if you have any further queries.


    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

