Hash calculation strategy for data type mismatches

Janice Chi 140 Reputation points
2025-06-14T13:08:58.0566667+00:00


In our current project, we are migrating data from an on-premises IBM DB2 system to Azure SQL Hyperscale, using Azure Databricks for transformation and reconciliation. This includes both batch and CDC-based pipelines.

Our project requirement is not just to ingest, but also to perform strict reconciliation between source (DB2) and target (Hyperscale). We have decided to:

1. Compute row counts and row-level hash values inside both DB2 and Hyperscale.

2. Bring these hashes and counts into Databricks for comparison and reconciliation.
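
For illustration, the comparison step we have in mind in Databricks would look roughly like this (the paths, key column, and hash column names are placeholders, not our actual schema):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hashes and counts exported from DB2 and Hyperscale (placeholder paths and columns).
src = spark.read.parquet("/mnt/recon/db2_hashes")         # columns: key_col, row_hash
tgt = spark.read.parquet("/mnt/recon/hyperscale_hashes")  # columns: key_col, row_hash

# Row-count reconciliation.
print("source rows:", src.count(), "target rows:", tgt.count())

# Row-level reconciliation: full outer join on the business key; a missing row on
# either side or a differing hash is flagged as a mismatch.
mismatches = (
    src.alias("s")
    .join(tgt.alias("t"), on="key_col", how="full_outer")
    .where(
        F.col("s.row_hash").isNull()
        | F.col("t.row_hash").isNull()
        | (F.col("s.row_hash") != F.col("t.row_hash"))
    )
)
print("mismatched rows:", mismatches.count())
```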

Questions:

1. Is this a recommended and reliable approach for reconciliation: calculating row-level hash values inside both source (DB2) and target (Hyperscale), and comparing them in Databricks?

2. For the data type transformation pairs in the table below, what is the recommended step-by-step process to ensure that hash values match despite the data type differences? Specifically:

   • How is the hash likely calculated in DB2 vs. Hyperscale for each type?
   • What adjustments or normalization should we do in Databricks before comparison?
   • Should we exclude any of these columns from the hash if lossless normalization is not possible?
     
Source Data Type    Target Data Type
timestamp           datetime2
timestamp           datetime2
integer             int
char(15)            int
varchar(16)         char(16)
varchar(50)         nvarchar(50)
char(25)            nchar(25)
varchar(30)         nvarchar(30)
varchar(60)         nvarchar(60)
decimal(12,2)       numeric(12,2)
decimal(10,0)       bigint
decimal(24,12)      numeric(24,12)
xml ccsid           xml
varchar(16)         nvarchar(16)

We would appreciate Microsoft’s guidance on whether this reconciliation method is production-grade, and how to best handle data type mismatches to ensure hash consistency.


1 answer

  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2025-06-16T00:57:47.9833333+00:00

    Hi @Janice Chi

    Yes, the approach of computing row-level hashes on both DB2 and Azure SQL Hyperscale and then reconciling them in Azure Databricks is valid and production-grade, especially for large-scale, auditable migrations.
    To ensure reliable hash comparison across data type mismatches, here is the recommended normalization strategy in Databricks:

    Source Type      Target Type             Normalization in Databricks
    timestamp        datetime2               Use date_format(col, 'yyyy-MM-dd HH:mm:ss.SSSSSS')
    timestamp        datetime2               Use date_format(col, 'yyyy-MM-dd HH:mm:ss.SSSSSS')
    integer          int                     Direct compare (cast to int if needed)
    char(15)         int                     Cast to string, then trim leading zeros
    varchar(n)       char(n) / nvarchar(n)   Trim, lower() (if case-insensitive), pad right to n if needed
    decimal(p,s)     numeric(p,s)            Use format_number(col, s)
    decimal(10,0)    bigint                  Cast to long
    xml ccsid        xml                     If structure differs, recommend excluding from hash
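
    As a rough illustration only, the normalizations above could be expressed in PySpark as follows (the column names and sample row are made up; also note that Spark's format_number inserts thousands separators, so this sketch casts decimals to a fixed scale and then to string instead):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny made-up frame standing in for a DB2 or Hyperscale extract (placeholder columns).
df = spark.createDataFrame(
    [("2024-01-31 10:15:30.123456", "000123        ", "ACME   ", "1234.5", "42")],
    ["ts_col", "code_col", "name_col", "amt_col", "id_col"],
)

normalized = df.select(
    # timestamp -> datetime2: render both sides with a fixed 6-digit fraction.
    F.date_format(F.to_timestamp("ts_col"), "yyyy-MM-dd HH:mm:ss.SSSSSS").alias("ts_norm"),
    # char(15) -> int: trim the CHAR padding and strip leading zeros so '000123' matches 123.
    F.regexp_replace(F.trim("code_col"), r"^0+(?=\d)", "").alias("code_norm"),
    # varchar/char -> nvarchar/nchar: trim trailing padding; add lower() only if the
    # target collation is case-insensitive.
    F.trim("name_col").alias("name_norm"),
    # decimal(p,s) -> numeric(p,s): cast to a fixed scale, then to string, so that
    # '1234.5' and '1234.50' produce the same text and no grouping separators appear.
    F.col("amt_col").cast("decimal(12,2)").cast("string").alias("amt_norm"),
    # decimal(10,0) -> bigint: cast to long on both sides, then to string for concatenation.
    F.col("id_col").cast("long").cast("string").alias("id_norm"),
)

normalized.show(truncate=False)
```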

    Hashing Tips:

    • Use the same algorithm across systems (e.g., MD5 or SHA-256).
    • Normalize all string data (trim, case, padding) before hashing.
    • Concatenate multi-column data with a delimiter like '||' in consistent order.
    • Exclude columns where lossless normalization isn’t feasible (e.g., complex XML).
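
    Continuing the normalization sketch above, here is a minimal example of assembling the row hash (the column list, the '||' delimiter, and the NULL placeholder are assumptions that must be mirrored exactly wherever else the hash is computed):

```python
from pyspark.sql import functions as F

# 'normalized' is the DataFrame from the sketch above, with one string column per field.
hash_cols = ["ts_norm", "code_norm", "name_norm", "amt_norm", "id_norm"]

hashed = normalized.withColumn(
    "row_hash",
    F.sha2(
        # Fixed column order, '||' delimiter, and an explicit NULL placeholder,
        # because concat_ws would otherwise silently skip NULL values.
        F.concat_ws("||", *[F.coalesce(F.col(c), F.lit("^NULL^")) for c in hash_cols]),
        256,
    ),
)

hashed.select("row_hash").show(truncate=False)
```

    On the Hyperscale side the same concatenated string would typically be fed to HASHBYTES('SHA2_256', ...), and on the DB2 side to whatever hash routine is available there; the input string must be byte-for-byte identical in all three places or the comparison will always report mismatches.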

    I hope this information helps. Please do let us know if you have any further queries.


    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

