Hi @Janice Chi
Yes, the approach of computing row-level hashes on both DB2 and Azure SQL Hyperscale, then reconciling via Azure Databricks, is valid and production-grade, especially for large-scale, auditable migrations.
To ensure reliable hash comparison across data type mismatches, here’s the recommended normalization strategy in Databricks (a PySpark sketch follows the table):
| Source Type | Target Type | Normalization in Databricks |
|---|---|---|
| timestamp | datetime2 | Use `date_format(col, 'yyyy-MM-dd HH:mm:ss.SSSSSS')` |
| integer | int | Direct compare (cast to int if needed) |
| char(15) | int | Cast to string, then trim leading zeros |
| varchar(n) | char(n) / nvarchar(n) | Trim, lower() (if case-insensitive), pad right to n if needed |
| decimal(p,s) | numeric(p,s) | Use `format_number(col, s)` |
| decimal(10,0) | bigint | Cast to long |
| xml ccsid | xml | If structure differs, recommend excluding from the hash |
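As a concrete illustration, here is a minimal PySpark sketch of the normalization rules above. The table name, column names, and decimal scale are placeholders, not your actual schema, so adjust them to match your DB2 and Hyperscale extracts:

```python
from pyspark.sql import functions as F

# Hypothetical source extract loaded into Databricks; names are placeholders.
df = spark.table("db2_customer_extract")

normalized = df.select(
    # timestamp -> datetime2: one canonical string at microsecond precision
    F.date_format("ts_col", "yyyy-MM-dd HH:mm:ss.SSSSSS").alias("ts_norm"),
    # integer -> int: explicit cast so both sides render the same textual value
    F.col("qty_col").cast("int").alias("qty_norm"),
    # char(15) vs int: trim, then strip leading zeros from the character value
    F.regexp_replace(F.trim("acct_col"), r"^0+", "").alias("acct_norm"),
    # varchar(n) -> char(n)/nvarchar(n): trim and lower-case if comparison is case-insensitive
    F.lower(F.trim("name_col")).alias("name_norm"),
    # decimal(p,s) -> numeric(p,s): fixed number of decimal places; apply the same
    # function on both sides, since format_number also adds grouping separators
    F.format_number(F.col("amt_col"), 2).alias("amt_norm"),
    # decimal(10,0) -> bigint: cast to long
    F.col("id_col").cast("long").alias("id_norm"),
)
```

Apply the same set of expressions to the Hyperscale extract so both sides hash identical canonical strings.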
Hashing Tips:
- Use the same algorithm across systems (e.g., MD5 or SHA-256).
- Normalize all string data (trim, case, padding) before hashing.
- Concatenate multi-column data with a delimiter like `'||'` in a consistent column order (see the sketch after this list).
- Exclude columns where lossless normalization isn’t feasible (e.g., complex XML).
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.