Calculation of hash values

Janice Chi 140 Reputation points
2025-05-27T07:17:10.1066667+00:00

In my project I need to divide large tables (up to 17 TB each) into many partitions, migrate them from source to target, and then reconcile the two sides.

Q1. What is the best strategy to calculate hashes, partition level or row level? What is the difference between these two?

Q2. If we calculate the hash at partition level, does it create a hash for every row inside the partition and then concatenate them to give the hash of the partition?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. Vinodh247 34,661 Reputation points MVP Volunteer Moderator
    2025-05-27T07:40:45.45+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    Q1: What is the best strategy to calculate hashes, partition level or row level? What is the difference?

    Row-Level Hashing:

    • Definition: Generate a hash (e.g., MD5 or SHA-256) for each individual row based on key columns or the entire row.
    • Use Case: Row-by-row comparison during reconciliation or change detection (a minimal sketch follows this list).
    • Pros:
      • Granular verification
      • Precise identification of mismatches
    • Cons:
      • Expensive to compute for very large datasets
      • Large number of hash values to compare
      • Not efficient for bulk verification
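
    For illustration, a minimal row-level hash in Spark SQL on Databricks could look like the sketch below; source_table, primary_key, and col1..col3 are placeholder names, and NULLs are coalesced so the hash input stays deterministic:

    ```sql
    -- One SHA-256 hash per row over the compared columns (Spark SQL).
    -- Cast to STRING and coalesce NULLs so every row hashes deterministically.
    SELECT primary_key,
           sha2(concat_ws('||',
                  coalesce(cast(col1 AS STRING), ''),
                  coalesce(cast(col2 AS STRING), ''),
                  coalesce(cast(col3 AS STRING), '')),
                256) AS row_hash
    FROM source_table;
    ```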

    Partition-Level Hashing:

    • Definition: Generate a single hash per partition by combining the hashes of all rows in that partition.
    • Use Case: High-level validation or bulk verification of partitions during migration (concrete examples are shown under Q2 below).
    • Pros:
      • Efficient and scalable
      • Fewer comparisons (one per partition)
    • Cons:
      • Cannot identify which row differs if the hashes do not match
      • Relies on deterministic row ordering

    Best Strategy:

    Use a hybrid approach (a sketch of the comparison queries follows):

    • First: Use partition-level hashing to quickly verify whether a partition is consistent.
    • Then: If a mismatch is detected, fall back to row-level hash comparison within the mismatched partition.

    This is a tiered strategy: fast to validate, detailed when required.
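
    As an illustrative sketch, assume each side has published its hashes into tables named source_partition_hashes / target_partition_hashes (one row per partition) and source_row_hashes / target_row_hashes (one row per key); all of these names are hypothetical. The two tiers then reduce to plain joins:

    ```sql
    -- Tier 1: flag partitions whose hashes differ between source and target.
    SELECT s.partition_id
    FROM source_partition_hashes s
    JOIN target_partition_hashes t ON s.partition_id = t.partition_id
    WHERE s.partition_hash <> t.partition_hash;

    -- Tier 2: inside a mismatched partition, find the exact rows that differ
    -- (missing on one side, or present on both sides with different hashes).
    SELECT coalesce(s.primary_key, t.primary_key) AS primary_key
    FROM source_row_hashes s
    FULL OUTER JOIN target_row_hashes t ON s.primary_key = t.primary_key
    WHERE s.row_hash IS NULL OR t.row_hash IS NULL OR s.row_hash <> t.row_hash;
    ```

    In practice, tier 2 would be filtered to just the partitions flagged by tier 1, so the expensive row-level comparison only runs where it is needed.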


    Q2: If we calculate the hash at partition level, does it create a hash for every row inside the partition and then concatenate them to give the hash of the partition?

    Yes, typically. There are two common ways partition-level hashes are computed:

    1. Row Hash Aggregation Approach (recommended):

    • Compute a hash for each row.
    • Sort the rows by primary key or another deterministic order.
    • Concatenate the row hashes (or use a rolling hash) to create a single hash for the partition.
    • Example (T-SQL; SQL Server has no HASH_AGG function, so the deterministic ordering and concatenation are done with STRING_AGG ... WITHIN GROUP, and '||' separators guard against concatenation collisions):

    ```sql
    SELECT HASHBYTES('SHA2_256', STRING_AGG(
             CAST(CONVERT(CHAR(64), HASHBYTES('SHA2_256', CONCAT(col1, '||', col2, '||', col3)), 2) AS VARCHAR(MAX)),
             '') WITHIN GROUP (ORDER BY primary_key)) AS partition_hash
    FROM partition_data;
    ```
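
    Since this thread is tagged Azure Databricks, here is a roughly equivalent Spark SQL sketch (sha2, collect_list, sort_array, and concat_ws are Spark built-ins; table and column names are placeholders):

    ```sql
    -- Hash each row, sort the row hashes by primary key, then hash the
    -- concatenated result to get one deterministic hash per partition.
    SELECT sha2(concat_ws('',
             sort_array(collect_list(struct(primary_key, row_hash))).row_hash),
           256) AS partition_hash
    FROM (
      SELECT primary_key,
             sha2(concat_ws('||', col1, col2, col3), 256) AS row_hash
      FROM partition_data
    ) hashed_rows;
    ```

    Sorting inside the aggregation is what makes the result independent of how Spark happens to order the rows at read time.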

    2. Direct Aggregation Approach (risky):

    • Aggregate column values directly using group-level functions (e.g., CHECKSUM_AGG, or HASHBYTES over a concatenated string).
    • Less reliable, because the result depends on row ordering and is more prone to collisions.
    • Example (T-SQL):

    ```sql
    SELECT HASHBYTES('SHA2_256',
             STRING_AGG(CAST(CONCAT(col1, '||', col2, '||', col3) AS VARCHAR(MAX)), '||')) AS partition_hash
    FROM partition_data;
    ```

    Note: Always ensure deterministic row order before concatenation to avoid false mismatches due to row reordering.

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

