Calculation of hash values

Janice Chi 140 Reputation points
2025-05-27T07:17:10.1066667+00:00

In my project I need to divide large tables (up to 17 TB each) into many partitions, migrate them from source to target, and then reconcile the two sides.

Q1. What is the best strategy to calculate hashes, partition level or row level? What is the difference between these two?

Q2. If we calculate the hash at partition level, does it create a hash for every row inside the partition and then concatenate them to give the hash of the partition?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. Vinodh247 34,661 Reputation points MVP Volunteer Moderator
    2025-05-27T07:40:45.45+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    Q1: What is the best strategy to calculate hashes, partition level or row level? What is the difference?

    Row-Level Hashing:

    • Definition: Generate a hash (e.g., MD5 or SHA-256) for each individual row based on key columns or the entire row.
    • Use Case: Row-by-row comparison during reconciliation or change detection (a minimal sketch follows this list).
    • Pros:
      • Granular verification
      • Precise identification of mismatches
    • Cons:
      • Expensive to compute for very large datasets
      • Large number of hash values to compare
      • Not efficient for bulk verification
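
    For illustration, a minimal row-level hash in Spark SQL on Databricks could look like the sketch below; source_table, primary_key, and col1..col3 are placeholder names, and NULLs are coalesced so the hash input stays deterministic:

    ```sql
    -- One SHA-256 hash per row over the compared columns (Spark SQL).
    -- Cast to STRING and coalesce NULLs so every row hashes deterministically.
    SELECT primary_key,
           sha2(concat_ws('||',
                  coalesce(cast(col1 AS STRING), ''),
                  coalesce(cast(col2 AS STRING), ''),
                  coalesce(cast(col3 AS STRING), '')),
                256) AS row_hash
    FROM source_table;
    ```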

    Partition-Level Hashing:

    • Definition: Generate a single hash per partition by combining the hashes of all rows in that partition.
    • Use Case: High-level validation or bulk verification of partitions during migration (concrete examples are shown under Q2 below).
    • Pros:
      • Efficient and scalable
      • Fewer comparisons (one per partition)
    • Cons:
      • Cannot identify which row differs if the hashes do not match
      • Relies on deterministic row ordering

    Best Strategy:

    Use a hybrid approach (a sketch of the comparison queries follows):

    • First: Use partition-level hashing to quickly verify whether a partition is consistent.
    • Then: If a mismatch is detected, fall back to row-level hash comparison within the mismatched partition.

    This is a tiered strategy: fast to validate, detailed when required.
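
    As an illustrative sketch, assume each side has published its hashes into tables named source_partition_hashes / target_partition_hashes (one row per partition) and source_row_hashes / target_row_hashes (one row per key); all of these names are hypothetical. The two tiers then reduce to plain joins:

    ```sql
    -- Tier 1: flag partitions whose hashes differ between source and target.
    SELECT s.partition_id
    FROM source_partition_hashes s
    JOIN target_partition_hashes t ON s.partition_id = t.partition_id
    WHERE s.partition_hash <> t.partition_hash;

    -- Tier 2: inside a mismatched partition, find the exact rows that differ
    -- (missing on one side, or present on both sides with different hashes).
    SELECT coalesce(s.primary_key, t.primary_key) AS primary_key
    FROM source_row_hashes s
    FULL OUTER JOIN target_row_hashes t ON s.primary_key = t.primary_key
    WHERE s.row_hash IS NULL OR t.row_hash IS NULL OR s.row_hash <> t.row_hash;
    ```

    In practice, tier 2 would be filtered to just the partitions flagged by tier 1, so the expensive row-level comparison only runs where it is needed.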


    Q2: If we calculate the hash at partition level, does it create a hash for every row inside the partition and then concatenate them to give the hash of the partition?

    Yes, typically. There are two common ways partition-level hashes are computed:

    1. Row Hash Aggregation Approach (recommended):

    • Compute a hash for each row.
    • Sort the rows by primary key or another deterministic order.
    • Concatenate the row hashes (or use a rolling hash) to create a single hash for the partition.
    • Example (T-SQL; SQL Server has no HASH_AGG function, so the deterministic ordering and concatenation are done with STRING_AGG ... WITHIN GROUP, and '||' separators guard against concatenation collisions):

    ```sql
    SELECT HASHBYTES('SHA2_256', STRING_AGG(
             CAST(CONVERT(CHAR(64), HASHBYTES('SHA2_256', CONCAT(col1, '||', col2, '||', col3)), 2) AS VARCHAR(MAX)),
             '') WITHIN GROUP (ORDER BY primary_key)) AS partition_hash
    FROM partition_data;
    ```
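
    Since this thread is tagged Azure Databricks, here is a roughly equivalent Spark SQL sketch (sha2, collect_list, sort_array, and concat_ws are Spark built-ins; table and column names are placeholders):

    ```sql
    -- Hash each row, sort the row hashes by primary key, then hash the
    -- concatenated result to get one deterministic hash per partition.
    SELECT sha2(concat_ws('',
             sort_array(collect_list(struct(primary_key, row_hash))).row_hash),
           256) AS partition_hash
    FROM (
      SELECT primary_key,
             sha2(concat_ws('||', col1, col2, col3), 256) AS row_hash
      FROM partition_data
    ) hashed_rows;
    ```

    Sorting inside the aggregation is what makes the result independent of how Spark happens to order the rows at read time.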

    2. Direct Aggregation Approach (risky):

    • Aggregate column values directly using group-level functions (e.g., CHECKSUM_AGG, or HASHBYTES over a concatenated string).
    • Less reliable, because the result depends on row ordering and is more prone to collisions.
    • Example (T-SQL):

    ```sql
    SELECT HASHBYTES('SHA2_256',
             STRING_AGG(CAST(CONCAT(col1, '||', col2, '||', col3) AS VARCHAR(MAX)), '||')) AS partition_hash
    FROM partition_data;
    ```

    Note: Always ensure deterministic row order before concatenation to avoid false mismatches due to row reordering.

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

