I want to create a hash key in ADF on all columns except one column. I have 10 different files to process from the same data flow, and I need the hash key for delta detection.

Manoj kumar 0 Reputation points
2024-07-30T04:30:37.4933333+00:00

How can I create a hash key on all columns except one column? I want to create the hash key dynamically, because I have many different files and I don't want to create multiple data flows.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. Amira Bedhiafi 29,946 Reputation points
    2024-07-30T07:18:15.3966667+00:00

    You can use a combination of Data Flow activities, derived columns, and expressions:

    1. Create a Data Flow:
      • In your ADF pipeline, add a Data Flow activity.
    2. Add Source Transformation:
      • Add your source datasets to the Data Flow. These datasets should point to the different files you need to process.
    3. Add Derived Column Transformation:
      • This transformation will be used to create the hash key. You will use an expression to concatenate all column values except the one you want to exclude.
    4. Dynamic Columns Concatenation:
      • Use the Data Flow expression language to concatenate all columns except the one you want to exclude.
      • Assuming the column you want to exclude is named ExcludeColumn, and the rest are dynamically retrieved, you can use a combination of functions to achieve this.

    Example Derived Column Expression:

    1. Create a column concatenation expression:
      
         concat(toString(byName('Column1')), toString(byName('Column2')), ..., toString(byName('ColumnN')))
      
      
      You need to generate this concatenation dynamically. Use the iif() function to exclude ExcludeColumn.
    2. Using ADF Dynamic Content:
      • In your Derived Column transformation, add a new column, say HashKey.
      • Use the following dynamic expression:
        
             sha2(256, concat(
               iif(columnExists('Column1') && 'Column1' != 'ExcludeColumn', toString(byName('Column1')), ''),
               iif(columnExists('Column2') && 'Column2' != 'ExcludeColumn', toString(byName('Column2')), ''),
               ...
               iif(columnExists('ColumnN') && 'ColumnN' != 'ExcludeColumn', toString(byName('ColumnN')), '')
             ))
        
      • sha2(256, ...) can be swapped for md5(...) or crc32(...), the other hashing functions supported in Data Flow expressions.
    3. Script for Dynamic Generation (Optional):
      • If you have a large number of columns or need a more dynamic approach, you might use ADF parameters to construct this expression. However, ADF doesn't support scripting inside expressions, so you may need to build the expression outside ADF and pass it in through a parameter.
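
    A fully dynamic alternative (a sketch, not part of the answer above; it assumes "Allow schema drift" is enabled on the source so late-arriving columns are visible to the expression) avoids listing columns at all. Build the column list with columnNames(), drop the excluded name with filter(), and hash the remaining values with byNames():

         sha2(256, byNames(filter(columnNames(), #item != 'ExcludeColumn')))

    Because nothing is hard-coded except the excluded column name (which could itself come from a Data Flow parameter), the same Data Flow should be able to process all 10 files without modification.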

  2. Chandra Boorla 10,080 Reputation points Microsoft External Staff
    2024-08-01T18:06:24.17+00:00

    Hi @Manoj kumar

    Thanks for the question and using MS Q&A platform.

    As I understand it, you have 10 different files that you need to process using a single Data Flow in Azure Data Factory (ADF). You want to create a hash key for each file that includes all columns except the last_update_date column so you can perform delta detection, while still including the last_update_date column in the output data.

    To exclude the last_update_date column from the hash key calculation while still including it in the output data, you can follow these steps which might help you:

    Step 1: Add a Source Transformation: Add a Source transformation to your Data Flow and select the dataset that contains the files you want to process.

    Step 2: Create the Hash Key: Add a Derived Column transformation, create a new column called HashKey, and use the following expression to concatenate all columns except the last_update_date column:

    sha2(256, concat(
        iif(columnExists('Column1') && 'Column1' != 'last_update_date', toString(byName('Column1')), ''),
        iif(columnExists('Column2') && 'Column2' != 'last_update_date', toString(byName('Column2')), ''),
        ...
        iif(columnExists('ColumnN') && 'ColumnN' != 'last_update_date', toString(byName('ColumnN')), '')
    ))
    

    Replace Column1, Column2, …, ColumnN with the names of the columns you want to include in the hash key calculation, excluding the last_update_date column.

    Step 3: Add Another Derived Column Transformation: Add another Derived Column transformation to your Data Flow.

    Step 4: Include the last_update_date Column: In the second Derived Column transformation, add a new column called HashKeyWithDate and use the following expression to concatenate the hash key from the previous step with the last_update_date column:

    concat(
        toString(byName('HashKey')), 
        toString(byName('last_update_date'))
    )
    

    Replace last_update_date with the name of the column that contains the date you want to include in the output data.

    Step 5: Add a Sink Transformation: Add a Sink transformation to your Data Flow and select the destination dataset where you want to write the output data.

    Step 6: Map the Columns: Map the columns from the source dataset to the appropriate columns in the destination dataset, including the HashKeyWithDate column you created in Step 4.

    This approach will allow you to exclude the last_update_date column from the hash key calculation while still including it in the output data.
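
    For the delta detection itself, a common pattern (a sketch; the stream names SourceData and ExistingData are assumptions, not from the steps above) is to land the previous load's hash keys as a second source, then use an Exists transformation set to "Doesn't exist" with a custom expression, so that only new or changed rows reach the sink:

        SourceData@HashKey == ExistingData@HashKey

    Rows whose hash key already exists in the target are filtered out; everything that survives is an insert or an update for your delta load.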

    I hope this information helps, please do let us know if you have any further queries.

