I want to create a hash key in ADF on all columns except one column. I have 10 different files to process from the same data flow, and I need the hash key for delta detection.

Manoj kumar 0 Reputation points
2024-07-30T04:30:37.4933333+00:00

How can I create a hash key on all columns except one column? I want to create the hash key dynamically, because I have many different files and I don't want to create multiple data flows.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. Amira Bedhiafi 29,946 Reputation points
    2024-07-30T07:18:15.3966667+00:00

    You can use a combination of Data Flow activities, derived columns, and expressions:

    1. Create a Data Flow:
      • In your ADF pipeline, add a Data Flow activity.
    2. Add Source Transformation:
      • Add your source datasets to the Data Flow. These datasets should point to the different files you need to process.
    3. Add Derived Column Transformation:
      • This transformation will be used to create the hash key. You will use an expression to concatenate all column values except the one you want to exclude.
    4. Dynamic Columns Concatenation:
      • Use the Data Flow expression language to concatenate all columns except the one you want to exclude.
      • Assuming the column you want to exclude is named ExcludeColumn, and the rest are dynamically retrieved, you can use a combination of functions to achieve this.

    Example Derived Column Expression:

    1. Create a column concatenation expression:
      
         concat(toString(byName('Column1')), toString(byName('Column2')), ..., toString(byName('ColumnN')))
      
      
      You need to generate this concatenation dynamically. Use the iif() function to exclude ExcludeColumn.
    2. Using ADF Dynamic Content:
      • In your Derived Column transformation, add a new column, say HashKey.
      • Use the following dynamic expression:
        
             sha2(256, concat(
               iif(columnExists('Column1') && 'Column1' != 'ExcludeColumn', toString(byName('Column1')), ''),
               iif(columnExists('Column2') && 'Column2' != 'ExcludeColumn', toString(byName('Column2')), ''),
               ...
               iif(columnExists('ColumnN') && 'ColumnN' != 'ExcludeColumn', toString(byName('ColumnN')), '')
             ))
        
      • sha2(256, ...) can be swapped for md5(...) or crc32(...), the other hashing functions supported in Data Flow expressions.
    3. Script for Dynamic Generation (Optional):
      • If you have a large number of columns or need a more dynamic approach, you might use ADF parameters to construct this expression. However, ADF doesn't support scripting inside expressions, so you may need to build the expression outside ADF and pass it in through a parameter.
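
    A fully dynamic alternative (a sketch, not part of the answer above; it assumes "Allow schema drift" is enabled on the source so late-arriving columns are visible to the expression) avoids listing columns at all. Build the column list with columnNames(), drop the excluded name with filter(), and hash the remaining values with byNames():

         sha2(256, byNames(filter(columnNames(), #item != 'ExcludeColumn')))

    Because nothing is hard-coded except the excluded column name (which could itself come from a Data Flow parameter), the same Data Flow should be able to process all 10 files without modification.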

  2. Chandra Boorla 10,080 Reputation points Microsoft External Staff
    2024-08-01T18:06:24.17+00:00

    Hi @Manoj kumar

    Thanks for the question and using MS Q&A platform.

    As I understand it, you have 10 different files that you need to process using a single Data Flow in Azure Data Factory (ADF). You want to create a hash key for each file that includes all columns except the last_update_date column so you can perform delta detection, while still including the last_update_date column in the output data.

    To exclude the last_update_date column from the hash key calculation while still including it in the output data, you can follow these steps which might help you:

    Step 1: Add a Source Transformation: Add a Source transformation to your Data Flow and select the dataset that contains the files you want to process.

    Step 2: Create the Hash Key: Add a Derived Column transformation, create a new column called HashKey, and use the following expression to concatenate all columns except the last_update_date column:

    sha2(256, concat(
        iif(columnExists('Column1') && 'Column1' != 'last_update_date', toString(byName('Column1')), ''),
        iif(columnExists('Column2') && 'Column2' != 'last_update_date', toString(byName('Column2')), ''),
        ...
        iif(columnExists('ColumnN') && 'ColumnN' != 'last_update_date', toString(byName('ColumnN')), '')
    ))
    

    Replace Column1, Column2, …, ColumnN with the names of the columns you want to include in the hash key calculation, excluding the last_update_date column.

    Step 3: Add Another Derived Column Transformation: Add another Derived Column transformation to your Data Flow.

    Step 4: Include the last_update_date Column: In the second Derived Column transformation, add a new column called HashKeyWithDate and use the following expression to concatenate the hash key from the previous step with the last_update_date column:

    concat(
        toString(byName('HashKey')), 
        toString(byName('last_update_date'))
    )
    

    Replace last_update_date with the name of the column that contains the date you want to include in the output data.

    Step 5: Add a Sink Transformation: Add a Sink transformation to your Data Flow and select the destination dataset where you want to write the output data.

    Step 6: Map the Columns: Map the columns from the source dataset to the appropriate columns in the destination dataset, including the HashKeyWithDate column you created in Step 4.

    This approach will allow you to exclude the last_update_date column from the hash key calculation while still including it in the output data.
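
    For the delta detection itself, a common pattern (a sketch; the stream names SourceData and ExistingData are assumptions, not from the steps above) is to land the previous load's hash keys as a second source, then use an Exists transformation set to "Doesn't exist" with a custom expression, so that only new or changed rows reach the sink:

        SourceData@HashKey == ExistingData@HashKey

    Rows whose hash key already exists in the target are filtered out; everything that survives is an insert or an update for your delta load.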

    I hope this information helps, please do let us know if you have any further queries.

