SHA256 produces hashes for columns that don't exist

Jeff vG 86 Reputation points
2021-06-15T04:19:21.363+00:00

In both cases I am deriving a column like this, within a dataflow:

sha2(256, byNames($hashcols))

I define hashcols as the parameter

hashcols = ['FirstName','Date of Birth','Ingest_TS']

The Derive Column step in my dataflow produces hashes, as expected. Whether those are the actual SHA256 hashes seems in question.

That's because, If I change that param to be

hashcols = ['FirstName','Date of Birth','Ingest_Date']

I still get hashed output, but different hashes - even though "Ingest_Date" doesn't exist as a column.

The same problem obtains if I hash the array directly:

sha2(256, byNames(['FirstName','Date of Birth','Ingest_Date']))

How can I have any faith that the actual columns are being used to generate the output hash, and not some string or array of strings? The hashes vary across different inputs, so it doesn't seem like the strings are being used as constants each time - but still the problem remains that a non-existent column is used and producing a hash in some way, and that output varies across inputs. How would I paste the 3 values referenced by $hashcols into something like https://passwordsgenerator.net/sha256-hash-generator/ to verify the correctness of the output?

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
10,587 questions
{count} votes

2 answers

Sort by: Most helpful
  1. Kiran-MSFT 691 Reputation points Microsoft Employee
    2021-06-17T04:14:08.967+00:00

    byNames will return null if a column does not exist. nulls do count towards hash computations and are not ignored. It is not a concatenated string representation of all column values but more of a byte representation of those values.

    1 person found this answer helpful.

  2. MartinJaffer-MSFT 26,081 Reputation points
    2021-06-23T00:07:03.873+00:00

    The original problem was that it was trying to work with bad column names, right @Jeff vG ?

    How about this expression to scrub out all column names which do not exist in a stream?

    mapIf($badnames, in(columnNames('source1'),#item), #item)  
    
    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.