Azure Data Factory - Data Wrangling with Data Flow - Array bug.

Okker Botes 31 Reputation points
2022-05-23T11:04:00.74+00:00

I have a tricky firewall log file to wrangle using Azure Data Factory. The file consists of four tab-separated columns: Date and Time, Source, IP, and Data.

The Data column consists of key-value pairs joined by equals signs, with text values delimited by double quotes. The challenge is that the Data column is inconsistent and contains any number of key-value pair combinations.

Three lines from the source file.

2022-02-13 00:59:59	Local7.Notice	192.168.40.1	date=2022-02-13 time=00:59:59 devname="NoHouse" devid="FG100ETK18006624" eventtime=1644706798637882880 tz="+0200" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="root" srcip=192.168.41.200 srcport=58492 srcintf="port1" srcintfrole="undefined" dstip=216.239.36.55 dstport=443 dstintf="wan1" dstintfrole="undefined" srccountry="Reserved" dstcountry="United States" sessionid=137088638 proto=6 action="client-rst" policyid=5 policytype="policy" poluuid="c2a960c4-ac1b-51e6-8011-6f00cb1fddf2" policyname="All LAN over WAN1" service="HTTPS" trandisp="snat" transip=196.213.203.122 transport=58492 appcat="unknown" applist="block-p2p" duration=6 sentbyte=3222 rcvdbyte=1635 sentpkt=14 rcvdpkt=8 srchwvendor="Microsoft" devtype="Computer" osname="Debian" mastersrcmac="00:15:5d:29:b4:06" srcmac="00:15:5d:29:b4:06" srcserver=0  
2022-02-13 00:59:59	Local7.Notice	192.168.40.1	date=2022-02-13 time=00:59:59 devname="NoHouse" devid="FG100ETK18006624" eventtime=1644706798657887422 tz="+0200" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="root" srcip=192.168.41.200 srcport=58496 srcintf="port1" srcintfrole="undefined" dstip=216.239.36.55 dstport=443 dstintf="wan1" dstintfrole="undefined" srccountry="Reserved" dstcountry="United States" sessionid=137088640 proto=6 action="client-rst" policyid=5 policytype="policy" poluuid="c2a960c4-ac1b-51e6-8011-6f00cb1fddf2" policyname="All LAN over WAN1" service="HTTPS" trandisp="snat" transip=196.213.203.122 transport=58496 appcat="unknown" applist="block-p2p" duration=6 sentbyte=3410 rcvdbyte=1791 sentpkt=19 rcvdpkt=11 srchwvendor="Microsoft" devtype="Computer" osname="Debian" mastersrcmac="00:15:5d:29:b4:06" srcmac="00:15:5d:29:b4:06" srcserver=0  
2022-02-13 00:59:59	Local7.Notice	192.168.40.1	date=2022-02-13 time=00:59:59 devname="NoHouse" devid="FG100ETK18006624" eventtime=1644706798670487613 tz="+0200" logid="0001000014" type="traffic" subtype="local" level="notice" vd="root" srcip=192.168.41.180 srcname="GKHYPERV01" srcport=138 srcintf="port1" srcintfrole="undefined" dstip=192.168.41.255 dstport=138 dstintf="root" dstintfrole="undefined" srccountry="Reserved" dstcountry="Reserved" sessionid=137088708 proto=17 action="deny" policyid=0 policytype="local-in-policy" service="udp/138" trandisp="noop" app="netbios forward" duration=0 sentbyte=0 rcvdbyte=0 sentpkt=0 rcvdpkt=0 appcat="unscanned" srchwvendor="Intel" osname="Windows" srcswversion="10 / 2016" mastersrcmac="a0:36:9f:9b:de:b6" srcmac="a0:36:9f:9b:de:b6" srcserver=0  
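
To see the parsing problem in isolation, here is a minimal Python sketch (purely illustrative, outside Data Factory) that breaks an abbreviated Data value from the lines above into key-value pairs; the pattern is a simpler variant of the regex used in step 3 below.

    import re

    # Abbreviated Data value taken from the sample lines above.
    data = 'devname="NoHouse" type="traffic" srcip=192.168.41.200 policyname="All LAN over WAN1" proto=6'

    # Split on whitespace only when an even number of double quotes lies
    # ahead, i.e. never on a space inside a quoted value.
    tokens = re.split(r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)', data)

    # Each token is key=value; strip the quotes from quoted values.
    pairs = {k: v.strip('"') for k, v in (t.split("=", 1) for t in tokens)}
    print(pairs)
    # {'devname': 'NoHouse', 'type': 'traffic', 'srcip': '192.168.41.200',
    #  'policyname': 'All LAN over WAN1', 'proto': '6'}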

My strategy for wrangling this data set is as follows.

  1. Source the data file from Azure Data Lake using a tab-delimited CSV dataset. This successfully delivers the source data in four columns to my data flow.
  2. Add a surrogate key transformation to add an incrementing key value to each row of data.
  3. Add a derived column, with the following function.
    regexSplit(Column_4, '\s(?=(?:[^"]*(["])[^"]*\1)*[^"]*$)')
    This splits the data on spaces, ignoring the spaces inside double-quoted values.
  4. Then the unfold creates a new record for each item in the array while preserving the other column values.
    unfold(SplitBySpace)
  5. Then split each key-value pair into its key and value on the equals-sign delimiter.
  6. The final step would then be to unpivot the data back into columns, with the respective values grouped by the surrogate key added in step 2 (a rough pandas sketch of steps 4 to 6 follows this list).
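
To make steps 4 to 6 concrete, here is a rough pandas sketch; pandas' explode plays the role of unfold (and of the flatten transformation that comes up later in this thread). The frame and column names are illustrative, not the actual Data Flow names.

    import pandas as pd

    # One row per log line: the surrogate key from step 2 and the array
    # produced by the regexSplit in step 3 (values abbreviated).
    df = pd.DataFrame({
        "key": [1, 2],
        "SplitBySpace": [
            ['devname="NoHouse"', "srcport=58492", 'action="client-rst"'],
            ['devname="NoHouse"', "srcport=58496", 'action="deny"'],
        ],
    })

    rows = df.explode("SplitBySpace")  # step 4: one record per array item
    rows[["k", "v"]] = rows["SplitBySpace"].str.split("=", n=1, expand=True)  # step 5
    wide = rows.pivot(index="key", columns="k", values="v")  # step 6: keys become columns
    print(wide)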

This all sounds good, but unfortunately step 5 fails with the following error: “Indexing is only allowed on the array and map types”.

The output after step 4.
204594-outputstep4.jpg

The unfold function returns an array according to the Inspect tab (see below). I would expect a string here!!

204610-inspect.jpg

Now in step 5, I split on “=” with the expression split(unfoldSplitBySpace, '='), but this errors in the expression builder with the message “Split expect string type of argument”.

Changing the expression to split(unfoldSplitBySpace1, '=') removes the error from the expression builder.

BUT THEN the Spark execution engine errors with “Indexing is only allowed on the array and map types”.

204665-error.jpg

The problem:
According to the Azure Data Factory UI, the output of the unfold() function is an array type, but when accessing the array elements, or passing them to any other function, the Spark engine does not recognise the object as an array type.

Is this a bug in the execution, or is there a problem in my understanding of how Data Factory and the Spark engine handle arrays?


Accepted answer
  1. MartinJaffer-MSFT 26,236 Reputation points
    2022-05-25T16:42:26.067+00:00

    So right after the unfolding step and before the SplitByEquals, I put a derived column. In this I tried:

    toString( unfoldIntoRows[0] )  
    

    This gave the index error. Next I tried:

    toString( unfoldIntoRows )  
    

    This previewed fine and showed as strings.

    205449-image.png

    Now I expect things to behave in the split-by-equals step, now that I have a transform dedicated to ensuring the type, whatever craziness is happening.

    205450-image.png
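
    Loosely speaking (this is a Python analogy, not Data Factory code), the work-around is the usual defensive-cast pattern: coerce the value to a string before splitting it, so the split never sees an unexpectedly typed input.

        def split_kv(item):
            # Explicit cast first, mirroring the toString(...) derived column.
            key, _, value = str(item).partition("=")
            return key, value.strip('"')

        print(split_kv('devname="NoHouse"'))  # ('devname', 'NoHouse')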

    So, my conclusion is that you found a bug or glitch. I think the type was falsely reported somewhere in the process, but for now explicitly casting to string seems to be a workaround. I'll bring this to the attention of @MarkKromer-MSFT.

    I have attached my script so you may inspect it, @Okker Botes.

    205582-script-for-type-bug.txt

    Let me know if this was the only blocker, or if I should try the unpivot. BTW, are you sure you don't mean regular pivot?


3 additional answers

  1. Okker Botes 31 Reputation points
    2022-05-26T11:45:06.763+00:00

    Wow, what fantastic support!!!

    I think your regex (the dark art of strings) is much simpler and more functional than mine. To be honest I did not know how mine worked.

    “I'm thinking it is either a bug somewhere, or a strange rule that isn't evident even to me.”
    I concur, expecting a fix soon.

    I tried the Stringify component without success, so your toString( unfoldIntoRows ) is brilliantly simple. I only realised yesterday that I could write code in the Derived Column component. Planning to use this lots in the future.

    I solved my problem with a different approach: I replaced the unfold with a flatten component, which achieved the same outcome. But to be honest, I'm not exactly sure how this works. :)

    205849-flatten1.jpg

    205863-flaten2.jpg

    I think your solution is cleaner. What's your opinion? I'm wondering which one will be more performant. My biggest log file to process is currently 2.46 GiB, so performance is going to be a big factor. (At present, when processing the large file, the debug runtime errors with “Failed to execute dataflow with internal server error”, but I will log a new issue for this.)

    “Are you sure you don't mean regular pivot?” Yes, I mean pivot. Thanks for the offer, but I was successful with this and the sink step. :)

    Thanks again for the amazing support.
    Okker


  2. Okker Botes 31 Reputation points
    2022-05-30T09:21:30.83+00:00

    This is some of the best support I have ever had on any software product, thank you for your contribution MartinJaffer-MSFT!!!!

    Thanks for your response and for clarifying the difference between Unfold and Flatten.

    I have run some tests on a relatively large data set and found the processing time for both to be comparable. As you suspected, the pivot is much more compute-intensive.


  3. Okker Botes 31 Reputation points
    2022-05-25T08:39:05.717+00:00

    Thanks for taking the time to reply, Martin, it is much appreciated.

    The ask is to split Column_4 (the data) into columns, with the keys as the headings and the values in the rows. The sink will be Azure SQL Server.

    I have attached a smaller sample file.
    205432-firewall-logs-20220524-113003-2022-05-24t081338-so.txt

