Data preview not matching with output file

Anonymous
2022-04-05T16:58:40.113+00:00

Hello,

I am relatively new to using Azure, currently working on a data migration project. Part of that is to extract separate files (mainly data files and header files), and merge them in ADF.

I was able to create a data flow that merged the header files to the data files and sort them so that the header showed as the first row. It had been working fine for a while - the output showed the header at the top and then the data. Recently though, the header no longer shows up as the first row but is randomly inserted in the data.

When I look at the data preview the data seems to show up correctly (see 1st screenshot), but in the output is where I see the differences (see 2nd screenshot - cut off due to sensitive data).

1st screenshot
190225-1st.png

2nd screenshot
190247-2nd.png

The weird thing is that there are other output files that are showing up correctly (use the same data flow - see screenshot 3):

3rd screenshot
190253-4th.png

I have checked that all the configurations are identical to the ones that work but it just doesn't make sense. At this point, I'm reaching out to see if anyone has any suggestions on how to fix this.

4th is a screenshot of the data flow used to merge the header files with the data files and sort the header rows:
190236-3rd.png

Another thing to note: these are not the final (FINAL) outputs, just the first transformation. The actual final output files show up correctly (header at top)...which is very bizarre.

Thanks.

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
9,604 questions
0 comments No comments
{count} votes

2 answers

Sort by: Most helpful
  1. MartinJaffer-MSFT 26,031 Reputation points
    2022-04-06T22:07:06.16+00:00

    Hello @Anonymous ,
    Thanks for the question and using MS Q&A platform.

    As we understand the ask here is to preserve row order during union transformation .

    The union transformation uses a SQL union. SQL unions do not guarantee row order by default. To speed things up, rows are added as soon as they are ready. However there is a way around this. Link to relevant stack overflow thread. They give each 'table' a number, then sort after the union.

    Now the question becomes how to implement this in Data Factory Data Flow.

    To add a (temporary) column to order by, we put a Derived Column transformation after each source. Lets call it orderColumn in each. For the value, lets just give a number, 1 for the header, 2 for the body. Assuming things line up during the Union , we just add a Sort transformation (ascending on orderColumn) afterwards. Then discard the orderColumn. Picture below.

    190695-image.png

    Please do let me if you have any queries.

    Thanks
    Martin

    P.S. I know the transformation names break, but I did that to make the picture easier to read.


    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
      • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators

  2. MartinJaffer-MSFT 26,031 Reputation points
    2022-04-07T20:37:22.993+00:00

    Hmm, all I can think of is perhaps a side affect of partitioning, @Anonymous . Keeping everything in 1 single partition should guarantee there is no shuffling after the sort.

    The Sort transformation has an "only sort within partition". Turn that off.
    The Sink transformation has options on how to partition. Assuming your data is small, try single partition.
    The Source transformations have partition options, but I don't think this matters, because the data and header are in separate streams.

    Can you tell me where the header row is showing up? Just checking it isn't something like, at the bottom, which means we got the sorting backwards.