Copy activity data lake merge copy behavior

Pedro Fiadeiro 216 Reputation points
2021-09-10T15:11:32.027+00:00

Hi,

I have a question about the merge copy behavior option when using a data lake as the sink in a copy activity.

Let's assume we have the following two CSV files:

File 1:
1, John
2, Sarah

File 2:
3, Bob
4, Janet

If I pick these two files as my source and merge them into a file named FinalFile.csv, I understand that the rows may not end up in sequential order. For example, the output could look like this:

1, John
3, Bob
2, Sarah
4, Janet

My understanding is that the merge behavior doesn't really guarantee that files are merged in a sequential manner and records may be mixed.

Let's now say that I'm picking File2 as my source and File1 as my sink. I'd kind of expect that the final output would be:

1, John
2, Sarah
3, Bob
4, Janet

since the sink file already has content in it and we're merging just one source file into it. However, it seems that only the rows from File2 end up in the output and the rows that were already in File1 are gone. Basically, it seems to do an overwrite and not an append.

Question: based on the above, is there any way to guarantee the order of the records using the copy activity? I'm aware that I could potentially use a data flow or an Azure Function to achieve this, but I don't really want to go down that route unless there's no other option. One potential option (which I haven't tried yet) is to provide a file containing a list of files to the source of the copy activity and set the degree of parallelism to 1, hoping it would merge the files in the order they appear in that list. However, I'm not really keen on this option.

Thanks
Pedro
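
For reference, the untested file-list option described above might be configured roughly as follows. This is only a sketch, assuming an ADLS Gen2 store and delimited text datasets; the dataset names (`SourceCsvFolder`, `FinalFileCsv`), the activity name, and the file list path are hypothetical. It builds the copy activity definition as a Python dictionary and prints it as JSON. Note that nothing here guarantees that the files are merged in the listed order.

```python
import json


def build_copy_activity(file_list_path):
    """Build a copy activity definition that merges listed files with one parallel copy.

    Assumptions: ADLS Gen2 source and sink, delimited text format.
    Dataset and activity names are hypothetical placeholders.
    """
    return {
        "name": "MergeCsvFiles",  # hypothetical activity name
        "type": "Copy",
        "inputs": [{"referenceName": "SourceCsvFolder", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "FinalFileCsv", "type": "DatasetReference"}],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {
                    "type": "AzureBlobFSReadSettings",
                    # Text file listing the files to copy, one relative path per line.
                    "fileListPath": file_list_path,
                },
            },
            "sink": {
                "type": "DelimitedTextSink",
                "storeSettings": {
                    "type": "AzureBlobFSWriteSettings",
                    # Merge all source files into a single sink file.
                    "copyBehavior": "MergeFiles",
                },
            },
            # Degree of copy parallelism, set to 1 as proposed in the question.
            "parallelCopies": 1,
        },
    }


if __name__ == "__main__":
    # Hypothetical path to the file list inside the data lake.
    activity = build_copy_activity("container/metadata/filelist.txt")
    print(json.dumps(activity, indent=2))
```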

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Saurabh Sharma 23,791 Reputation points Microsoft Employee
    2021-09-13T23:01:51.73+00:00

    Hi @Pedro Fiadeiro ,

    Thanks for using Microsoft Q&A !!
    Unfortunately, this is not possible with the copy activity, and you may need to use a data flow or an Azure Function to sort the records before writing to the sink. Alternatively, you could try the copy activity with the parallelism setting if you know the files will be picked up in order and the records in each file are already sorted.

    Thanks
    Saurabh
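
The sort-before-write step suggested in the answer could look something like the following inside an Azure Function. This is a minimal sketch, not a definitive implementation: it assumes headerless CSV content with an integer ID as the first field (as in the example files), that the file contents have already been read into strings, and it leaves out all storage access. The function name `merge_csv_sorted` is hypothetical.

```python
def merge_csv_sorted(file_contents):
    """Merge several headerless CSV texts into one text, ordered by the first field.

    Assumes the first field of every row is an integer ID and contains no
    quoted commas. Blank lines are dropped.
    """
    lines = []
    for text in file_contents:
        # Collect every non-empty row from each input file.
        lines.extend(line for line in text.splitlines() if line.strip())

    # Sort all rows by the integer value of the first comma-separated field.
    lines.sort(key=lambda line: int(line.split(",", 1)[0]))

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    file1 = "1, John\n2, Sarah\n"
    file2 = "3, Bob\n4, Janet\n"
    # The input order does not matter; the rows are sorted by ID.
    print(merge_csv_sorted([file2, file1]), end="")
```

Because the rows are sorted after all inputs are combined, the result does not depend on the order in which the files are read. Passing the existing sink file's content as one of the inputs also gives the append-like result expected in the question.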

