ADF copying Data Flow with Sort outputs unordered records in Sink

Question

ADF copying Data Flow with Sort outputs unordered records in Sink

Khamylov, Oleksandr 0

Hello.

I am trying to build a simple "copying" Pipeline with CosmosDB as Source and Sink.
In order to have capability to copy only deltas on each pipeline run, I want to use Data Flow (with Change feed enabled).

The requirement is also to preserve events order when copying (as copied events will be processed by application change feed processor). So I've put Sort block between Source and Sink with Partition option - Single partition.

Nevertheless the data in the Sink is not in the expected order (Data preview shows expected order).

User's image

I was able to achieve the result with output events ordered as expected only when I set Batch size = 1 in Sink configuration. But the processing speed was extremely low (few records per second).

Is there a way to have ordered events in Sink without a workaround of Batch size = 1 and with reasonable throughput?

Thanks in advance.

KranthiPakala-MSFT 46,737 Reputation points Microsoft Employee Moderator

2023-04-19T21:34:55.2+00:00

@Khamylov, Oleksandr We still have not heard back from you. Just wanted to check if the below information was helpful? If it answers your query, please do click Accept Answer and Yes for "was this answer helpful", as it might be beneficial to other community members reading this thread. And, if you have any further query do let us know.

Thank you

1 answer

Your answer

KranthiPakala-MSFT 46,737 Reputation points Microsoft Employee Moderator

2023-04-19T21:34:55.2+00:00

@Khamylov, Oleksandr We still have not heard back from you. Just wanted to check if the below information was helpful? If it answers your query, please do click Accept Answer and Yes for "was this answer helpful", as it might be beneficial to other community members reading this thread. And, if you have any further query do let us know.

Thank you

Answer 1

Hi @Khamylov, Oleksandr , My understanding is that you are trying to copy data from CosmosDB to a Sink while preserving the order of events. You have added a Sort block between the Source and Sink with the Partition option set to Single partition. However, the data in the Sink is not in the expected order, even though the data preview shows the expected order. You were able to achieve the expected result only when you set Batch size = 1 in the Sink configuration, but the processing speed was extremely low. You are wondering if there is a way to have ordered events in the Sink without a workaround of Batch size = 1 and with reasonable throughput.

Based on the information you have provided; it seems that you have taken the right approach by using the Sort block to sort the incoming rows on the current data stream. However, as you have mentioned, the data preview shows the expected order, but the data in the Sink is not in the expected order. This could be because the data flow is executed on Spark clusters, which distribute data across multiple nodes and partitions. If you choose to repartition your data in a subsequent transformation, you may lose your sorting due to reshuffling of data. To maintain the sort order in your data flow, as you did, we will have to set the Single partition option in the Optimize tab on the Sort transformation and keep the Sort transformation as close to the Sink as possible. This will ensure that the data is sorted before it is written to the Sink.
In general, it is recommended increasing the Batch size in the Sink configuration to improve the processing speed. However, in contradiction, increasing the Batch size may affect the order of events in the Sink.

I'm reaching out to internal team to see if there are any other workarounds that would help preserve the order of events with improved throughput and will get back to you as soon as I hear back from them.
Thank you for your patience.

KranthiPakala-MSFT 46,737 Reputation points Microsoft Employee Moderator

2023-04-11T01:49:22.8866667+00:00

@Khamylov, Oleksandr - I have a confirmation from product team that, Dataflows does not support writing the data in the sorted order to sink, sorts are there for few scenarios where in memory transformations can make use of sorting order. As I mentioned in my previous response, sorting and single partitions takes away all the performance benefits of spark, basically sort and single partition is forcing all the data to be processed using single process.

If you have any additional feedback regarding the product behavior, please feel free to share it in IDEAS forum here and do share the link back: https://feedback.azure.com/d365community/forum/1219ec2d-6c26-ec11-b6e6-000d3a4f032c

All the feedback shared in this forum are actively monitored and reviewed by respective product owners and will take action as needed.

Please don’t forget to Accept Answer and click Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.

Share via

ADF copying Data Flow with Sort outputs unordered records in Sink

1 answer

Your answer