ADF copying Data Flow with Sort outputs unordered records in Sink

Khamylov, Oleksandr 0 Reputation points
2023-04-05T00:41:55.02+00:00

Hello.

I am trying to build a simple "copying" Pipeline with Cosmos DB as both Source and Sink.
To be able to copy only the deltas on each pipeline run, I want to use a Data Flow (with Change feed enabled).

The requirement is also to preserve event order when copying (the copied events will be processed by an application change feed processor), so I've put a Sort block between Source and Sink with the Partition option set to Single partition.

Nevertheless, the data in the Sink is not in the expected order, even though Data preview shows the expected order.

I was able to get the output events ordered as expected only when I set Batch size = 1 in the Sink configuration, but then the processing speed was extremely low (a few records per second).
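
For reference, this is roughly how I've been checking the resulting order (a minimal sketch; the endpoint, key, database, and container names are placeholders, and `sourceTs` is a made-up field name standing in for the source event timestamp that the Data Flow copies over):

```python
# Minimal order check: read the sink documents back in write order (_ts is the
# server-side write timestamp, second granularity) and verify that the source
# event order, carried in the hypothetical 'sourceTs' field, is non-decreasing.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<key>")
container = (client.get_database_client("<database>")
                   .get_container_client("<container>"))

docs = list(container.query_items(
    query="SELECT c.sourceTs FROM c ORDER BY c._ts ASC",
    enable_cross_partition_query=True,
))

source_order = [d["sourceTs"] for d in docs]
print("sink order matches source order:",
      source_order == sorted(source_order))
```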

Is there a way to get ordered events in the Sink with reasonable throughput, without the Batch size = 1 workaround?

Thanks in advance.

Azure Cosmos DB
An Azure NoSQL database service for app development.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. KranthiPakala-MSFT 46,737 Reputation points Microsoft Employee Moderator
    2023-04-06T01:30:49.3566667+00:00

    Hi @Khamylov, Oleksandr, my understanding is that you are trying to copy data from Cosmos DB to a Sink while preserving the order of events. You have added a Sort block between the Source and Sink with the Partition option set to Single partition; however, the data in the Sink is not in the expected order, even though Data preview shows the expected order. You were able to achieve the expected result only by setting Batch size = 1 in the Sink configuration, at the cost of extremely low processing speed, and you are asking whether there is a way to have ordered events in the Sink with reasonable throughput.

    Based on the information you have provided, it seems that you have taken the right approach by using the Sort transformation to sort the incoming rows on the current data stream. However, as you mentioned, Data preview shows the expected order while the data in the Sink does not. This is likely because data flows execute on Spark clusters, which distribute data across multiple nodes and partitions; if the data is repartitioned in a subsequent transformation, the sort order is lost to reshuffling. To maintain the sort order, as you did, set the Single partition option in the Optimize tab of the Sort transformation and keep the Sort transformation as close to the Sink as possible. This ensures the data is sorted just before it is written to the Sink.
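
    To illustrate the underlying behavior, here is a minimal PySpark sketch (the names are illustrative, not taken from your pipeline): a globally sorted DataFrame keeps that order only while it stays in one partition, and any repartition reshuffles the rows.

    ```python
    # Minimal sketch: sorting yields a global order; repartitioning loses it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sort-order-demo").getOrCreate()

    sorted_df = spark.range(0, 1000).orderBy("id")  # globally sorted by id
    reshuffled = sorted_df.repartition(8)           # shuffle: only per-partition order remains

    # Written out from 8 partitions in parallel, the output no longer
    # reflects the global sort order.
    print(reshuffled.rdd.getNumPartitions())        # 8
    ```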
    In general, increasing the Batch size in the Sink configuration is recommended to improve processing speed. However, as you have observed, a larger Batch size may affect the order of events in the Sink.
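
    To picture why a larger Batch size can interact with ordering, here is a toy sketch of parallel batch writes (pure illustration under the assumption that batches are flushed concurrently; this is not ADF's internal implementation): batches submitted in order complete whenever their worker happens to finish.

    ```python
    # Toy illustration: batches submitted in order can complete out of order.
    import random
    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def write_batch(batch_id):
        time.sleep(random.uniform(0.01, 0.05))  # simulated variable write latency
        return batch_id

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(write_batch, i) for i in range(10)]
        completion_order = [f.result() for f in as_completed(futures)]

    print("submission order:", list(range(10)))
    print("completion order:", completion_order)  # typically shuffled
    ```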

    I'm reaching out to the internal team to see if there are any other workarounds that would help preserve the order of events with improved throughput, and I will get back to you as soon as I hear back from them.
    Thank you for your patience.

