Data Flow with ADLS source missed some partitions

Robinson, Andrew 0 Reputation points

We're using Data Factory to synchronise data from Synapse Link Data Lake with Azure SQL database. When the process runs, a mapping Data Flow is called for each table. This filters rows based on the maximum SinkModifiedOn value from the previous run.

Database users informed us that rows on one of the tables were out of date - we traced this back to a particular run of the Data Flow. I ran this in debug mode to try and simulate what it did:

In debug mode, the data preview returned 172 rows. When it ran for real, it picked up only 139 rows. Drilling down into the run statistics I can see that yearly partitions 2017 to 2023 account for these 139 rows. So it appears that partitions 2009 to 2016 have not been picked up although there are changes in these partitions. No errors were reported by the Data Flow or the ADF Pipeline that triggers it.
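For anyone wanting to make the same comparison, here is a minimal Python sketch of diffing per-partition row counts between two runs. The function name `missing_partitions` and the counts are hypothetical and purely illustrative; they are not the figures from the actual runs, and this is not part of the Data Flow itself.

```python
def missing_partitions(expected: dict[str, int], actual: dict[str, int]) -> dict[str, int]:
    """Return each partition whose actual row count falls short of the
    expected count, mapped to the size of the shortfall."""
    shortfall = {}
    for partition, expected_rows in expected.items():
        diff = expected_rows - actual.get(partition, 0)
        if diff > 0:
            shortfall[partition] = diff
    return shortfall


# Illustrative counts only (assumption), keyed by yearly partition.
debug_counts = {"2009": 5, "2016": 2, "2017": 10, "2023": 40}
run_counts = {"2017": 10, "2023": 40}

print(missing_partitions(debug_counts, run_counts))
```

With these made-up inputs the partitions absent from the real run (2009 and 2016) are the ones reported, which is the same pattern described above.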

Why would this be? I have not seen this issue on any other Data Flow runs for this table.

Bhargava-MSFT 29,266 Reputation points Microsoft Employee

2023-07-13T22:02:41.1766667+00:00

Hello Robinson, Andrew,

Welcome to the Microsoft Q&A platform.

The issue looks strange.

Did you check the dataflow monitor tab to see whether only 139 rows are copied to sink?

Can you please check if the dataflow has any other transformations like inner join, alter row, etc, to see if any transformation reduces the rows?

Also, please check if there is any filter that excludes some rows based on a condition.
Bhargava-MSFT 29,266 Reputation points Microsoft Employee

2023-07-14T17:28:30.14+00:00

Hello Robinson, Andrew,

I am checking to see if you got a chance to look into the above response. Please let me know if you have any further questions.
Robinson, Andrew 0 Reputation points

2023-07-17T08:41:07.4166667+00:00

Did you check the dataflow monitor tab to see whether only 139 rows are copied to sink?

105 rows were updated in the sink, but this is expected behaviour (see below)

Can you please check if the dataflow has any other transformations like inner join, alter row, etc, to see if any transformation reduces the rows?

We are using "Append only" in the Synapse Link, so there may be >1 row for each Id in the incoming data. We filter these out by picking the row with the latest VersionNumber for each Id. Hence the 139 incoming rows are reduced to 105.
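As a rough illustration of that de-duplication step (a Python sketch only, not the actual Data Flow transformation), the function below keeps the row with the highest VersionNumber for each Id. The function name `latest_per_id` and the sample rows are hypothetical.

```python
def latest_per_id(rows: list[dict]) -> list[dict]:
    """Keep only the row with the highest VersionNumber for each Id."""
    latest: dict = {}
    for row in rows:
        current = latest.get(row["Id"])
        if current is None or row["VersionNumber"] > current["VersionNumber"]:
            latest[row["Id"]] = row
    return list(latest.values())


# Hypothetical append-only input: Id "A" appears twice.
incoming = [
    {"Id": "A", "VersionNumber": 1},
    {"Id": "B", "VersionNumber": 7},
    {"Id": "A", "VersionNumber": 2},
]

deduplicated = latest_per_id(incoming)
print(len(incoming), "->", len(deduplicated))
```

This is why the number of rows written to the sink can legitimately be lower than the number of rows read from the source.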

Also, please check if there is any filter that excludes some rows based on a condition.

As mentioned, we store the maximum SinkModifiedOn each time a table is updated. This is used in the next run to filter out rows we have already processed.
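A minimal Python sketch of this watermark pattern follows. It assumes a strict "greater than" comparison against the stored value, which is an assumption on my part; the function names `filter_new_rows` and `next_watermark` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def filter_new_rows(rows: list[dict], last_watermark: datetime) -> list[dict]:
    """Keep rows modified after the stored watermark (strict '>' is an assumption)."""
    return [r for r in rows if r["SinkModifiedOn"] > last_watermark]


def next_watermark(rows: list[dict], last_watermark: datetime) -> datetime:
    """Return the maximum SinkModifiedOn seen, or the old watermark if no rows."""
    return max((r["SinkModifiedOn"] for r in rows), default=last_watermark)


# Hypothetical sample data.
stored = datetime(2023, 7, 1, 10, 0)
source_rows = [
    {"Id": "A", "SinkModifiedOn": datetime(2023, 7, 1, 9, 30)},   # already processed
    {"Id": "B", "SinkModifiedOn": datetime(2023, 7, 1, 10, 15)},  # new
    {"Id": "C", "SinkModifiedOn": datetime(2023, 7, 1, 10, 45)},  # new
]

new_rows = filter_new_rows(source_rows, stored)
stored = next_watermark(new_rows, stored)
print([r["Id"] for r in new_rows], stored)
```

Note that a filter like this only sees rows that the source actually presents to it, so it cannot by itself account for whole partitions being absent from a run.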

The Data Flow is working as it should (as far as I can see). The issue is the data being presented to it.
Bhargava-MSFT 29,266 Reputation points Microsoft Employee

2023-07-17T19:52:16.7+00:00

Based on the details, here are the possible root causes:

You mentioned that the maximum SinkModifiedOn is stored each time a table is updated, and that this is used in the next run to filter out rows that have already been processed. It is possible that the missing rows were already processed in a previous run and were therefore not picked up in the current run.
(or) missing rows might not have the latest version number
(or) could be an issue with the filter logic that is causing the rows to be excluded
(or) an issue with source data in the Synapse Link Data Lake for the relevant partitions (2009 to 2016). Please verify if there are indeed changes or updates present in those partitions.

If this doesn't help, I suggest filing a support case to troubleshoot the issue further. If you don't have a support plan, please let me know and I can enable a one-time free support request to work on this issue.
Robinson, Andrew 0 Reputation points

2023-07-18T08:44:48.0933333+00:00

It is possible that the missing rows were already processed in a previous run and were therefore not picked up in the current run.

(or) missing rows might not have the latest version number
(or) could be an issue with the filter logic that is causing the rows to be excluded

The Data Flow runs hourly. I re-ran it in debug mode simulating not only the run where the issue occurred, but also the 2 runs before and the 2 runs after (using exactly the same filter dates as the originals). In each case I got back exactly the number of rows I was expecting. The row counts matched with those shown in the original Data Flow runs (except for the one with the issue, obviously).

(or) an issue with source data in the Synapse Link Data Lake for the relevant partitions (2009 to 2016). Please verify if there are indeed changes or updates present in those partitions.

There were definitely changes in the 2009 partition. I could see these when I re-ran in debug mode but the partition counts indicate they were not presented to the original Data Flow run.

I will try raising a Support Case, but the reason I posted the issue on here is that I have had negative experiences doing this in the past.
Bhargava-MSFT 29,266 Reputation points Microsoft Employee

2023-07-18T18:58:32.09+00:00

Thanks for the additional details. Sorry, it's hard to pinpoint the root cause of the issue here. Support engineers can troubleshoot further by looking into the backend logs and the pipeline details. I hope they can find the root cause.
