Troubleshooting Duplicate Rows in ADF Copy Activity from SFTP Server

Question

Smaran Thoomu 24,260 Microsoft External Staff Moderator

How can we resolve the issue of receiving only 4.6 million distinct rows instead of 38 million when importing data from an SFTP server into Azure Data Factory?

PS - Based on common issues that we have seen from customers and other sources, we are posting these questions to help the Azure community.

1 answer

Answer 1

Greetings!

The reduced distinct row count when importing data from an SFTP server into Azure Data Factory (ADF) is related to how ADF handles chunking during the copy activity. The file on the SFTP server has 38 million rows, but only 4.6 million distinct rows are observed after the import, which points to rows being duplicated or lost during the chunking process.
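
Before changing the pipeline, it can help to confirm the symptom by comparing total and distinct row counts on the source file and on the copied output. The following is only a minimal sketch: `count_rows` is a hypothetical helper, and it assumes one row per line (no quoted fields containing line breaks) and UTF-8 text.

```python
import hashlib


def count_rows(lines):
    """Return (total, distinct) counts for an iterable of text lines."""
    total = 0
    seen = set()
    for line in lines:
        row = line.rstrip("\r\n")
        total += 1
        # Keep a fixed-size digest instead of the full row to limit memory use.
        seen.add(hashlib.sha1(row.encode("utf-8")).digest())
    return total, len(seen)


if __name__ == "__main__":
    # Tiny in-memory sample; for a real file, pass an open file object instead.
    sample = ["1,a\n", "2,b\n", "1,a\n"]
    total, distinct = count_rows(sample)
    print(f"total={total} distinct={distinct}")
```

If the source file itself shows 38 million distinct rows and the sink shows far fewer, the difference was introduced by the copy rather than by the data. Note that holding tens of millions of digests in memory can take several gigabytes, so for files of this size a database or Spark query may be more practical.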

To mitigate this, set "disableChunking" to true in the SFTP read settings of the copy activity source. With chunking disabled, ADF reads the file as a single chunk instead of dividing it into parts that are read in parallel, which reduces the chance of row duplication or loss. The trade-off is slower performance, which can make this approach impractical for very large datasets.
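
As a rough illustration of where the setting goes, the snippet below builds the `source` section of a copy activity and prints it as JSON. It is a sketch under assumptions: it assumes a delimited text (CSV) dataset, and the surrounding pipeline, dataset, and sink definitions are omitted. Only "disableChunking" comes from this thread; adjust the other values to match your own pipeline.

```python
import json

# Source section of a copy activity reading delimited text over SFTP.
# Assumption: the dataset is delimited text; other formats use a different
# source "type" and format settings.
copy_source = {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "SftpReadSettings",
        "recursive": False,
        # Read the file as a single chunk instead of parallel parts.
        "disableChunking": True,
    },
    "formatSettings": {"type": "DelimitedTextReadSettings"},
}

if __name__ == "__main__":
    print(json.dumps(copy_source, indent=2))
```

The printed JSON can be compared against the source block shown in the pipeline's code view in ADF Studio.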

The Azure product team has confirmed that this behavior is expected with the current SDK and plans to release an updated SDK to address the performance issues. In the interim, users are advised to keep using the "disableChunking" option and, to improve performance, either break the file into smaller segments or use Parquet files.
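
If you go the route of breaking the file into smaller segments, one possible way to do it before the copy is sketched below. `split_rows` and `split_file` are hypothetical helpers, not part of ADF, and the sketch assumes a UTF-8 delimited file with a single header line and one row per line.

```python
import itertools


def split_rows(lines, rows_per_part, has_header=True):
    """Yield lists of lines with at most rows_per_part data rows each.

    The header line, if present, is repeated at the top of every part.
    """
    it = iter(lines)
    header = list(itertools.islice(it, 1)) if has_header else []
    while True:
        chunk = list(itertools.islice(it, rows_per_part))
        if not chunk:
            break
        yield header + chunk


def split_file(src_path, dest_prefix, rows_per_part):
    """Write src_path out as dest_prefix_0000.csv, dest_prefix_0001.csv, ..."""
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        for i, part in enumerate(split_rows(src, rows_per_part)):
            out_path = f"{dest_prefix}_{i:04d}.csv"
            with open(out_path, "w", encoding="utf-8", newline="") as out:
                out.writelines(part)


if __name__ == "__main__":
    demo = ["id,name\n", "1,a\n", "2,b\n", "3,c\n"]
    for part in split_rows(demo, rows_per_part=2):
        print(part)
```

Each resulting part can then be copied with "disableChunking" set to true, so that every individual file is small enough for the single-chunk read to finish in a reasonable time.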

Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.

Please do not forget to "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.
