Troubleshooting Duplicate Rows in ADF Copy Activity from SFTP Server

Smaran Thoomu 24,260 Reputation points Microsoft External Staff Moderator
2024-07-31T15:10:15.22+00:00

How can we resolve the issue of receiving only 4.6 million distinct rows instead of 38 million when importing data from an SFTP server into Azure Data Factory?

PS - Based on common issues that we have seen from customers and other sources, we are posting these questions to help the Azure community.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Smaran Thoomu 24,260 Reputation points Microsoft External Staff Moderator
    2024-07-31T15:11:03.18+00:00

    Greetings!

    The reduced distinct row count comes from how the Azure Data Factory (ADF) copy activity reads a file from an SFTP server in chunks. In this case the source file has 38 million rows, but only 4.6 million distinct rows appear after the import, because rows can be duplicated or lost during the chunked read.
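
    To confirm the symptom before changing any settings, it can help to compare total and distinct row counts for the source file and for an export of the copied data. The following is a minimal sketch, not an ADF feature: the function name `count_rows` is hypothetical, and it assumes a text file with one row per line (no line breaks inside quoted fields). It keeps one 16-byte hash per distinct row in memory, so for tens of millions of rows it needs a machine with several gigabytes of free RAM.

```python
import hashlib


def count_rows(path, has_header=True):
    """Return (total_rows, distinct_rows) for a line-oriented text file."""
    seen = set()
    total = 0
    with open(path, "rb") as f:
        if has_header:
            f.readline()  # skip the header row so it is not counted
        for line in f:
            total += 1
            # Hash each row so the set stores 16 bytes instead of the full line.
            row = line.rstrip(b"\r\n")
            seen.add(hashlib.blake2b(row, digest_size=16).digest())
    return total, len(seen)
```

    If the source file reports matching total and distinct counts while the copied data does not, the duplication was introduced during the copy rather than being present in the source.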

    To mitigate this, set "disableChunking" to true in the SFTP read settings of the copy activity source. ADF then processes the file as a single chunk, which reduces the chance of row duplication or loss. The trade-off is slower performance, which can make this approach impractical for very large datasets.
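
    As a rough illustration of where the "disableChunking" setting lives, the sketch below builds the source section of a copy activity as a Python dictionary and prints it as JSON. Only "disableChunking" comes from the answer above. The surrounding values (a delimited text source, "recursive" set to false) are assumptions for illustration, so adjust them to match your own pipeline definition.

```python
import json

# Source section of a Copy activity reading a delimited text file over SFTP.
# Assumption: the file is delimited text; other formats use a different
# source "type" but keep the SFTP read settings under "storeSettings".
copy_source = {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "SftpReadSettings",
        "recursive": False,        # illustrative value, not from the answer
        "disableChunking": True,   # read the file as a single chunk
    },
    "formatSettings": {"type": "DelimitedTextReadSettings"},
}

# Print the fragment in the JSON shape used by a pipeline definition.
print(json.dumps({"source": copy_source}, indent=2))
```

    The printed fragment can be compared with the JSON view of your copy activity to check that the setting sits under "storeSettings" of the source.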

    The Azure product team has confirmed that this behavior is expected with the current SDK and plans to release an updated SDK to address the performance issues. In the interim, keep "disableChunking" set to true. To improve performance, consider splitting the file into smaller segments or using Parquet files.
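
    One way to split the file into smaller segments before it is placed on the SFTP server is sketched below. This is an assumption about how the splitting could be done, not something ADF provides: the function name `split_file` and the part naming scheme are hypothetical, and the code assumes a UTF-8 text file with one row per line (no line breaks inside quoted fields). The header row, if present, is repeated in every part.

```python
import itertools
from pathlib import Path


def split_file(src, out_dir, rows_per_part=1_000_000, has_header=True):
    """Split a line-oriented text file into parts of at most rows_per_part rows."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with src.open("r", encoding="utf-8", newline="") as f:
        header = f.readline() if has_header else ""
        for index in itertools.count(1):
            # Read the next batch of rows; an empty batch means end of file.
            chunk = list(itertools.islice(f, rows_per_part))
            if not chunk:
                break
            part = out_dir / f"{src.stem}_part{index:04d}{src.suffix}"
            with part.open("w", encoding="utf-8", newline="") as out:
                out.write(header)  # repeat the header in every part
                out.writelines(chunk)
            parts.append(part)
    return parts
```

    For example, `split_file("export.csv", "parts", rows_per_part=2_000_000)` would turn a 38-million-row file into 19 parts, each of which can be copied with "disableChunking" enabled at a lower cost per file. The file name and part size here are illustrative.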

    Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.

    Please do not forget to "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.
