Greetings!
The drop in distinct row counts when importing data from an SFTP server into Azure Data Factory (ADF) stems from how ADF chunks the file during the Copy activity. Although the source file on the SFTP server contains 38 million rows, only 4.6 million distinct rows appear after the import, because rows can be duplicated or lost while the file is read in parallel chunks.
To mitigate this, you can disable chunking by setting the "disableChunking" property to true in the Copy activity's SFTP source settings. With chunking disabled, ADF reads the file sequentially in a single stream instead of dividing it into parallel parts, which avoids the row duplication or loss. The trade-off is slower throughput, which can make this impractical for very large files.
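As a rough illustration, a Copy activity source with chunking disabled might look like the snippet below. The activity name, format types, and sink settings are placeholders for this sketch, not values from the original question; adjust them to match your own pipeline:

```json
{
  "name": "CopyFromSftp",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "formatSettings": {
        "type": "DelimitedTextReadSettings"
      },
      "storeSettings": {
        "type": "SftpReadSettings",
        "recursive": true,
        "disableChunking": true
      }
    },
    "sink": {
      "type": "DelimitedTextSink",
      "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
      }
    }
  }
}
```

The key line is "disableChunking": true under "SftpReadSettings"; everything else follows the standard Copy activity layout for a delimited-text source.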
The Azure product team has confirmed that this behavior is expected with the current SDK and plans to release an updated SDK that addresses the performance issue. In the interim, the recommendation is to keep the "disableChunking" option enabled, or to split the file into smaller segments or stage the data as Parquet files to recover some performance.
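If you take the Parquet route, a minimal sketch of an output dataset follows, assuming an Azure Blob Storage linked service named AzureBlobStorageLinkedService; the dataset name, container, and folder path are placeholders:

```json
{
  "name": "ParquetStagingDataset",
  "properties": {
    "type": "Parquet",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "staging",
        "folderPath": "sftp-import"
      },
      "compressionCodec": "snappy"
    }
  }
}
```

Staging into a compressed, columnar format like Parquet reduces downstream read cost, which helps offset the slower sequential read from the SFTP source.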
Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
Please remember to "up-vote" wherever the information provided helps you, as this benefits other community members.