ADF Copy Data Activity Dropping Rows with Stored Procedure Sink

81972119 6 Reputation points
2022-10-18T09:02:39.323+00:00

My source is a parquet file (created from a CSV file via data flow), my sink is a stored procedure with user defined table type.

I have fault tolerance and logging enabled on the Copy Data activity and when I run the pipeline in debug the Copy Data activity reports:

"rowsRead": 53303,  
"rowsCopied": 53303,  
"rowsSkipped": 0  

The stored procedure is a simple truncate and insert (from the UDTT) for debugging purposes. I am only seeing 4953 rows returned from the UDTT, and it appears that those records may be the last 4953 in the file.
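
For reference, a minimal sketch of this kind of sink procedure (hypothetical type, table, and column names; the real UDTT and procedure differ):

```sql
-- Hypothetical names and columns, for illustration only.
CREATE TYPE dbo.FeedRowType AS TABLE
(
    Id      INT,
    Payload NVARCHAR(200)
);
GO

CREATE OR ALTER PROCEDURE dbo.usp_LoadFeed
    @Rows dbo.FeedRowType READONLY
AS
BEGIN
    -- Truncate-and-insert for debugging: runs every time the procedure is invoked.
    TRUNCATE TABLE dbo.FeedTarget;

    INSERT INTO dbo.FeedTarget (Id, Payload)
    SELECT Id, Payload
    FROM @Rows;
END;
GO
```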

I have tried altering the data types in the UDTT but see no difference. I have added an alternate Copy Data activity with the same sink that just inserts the data into a new table; this returns the complete file.

We are using SQL procedures as the sink because the processing of the file (bulk\Net etc.) is based on other parameters.

We have multiple feeds (>200) using this process with different source systems, and the majority have no issue. So far I have identified 3 feeds with this problem but cannot find the cause.

Any help would be appreciated.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. 81972119 6 Reputation points
    2022-11-01T13:36:15.767+00:00

    Update for anyone experiencing the same issue.

    The ADF Copy Data activity dynamically sets the "Write Batch Size" setting, so where the source file is over a certain size it processes the data in batches, e.g. 75k records are processed as eight ~10k batches.

    What is not apparent when you run the pipeline is that the Copy Data activity passes the data to the SQL procedure and calls the procedure eight times. This is not reflected in the activity logs or the input/output logs for the activity; it is reported as one single run, i.e. rowsRead and rowsCopied show 75k rather than eight entries totalling 75k.

    This means that if the SQL procedure you are passing the data to has a delete or truncate step, or contains logic that derives values across the whole dataset, you will get unexpected or incorrect results.

    I have altered our pipeline to copy the data to a transient table; I then call another SQL procedure that acts as a wrapper, passing the data from that table to my original load procedures so that they can operate as designed.
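
    A rough sketch of that workaround, with hypothetical table and procedure names (the transient table is loaded by a plain table sink in the Copy Data activity, and the wrapper is then called once afterwards, e.g. from a Stored Procedure activity):

    ```sql
    -- Hypothetical names; dbo.FeedStaging is the transient table loaded by the
    -- Copy Data activity with a plain table sink.
    CREATE OR ALTER PROCEDURE dbo.usp_LoadFeedWrapper
    AS
    BEGIN
        DECLARE @Rows dbo.FeedRowType;

        -- Hand the complete staged dataset to the original load procedure in a
        -- single call, so its truncate/derivation logic sees the whole file
        -- rather than one ~10k batch.
        INSERT INTO @Rows (Id, Payload)
        SELECT Id, Payload
        FROM dbo.FeedStaging;

        EXEC dbo.usp_LoadFeed @Rows = @Rows;

        -- Reset the transient table for the next run.
        TRUNCATE TABLE dbo.FeedStaging;
    END;
    GO
    ```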

    Regards

    Nigel

    1 person found this answer helpful.

  2. HimanshuSinha 19,527 Reputation points Microsoft Employee Moderator
    2022-10-19T21:14:57.12+00:00

    Hello @NigelMcDonnell-5219,
    Thanks for the question and for using the MS Q&A platform.
    As we understand it, the ask here is why the logs say the whole table was copied, but the sink table shows fewer records; please let us know if that is not accurate.
    I think you have already tried the second option, where you copy the records to a different table and all records get copied (in that case the SP with the truncate is not in play). I think the way your proc is called is the issue: it is truncating the table, so you only have the data from the last iteration. Are you using a ForEach loop? What is the batch size set to?


    Please do let me know if you have any queries.
    Thanks
    Himanshu

