Concept Ideas -> Big Table on Source where Rows are getting deleted and Target is DWH

Jörg Lang 120 Reputation points

Hi there,

I need to have some idea related to "efficient data load".

On the source side, all tables have a primary key plus created and updated timestamps.
Therefore, we have implemented incremental load logic in each pipeline so that we only process new/updated data on each run, which makes sense in terms of performance and processed data volume.

We have now identified one table with the unchangeable fact that rows are physically deleted: there is no "deleted flag" (which would be an update), but a real "delete from ...".

The only options I see:

  • full processing of the table on each load
  • some "trigger" (it's an old Sybase DB) on the source DB to track the PKs of deleted rows and process them separately

Any other useful ideas?



Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Nandan Hegde 31,346 Reputation points MVP


    No need to do full load processing, as it would increase the amount of data transferred (ingress and egress).

    Below can be an approach:

    1. Do incremental loads for updates/inserts
    2. For deletes, follow the process below:

    Get the list of distinct business keys from the source table. If the count is below 5000 (meaning more than 5k records would never be deleted between two instances of a run), use a Lookup activity and pass its output to a Script activity, with the sink as the connection, to delete those rows.
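    As a rough sketch of what the Script activity would execute against the sink (table and column names here are illustrative, not from the original pipeline), using sqlite3 as a stand-in for the DWH connection:

    ```python
    import sqlite3

    # Stand-in for the DWH sink table; names are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dwh_target (business_key INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO dwh_target VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c"), (4, "d")])

    # Keys returned by the Lookup activity against the source
    # (row 3 has been physically deleted there).
    source_keys = [1, 2, 4]

    # The Script activity runs a DELETE whose key list comes from the Lookup output.
    placeholders = ",".join("?" * len(source_keys))
    conn.execute(
        f"DELETE FROM dwh_target WHERE business_key NOT IN ({placeholders})",
        source_keys)

    remaining = [row[0] for row in conn.execute(
        "SELECT business_key FROM dwh_target ORDER BY business_key")]
    print(remaining)  # → [1, 2, 4]
    ```

    The point of the 5k ceiling is that the whole key list has to fit in the Lookup output and be injected into one statement; beyond that, the staging-table variant below scales better.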

    If there are more than 5k records, use a Copy activity to copy the distinct keys into a staging table, then use a Stored Procedure activity that compares the records in the sink against the staging table and deletes those not present.
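    The stored procedure body boils down to an anti-join delete. A minimal sketch (again with hypothetical table names, and sqlite3 standing in for the sink database):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dwh_target (business_key INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE staging_keys (business_key INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO dwh_target VALUES (?, ?)",
                     [(i, f"row{i}") for i in range(1, 8)])

    # The Copy activity lands the distinct source keys in staging
    # (keys 3 and 6 were deleted at the source).
    conn.executemany("INSERT INTO staging_keys VALUES (?)",
                     [(k,) for k in (1, 2, 4, 5, 7)])

    # Stored procedure logic: delete sink rows whose key is absent from staging.
    conn.execute("""
        DELETE FROM dwh_target
        WHERE business_key NOT IN (SELECT business_key FROM staging_keys)
    """)

    remaining = [r[0] for r in conn.execute(
        "SELECT business_key FROM dwh_target ORDER BY business_key")]
    print(remaining)  # → [1, 2, 4, 5, 7]
    ```

    Truncating and reloading the staging table on each run keeps the compare cheap relative to a full reload of the fact data itself.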

    This ensures end-to-end data integrity within the same pipeline rather than relying on separate triggers for deletes.
