Azure Data Factory - Copy Data Task - Upsert not working as intended?

Dirk Sachse 106 Reputation points
2023-03-17T21:03:56.8166667+00:00

How is an Upsert in Azure Data Factory supposed to work?

This is the source data:

User's image

This is the sink data I get after the first run of the Copy Data Activity, with Upsert activated and the first 2 columns as keys:

User's image

I thought I would end up with just one record per key combination.

Something like this:

79540;20230316;1

79560;20230316;2

79530;20230316;3

It even gets worse if I run the pipeline a second time. I then get this:

User's image

I now have twice as many records as before and at this point, the data is useless and wrong.

This the source tab in the Copy Data Activity:

User's image

This is the query:

SELECT taxonomie_id,dim_date, screenPageViews FROM stg0_ga4.dateScreenPageViews_Page ORDER BY 2 DESC,1 DESC;

This is the sink configuration:

User's image

What am I doing wrong? Do I missunderstand the concept of the upsert functionality?

Thank you very much!

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
9,487 questions
0 comments No comments
{count} votes

Accepted answer
  1. HimanshuSinha-msft 19,376 Reputation points Microsoft Employee
    2023-03-22T23:30:24.26+00:00

    Hi @Dirk Sachse,

    Allow me to explain the UPSERT for clarity . If we already have a record with the key in the table , its will update the record else INSERT the record

    I already have a the below record in the SQL table

    Id , Name

    1 , 'Himanshu'

    Case 1

    if the income record is

    1 , 'Dirk' and Id is the key column

    then in the Table you will have an Update and in the record in table will be

    1 , 'Dirk'

    Case 2

    if the incoming record is

    2 , 'Sinha'
    

    Then the record will go as INSERT as there is NO record with the ID =2 , so the table will have 2 rows

    Id , Name

    1 , 'Himanshu'

    2 , 'Sinha'

    HTH this helps a bit :)

    Focus on your query and output , I think you can make in the transformation in the SQL query itself .

    create table dbo.Dirk

    (

    taxonomie_id int

    ,dm_date int

    ,ScreenPageViews int

    )

    INSERT INTO dbo.Dirk values (79540,20230316,1)

    INSERT INTO dbo.Dirk values (79540,20230316,1)

    INSERT INTO dbo.Dirk values (79540,20230316,1)

    INSERT INTO dbo.Dirk values (79560,20230316,1)

    INSERT INTO dbo.Dirk values (79560,20230316,1)

    INSERT INTO dbo.Dirk values (79560,20230316,2)

    INSERT INTO dbo.Dirk values (79530,20230316,1)

    INSERT INTO dbo.Dirk values (79530,20230316,3)

    WITH table_cte AS (
      select taxonomie_id  
    ,dm_date  
    ,ScreenPageViews
    ,Row_number() OVER(order BY taxonomie_id,dm_date) Row_number
    from 
    dbo.Dirk)
    
    select taxonomie_id,dm_date,ScreenPageViews from table_cte 
    where ROW_NUMBER in (
    select max(Row_number) from table_cte group by taxonomie_id  
    ,dm_date  )
    order by taxonomie_id,dm_date,ScreenPageViews 
    

    Output

    taxonomie_id dm_date ScreenPageViews


    79530 20230316 3

    79540 20230316 1

    79560 20230316 2

    Let me know if you have any queries .

    Thanks
    Himanshu

    1 person found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Suba Balaji 11,186 Reputation points
    2023-03-18T09:04:51.4166667+00:00

    Hi @Dirk Sachse

    Upsert logic works like this -

    Suppose you have a table with columns id, department, salary and the composite keys are id and department. That means, each pair of id and department will uniquely identify one row.

    Example:

    id Department Salary

    1. 111 100000

    1001 112 150000

    1002 111 100000

    In the above case, id:Deparment pair is unique. There is no duplicate. So when we are seeing new id:Department pair in input, it will be inserted. Else if the source brings a pair of id:department which is already there in sink, it will simply update other columns corresponding to the id:department.

    So, If we consider the above table as sink and the below as source,

    id,department,Salary

    1001,111,150000

    1003,111,200000

    then row 1 would be updated, row 2 would be inserted.

    For better understanding, see below picture

    WhatsApp Image 2023-03-18 at 14.22.14

    For more info, you could check this video:https://www.youtube.com/watch?v=fegEN1Z1viM

    For any other query, please let us know.

    Regards,

    Suba