Azure Data Factory - Copy Data Task - Upsert not working as intended?

Question

How is an Upsert in Azure Data Factory supposed to work?

This is the source data:

User's image

This is the sink data I get after the first run of the Copy Data Activity, with Upsert activated and the first 2 columns as keys:

User's image

I thought I would end up with just one record per key combination.

Something like this:

79540;20230316;1

79560;20230316;2

79530;20230316;3

It even gets worse if I run the pipeline a second time. I then get this:

User's image

I now have twice as many records as before and at this point, the data is useless and wrong.

This the source tab in the Copy Data Activity:

User's image

This is the query:

SELECT taxonomie_id,dim_date, screenPageViews FROM stg0_ga4.dateScreenPageViews_Page ORDER BY 2 DESC,1 DESC;

This is the sink configuration:

User's image

What am I doing wrong? Do I missunderstand the concept of the upsert functionality?

Thank you very much!

Accepted Answer

Hi @Dirk Sachse,

Allow me to explain the UPSERT for clarity . If we already have a record with the key in the table , its will update the record else INSERT the record

I already have a the below record in the SQL table

Id , Name

1 , 'Himanshu'

Case 1

if the income record is

1 , 'Dirk' and Id is the key column

then in the Table you will have an Update and in the record in table will be

1 , 'Dirk'

Case 2

if the incoming record is

2 , 'Sinha'

Then the record will go as INSERT as there is NO record with the ID =2 , so the table will have 2 rows

Id , Name

1 , 'Himanshu'

2 , 'Sinha'

HTH this helps a bit :)

Focus on your query and output , I think you can make in the transformation in the SQL query itself .

create table dbo.Dirk

(

taxonomie_id int

,dm_date int

,ScreenPageViews int

)

INSERT INTO dbo.Dirk values (79540,20230316,1)

INSERT INTO dbo.Dirk values (79560,20230316,1)

INSERT INTO dbo.Dirk values (79560,20230316,2)

INSERT INTO dbo.Dirk values (79530,20230316,1)

INSERT INTO dbo.Dirk values (79530,20230316,3)

WITH table_cte AS (
  select taxonomie_id  
,dm_date  
,ScreenPageViews
,Row_number() OVER(order BY taxonomie_id,dm_date) Row_number
from 
dbo.Dirk)

select taxonomie_id,dm_date,ScreenPageViews from table_cte 
where ROW_NUMBER in (
select max(Row_number) from table_cte group by taxonomie_id  
,dm_date  )
order by taxonomie_id,dm_date,ScreenPageViews

Output

taxonomie_id dm_date ScreenPageViews

79530 20230316 3

79540 20230316 1

79560 20230316 2

Let me know if you have any queries .

Thanks
Himanshu

Answer

Hi @Dirk Sachse

Upsert logic works like this -

Suppose you have a table with columns id, department, salary and the composite keys are id and department. That means, each pair of id and department will uniquely identify one row.

Example:

id Department Salary

111 100000

1001 112 150000

1002 111 100000

In the above case, id:Deparment pair is unique. There is no duplicate. So when we are seeing new id:Department pair in input, it will be inserted. Else if the source brings a pair of id:department which is already there in sink, it will simply update other columns corresponding to the id:department.

So, If we consider the above table as sink and the below as source,

id,department,Salary

1001,111,150000

1003,111,200000

then row 1 would be updated, row 2 would be inserted.

For better understanding, see below picture

WhatsApp Image 2023-03-18 at 14.22.14

For more info, you could check this video:https://www.youtube.com/watch?v=fegEN1Z1viM

For any other query, please let us know.

Regards,

Suba

Azure Data Factory - Copy Data Task - Upsert not working as intended?

1 additional answer