Understanding problems in example from learning path "Work with Data Warehouses using Azure Synapse Analytics" in module "Load data into a relational data warehouse"

Linus Hillemann 0 Reputation points
2024-01-03T21:30:13.1833333+00:00

Hi everybody.

Iam stumbling over the example from the learning path "Work with Data Warehouses using Azure Synapse Analytics" in module "Load data into a relational data warehouse". Under "Load dimension tables" it is described which methods exist to load dimension tables. Specifically it states the following code:

CREATE TABLE dbo.DimProductUpsert
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
)
AS
-- New or updated rows
SELECT  stg.ProductID AS ProductBusinessKey,
        stg.ProductName,
        stg.ProductCategory,
        stg.Color,
        stg.Size,
        stg.ListPrice,
        stg.Discontinued
FROM    dbo.StageProduct AS stg
UNION ALL  
-- Existing rows
SELECT  dim.ProductBusinessKey,
        dim.ProductName,
        dim.ProductCategory,
        dim.Color,
        dim.Size,
        dim.ListPrice,
        dim.Discontinued
FROM    dbo.DimProduct AS dim
WHERE NOT EXISTS
(   SELECT  *
    FROM dbo.StageProduct AS stg
    WHERE stg.ProductId = dim.ProductBusinessKey
);

RENAME OBJECT dbo.DimProduct TO DimProductArchive;
RENAME OBJECT dbo.DimProductUpsert TO DimProduct;

Iam not understanding why updated rows in the staging table are captured in the new created table dbo.DimProductUpsert because rows with same primary keys are filtered out. And an updated row can have the same primary key but a different value for an attribute which should then also be filtered out.

Any help is appreciated :)

Azure SQL Database
0 comments No comments
{count} votes

5 answers

Sort by: Most helpful
  1. Konstantinos Passadis 19,496 Reputation points MVP
    2024-01-04T20:57:38.4766667+00:00

    Hello @Labyrant!

    Thank you for your message

    Lets clarify the whole thing :

    First of all , The SQL code is an example of an "upsert" operation, which is a combination of "update" and "insert." The goal here is to update existing rows in the dbo.DimProduct table with new data from the dbo.StageProduct staging table and to insert any new rows that do not exist in dbo.DimProduct.

    Here's a breakdown of how the code works:

    Create a New Table for Upsert: The dbo.DimProductUpsert table is created as a combination of existing rows and new/updated rows from the staging table.

    New or Updated Rows: The SELECT statement from the dbo.StageProduct table gets all rows, which are considered as new or updated rows. This assumes that the staging table (dbo.StageProduct) contains the latest data.

    Union All with Existing Rows: The UNION ALL operation is used to combine the rows from the staging table with the rows from the existing dimension table (dbo.DimProduct).

    Filter Out Existing Unchanged Rows: The WHERE NOT EXISTS clause filters out rows from dbo.DimProduct that have a matching ProductID in the staging table. This is meant to exclude rows that are unchanged (since it's assumed that the staging table contains the latest data for updates).

    The confusion might arise from the assumption that the staging table contains only new or updated rows. However, this approach assumes that the staging table may contain all rows with the latest data, and the operation aims to:

    • Update existing rows in the dimension table with the data from the staging table.
    • Insert new rows from the staging table that don't exist in the dimension table.
    • Keep unchanged rows from the dimension table.

    This results in dbo.DimProductUpsert having the latest data for all products.

    Finally, the RENAME statements are used to archive the old dimension table by renaming it to DimProductArchive and then renaming the DimProductUpsert to DimProduct, effectively replacing the old dimension table with the new upserted data.

    If an existing row in the dimension table is updated in the staging table (same ProductID, different other values), the UNION ALL operation followed by the WHERE NOT EXISTS filter ensures that the updated row from the staging table is included in the DimProductUpsert, and the old version of the row is not included from the DimProduct.

    This method is often used in data warehousing scenarios where you're combining full refresh and incremental load strategies to maintain the dimension tables in a data warehouse.

    Now your second Question ;

    In the third case, when the row is an updated row in the staging table (dbo.StageProduct), we need to ensure two things:

    1. The updated row from the staging table is included in the final table (dbo.DimProductUpsert).
    2. The old version of the row from the original dimension table (dbo.DimProduct) is excluded from the final table.

    The SQL snippet provided seems to aim to achieve this as follows:

    • It first adds all rows from the staging table to dbo.DimProductUpsert, assuming that the staging table contains the most recent data (both inserts and updates).
    • It then attempts to add all rows from the original dimension table unless a matching row (by ProductID) exists in the staging table.

    Here’s where your question about the third case comes in. The intent of the WHERE NOT EXISTS clause is to prevent the old version of the rows from dbo.DimProduct from being included if an updated version exists in dbo.StageProduct.

    WHERE NOT EXISTS
    (
        SELECT *
        FROM dbo.StageProduct AS stg
        WHERE stg.ProductId = dim.ProductBusinessKey
    )
    

    So, the third case will work as follows:

    • For updated rows, the subquery will return true because a matching ProductID is found in the staging table.
    • The NOT in front of EXISTS will negate this to false.
    • As a result, the old row from dbo.DimProduct will not be included in the final set, which is the intended behavior.

    The final dbo.DimProductUpsert table will then have all new rows, all updated rows (from the staging table), and all unchanged rows from the original dimension table.


    I hope this helped!

    The answer or portions of it may have been assisted by AI Source: ChatGPT Subscription

    Kindly mark the answer as Accepted and Upvote in case it helped or post your feedback to help !

    Regards

    1 person found this answer helpful.
    0 comments No comments

  2. Konstantinos Passadis 19,496 Reputation points MVP
    2024-01-04T04:12:50.31+00:00

    Hello @Linus Hillemann !

    Welcome to Microsoft QnA!

    The reason that updated rows in the staging table are captured in the new created table dbo.DimProductUpsert is because the code uses a UNION ALL operator to combine the rows from the staging table and the existing rows in the dbo.DimProduct table. The UNION ALL operator does not remove duplicates, so it will include all rows from both tables, even if they have the same primary key.

    However, the code also includes a WHERE NOT EXISTS clause to filter out any rows from the dbo.DimProduct table that have the same primary key as a row in the staging table. This ensures that only new or updated rows are inserted into the dbo.DimProductUpsert table.

    To be more specific, the WHERE NOT EXISTS clause checks whether there is a row in the staging table with the same ProductID as a row in the dbo.DimProduct table. If there is no such row, the row from the dbo.DimProduct table is included in the UNION ALL result set. If there is such a row, the row from the dbo.DimProduct table is excluded from the UNION ALL result set.

    https://www.w3schools.com/mySQl/sql_ref_union.asp


    I hope this helps!

    The answer or portions of it may have been assisted by AI Source: ChatGPT Subscription

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards

    0 comments No comments

  3. Smaran Thoomu 23,095 Reputation points Microsoft External Staff Moderator
    2024-01-04T04:52:04.2866667+00:00

    Hi @Labyrant

    Welcome to Microsoft Q&A Platform. Thank you for posting your query.

    In additional to @Konstantinos Passadis response, The query is using a technique called "upsert" which is a combination of update and insert. The query first selects all new or updated rows from the staging table and inserts them into the new table dbo.DimProductUpsert. Then, it selects all existing rows from the dbo.DimProduct table and appends them to the dbo.DimProductUpsert table. Finally, it removes any duplicates from the dbo.DimProductUpsert table based on the primary key.

    Regarding your question about how an updated row can have the same primary key but a different value for an attribute, the primary key is used to identify the row, but it does not prevent the row from being updated. In this case, the query is designed to update the row with the new attribute value, rather than inserting a new row with the same primary key.

    I hope this helps clarify things for you! If you have any further questions, please don't hesitate to ask.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

    0 comments No comments

  4. Linus Hillemann 0 Reputation points
    2024-01-04T20:21:20.8266667+00:00

    Alright, got it.

    0 comments No comments

  5. Linus Hillemann 0 Reputation points
    2024-01-05T13:52:58.9+00:00

    Thank you.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.