다음을 통해 공유


Assigning surrogate key to dimension tables in SQL DW and APS

Reviewed by: James Rowland-Jones, John Hoang, Denzil Ribeiro, Sankar Subramanian

This article explains how to assign monotonically increasing surrogate/synthetic keys to dimension tables in SQL DW (or APS) using T-SQL. We are going to highlight two possible methods of doing this:

  1. Assign Surrogate keys to the dimension tables where the dimensions are generated in a source system and loaded into SQL DW (or APS).
  2. Extract dimension values from the fact table and assign surrogate keys when the loaded data is such that a single record set has both fact and dimension values.

Background

At the time of the writing of this blog, SQL DW (or APS) does not have the ability to generate IDENTITY values for tables like SQL Server.  Customers however, need a way to generate surrogate values to preserve uniqueness of data entities. Surrogate keys tend to be compact (int, bigint) data types that also to be known to facilitate faster joins and non-redundant distribution in b-tree structure.

NOTE: In SQL DW or (APS), the row_number function generally invokes broadcast data movement operations for dimension tables. This data movement cost is very high in SQL DW. For smaller increment of data, assigning surrogate key this way may work fine but for historical and large data loads this process may take a very long time. In some cases, it may not work due to tempdb size limitations. Our advice is to run the row_number function in smaller chunks of data.

Solution

Method 1:

Following is the structure of customer table in SQL DW. Let’s assume for this table we have already generated surrogate keys for the data in SQL DW. In table c_customer_sk is the surrogate key column.

CREATE TABLE [dbo].[customer]
(
[c_customer_sk] INT NOT NULL, --Surrogate Key
[c_customer_id] CHAR(16) NOT NULL, -- Business Key
[c_current_cdemo_sk] INT NULL,
[c_current_hdemo_sk] INT NULL,
[c_current_addr_sk] INT NULL,
[c_first_shipto_date_sk] INT NULL,
[c_first_sales_date_sk] INT NULL,
[c_salutation] CHAR(10) NULL,
[c_first_name] CHAR(20) NULL,
[c_last_name] CHAR(30) NULL,
[c_preferred_cust_flag] CHAR(1) NULL,
[c_birth_day] INT NULL,
[c_birth_month] INT NULL,
[c_birth_year] INT NULL,
[c_birth_country] VARCHAR(20) NULL,
[c_login] CHAR(13) NULL,
[c_email_address] CHAR(50) NULL,
[c_last_review_date] CHAR(10) NULL
);

We have some new data that needs to be loaded into customer table with surrogate key assigned. Load the new customer data into a transient table called customer_staging that does not contain the c_customer_sk column.

CREATE TABLE [dbo].[customer_staging]
(
[c_customer_id] CHAR(16) NOT NULL, -- Business Key
[c_current_cdemo_sk] INT NULL,
[c_current_hdemo_sk] INT NULL,
[c_current_addr_sk] INT NULL,
[c_first_shipto_date_sk] INT NULL,
[c_first_sales_date_sk] INT NULL,
[c_salutation] CHAR(10) NULL,
[c_first_name] CHAR(20) NULL,
[c_last_name] CHAR(30) NULL,
[c_preferred_cust_flag] CHAR(1) NULL,
[c_birth_day] INT NULL,
[c_birth_month] INT NULL,
[c_birth_year] INT NULL,
[c_birth_country] VARCHAR(20) NULL,
[c_login] CHAR(13) NULL,
[c_email_address] CHAR(50) NULL,
[c_last_review_date] CHAR(10) NULL
);

Now issue the following insert command that will insert the rows from customer_staging to customer table while assigning unique increasing value to the column c_customer_sk column.

INSERT INTO dbo.customer
SELECT maxid.maxid + ROW_NUMBER() OVER (ORDER BY maxid) AS [c_customer_sk]
,c_customer_id
,c_current_cdemo_sk
,c_current_hdemo_sk
,c_current_addr_sk
,c_first_shipto_date_sk
,c_first_sales_date_sk
,c_salutation
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_day,c_birth_month
,c_birth_year,c_birth_country
,c_login,c_email_address
,c_last_review_date
FROM dbo.customer_staging
CROSS JOIN
(SELECT ISNULL(MAX(c_customer_sk),0) AS maxid FROM dbo.customer)maxid

If source customer table is small enough such that the ETL logic uploads the whole customer table every time into customer_staging inclusive of new data, a left outer join technique could also be used to exclude the duplicates before inserting the data into customer table.

Note: It is our recommendation to avoid loading the full table for very large dimensions every time.

INSERT INTO dbo.customer
SELECT maxid.maxid + ROW_NUMBER() OVER (ORDER BY maxid) AS [c_customer_sk]
, CStaging.*
FROM
(
select cs.* from dbo.customer_staging cs
left outer join dbo.customer c
on cs.c_customer_id = c.c_customer_id
where c.c_customer_id is null
) CStaging
CROSS JOIN
(SELECT ISNULL(MAX(c_customer_sk),0) AS maxid FROM dbo.customer)maxid;

Method 2:

In some cases, the values of a dimension table need to be extracted from the fact table. The example below is similar to the last one with few simple changes.

Let’s assume dim_customer is our existing dimension table that is populated with attributes from the fact table stage_store_sales.

CREATE TABLE [dbo].[dim_customer]
(
[ss_customer_sk] INT NULL,
[ss_customer_id] CHAR(16) NULL,
[ss_customer_name] VARCHAR(50) NULL
);
CREATE TABLE [dbo].[stage_store_sales]
(
[ss_sold_date_sk] INT NULL,
[ss_sold_time_sk] INT NULL,
[ss_item_sk] INT NOT NULL,
[ss_customer_id] CHAR(16) NULL,  -- dim value
[ss_customer_name] VARCHAR(50) NULL, -- dim value
[ss_cdemo_sk] INT NULL,
[ss_hdemo_sk] INT NULL,
[ss_addr_sk] INT NULL,
[ss_store_sk] INT NULL,
[ss_promo_sk] INT NULL,
[ss_ticket_number] INT NOT NULL,
[ss_quantity] INT NULL,
[ss_wholesale_cost] DECIMAL(7,2) NULL,
[ss_list_price] DECIMAL(7,2) NULL,
[ss_sales_price] DECIMAL(7,2) NULL,
[ss_ext_discount_amt] DECIMAL(7,2) NULL,
[ss_ext_sales_price] DECIMAL(7,2) NULL,
[ss_ext_wholesale_cost] DECIMAL(7,2) NULL,
[ss_ext_list_price] DECIMAL(7,2) NULL,
[ss_ext_tax] DECIMAL(7,2) NULL,
[ss_coupon_amt] DECIMAL(7,2) NULL,
[ss_net_paid] DECIMAL(7,2) NULL,
[ss_net_paid_inc_tax] DECIMAL(7,2) NULL,
[ss_net_profit] DECIMAL(7,2) NULL
)
WITH (DISTRIBUTION = HASH ([ss_item_sk]));

The following view contains the logic to find and exclude the duplicates from the fact and the dimension table. Only new values will be inserted into the dimension table.

CREATE VIEW [dbo].[V_stage_dim_customer] AS SELECT DISTINCT [ss_customer_sk] = maxid.maxid + ROW_NUMBER() OVER (ORDER BY maxid),[ss_customer_id], ss_customer_name
FROM
(
SELECT DISTINCT ss.[ss_customer_id]
, ss.ss_customer_name
FROM [dbo].[stage_store_sales] [ss] --staging fact table
LEFT OUTER JOIN [dbo].[dim_customer] [dc] --dimension table
ON [ss].[ss_customer_id] = [dc].[ss_customer_id]
WHERE [dc].[ss_customer_id] IS NULL AND [ss].[ss_customer_id] IS NOT NULL
) Cstaging
CROSS JOIN (SELECT ISNULL(MAX([ss_customer_sk]),0) AS maxid FROM [dbo].[dim_customer]) maxid;

Run the following insert-select statement to insert new values into the dimension table from the view.

INSERT INTO dim_customer
SELECT * FROM V_stage_dim_customer;

Note: In some cases, a row number is present in the data file or can easily be added. In that case this row number added with the max value from the existing dimension can be used to create the surrogate key.

Comments

  • Anonymous
    August 29, 2016
    Hi Murshed,I am using external tables (polybase) to load via AZcopy and afterward I want to generate a unique identifier (something similar to surrogate key) .I can use the above row number approach but we have performance issues and it either fails or completes with terrible performance.Is there any alternative way to maintain data integrity and performing optimal lookups for incremental data .My daily file loads are in Gbs against similar volume in target table.--vinay