Implementing Slowly Changing Dimension Type 2

This article presents an example implementation of SCD Type 2. It also explores the exceptional cases where updates occur in both driving and referential tables for a certain natural key.

Refer to Slowly changing dimensions for different types of SCDs with examples. The following table summarizes the common use case for each type.

SCD type Use case scenarios
SCD type 0 Durable data like constants, date dimensions
SCD type 1 Only current version of truth available, no need for historical data
SCD type 2 Need historical versions of data and the periods during which they were current
SCD type 3 Need for current data and the previous last value (alternate reality)
SCD type 4 Used when a group of attributes in a dimension rapidly changes and is split off to a mini–dimension (rapidly changing monster dimension.)
SCD type 5 Rarely used - to accurately preserve historical attribute values, plus report historical facts according to current attribute values; SCD 5 is equivalent to SCD 1 + SCD 4
SCD type 6 Rarely used - Unpredictable Changes with Single-Version Overlay; SCD 6 is equivalent to SCD 1 + SCD 2 + SCD 3
SCD type 7 Rarely used - Hybrid technique that supports both as-was and as-is reporting

Business Scenario

In this scenario, there are three delta tables that act as the source:

  1. employee
  2. employee_address
  3. employee_information

The aim is to create a single unified target table called target_employee that can provide a consolidated view of these tables. This presentation layer table also holds the versioning for all the three source tables.

The SCD Type 2 logic is implemented in a pyspark notebook. The following diagram provides the high-level architecture:

scd2 multiple tables

Understand applied concepts and capabilities

Here are the key concepts and delta table capabilities that are used for this use case.

Using driving and referential tables

  • The driving table represents the primary or key source table from which most of the final entity's information is derived. In this case, employee table is the driving table.
  • The referential tables act as additional sources, providing supplementary details to enrich the target entity. In this case, employee_address and employee_information tables are the referential tables.

Using change data feed

Change data feed allows Azure Databricks to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records change events for all the data written into the table.

For this use case, the change data feed has been enabled for all the three source tables.

How to use delta table properties

Here are some other characteristics of Delta tables that are used in this use case:

  • Each operation that modifies a Delta Lake table creates a new table version.
  • Delta tables offer "time travel" capabilities, allowing users to query a table's previous state based on timestamp or table version.
  • The historic information can be used to audit operations, rollback a table, or query a table at a specific point in time using time travel.

Use sample data

The use case is explained based on the following sample data.

How employee sample data is used

 employee_id  first_name  last_name  email                         create_update_date
---------------------------------------------------------------------------------------
 100          Steven      King       steven.king@contoso.com       2023-01-27
 101          Neena       Kochhar    neena.kocchar@contoso.com     2023-02-09
 102          Lex         De Haan    lex.de-haan@contoso.com       2023-04-07
 103          Alexander   Hunold     alexander.hunold@contoso.com  2023-02-24
 104          Bruce       Ernst      bruce.ernst@contoso.com       2023-04-24

How employee address sample data is used

employee_id  city      region         street_address      country         create_update_date
--------------------------------------------------------------------------------------------
100          New York  NY             123 Main St         United States   2023-01-27
101          London    England        456 Park Ave        United Kingdom  2023-02-09
102          Paris     ële-de-France  789 Rue de la Paix  France          2023-04-07
103          Tokyo     Tokyo          1-2-3 Shibuya       Japan           2023-02-24
104          Sydney    NSW            456 George St       Australia       2023-04-24

How employee information is used

employee_id  salary  is_fte  is_remote  employment_date  create_update_date
---------------------------------------------------------------------------
100          5000    True    False      2020-01-15       2023-01-27
102          5500    True    False      2021-03-22       2023-04-07
101          6000    True    True       2019-05-10       2023-02-09
103          5200    True    True       2018-11-30       2023-02-24
104          5800    True    False      2017-09-12       2023-04-24

How target employee sample data is used

id                               employee_id first_name last_name     email                    city     region        street_address     country        salary is_fte is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from valid_to
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
d4d0569aa28e28d83a3d9a8888a30a27 101         Neena      Kochhar   neena.kocchar@contoso.com    London   England       456 Park Ave       United Kingdom 6000   True   True      2019-05-10      2023-02-09                  2023-02-09                 2023-02-09              1                      1                   1                       1          2023-02-09 9999-12-31
be613e2a8166f377f65967bd9073188d 102         Lex        De Haan   lex.de-haan@contoso.com      Paris    ële-de-France 789 Rue de la Paix France         5500   True   False     2021-03-22      2023-04-07                  2023-04-07                 2023-04-07              1                      1                   1                       1          2023-04-07 9999-12-31
5a60257eabea5111d4e8aa0cf56fb109 103         Alexander  Hunold    alexander.hunold@contoso.com Tokyo    Tokyo         1-2-3 Shibuya      Japan          5200   True   True      2018-11-30      2023-02-24                  2023-02-24                 2023-02-24              1                      1                   1                       1          2023-02-24 9999-12-31
014830e89472a8ac527a5c6b2c1e5fe5 100         Steven     King      steven.king@contoso.com      New York NY            123 Main St        United States  5000   True   False     2020-01-15      2023-01-27                  2023-01-27                 2023-01-27              1                      1                   1                       1          2023-01-27 9999-12-31
3b2c89840dc66c48dedbbb8ecac15ada 104         Bruce      Ernst     bruce.ernst@contoso.com      Sydney   NSW           456 George St      Australia      5800   True   False     2017-09-12      2023-04-24                  2023-04-24                 2023-04-24              1                      1                   1                       1          2023-04-24 9999-12-31

The target target_employee table is created from driving employee table and the referential employee_address and employee_information tables. There are following additional columns for tracking changes from each of these source tables:

  • employee_create_update_date: Tracks the created/updated date of the record for the employee table.
  • address_create_update_date: Tracks the created/updated date of the record for the address table.
  • info_create_update_date: Tracks the created/updated date of the record for the information table.
  • employee_commit_version: Tracks the commit version of the record from the employee table.
  • address_commit_version: Tracks the commit version of the record from the address table.
  • info_commit_version: Tracks the commit version of the record from the information table.
  • valid_flag: Tracks if the record is the latest for that natural key or a stale record.
  • valid_from: Tracks from when the record is active.
  • valid_to: Tracks to when the record is active.

Implement handled scenarios

There are two scenarios that are handled in this use case:

Scenario 1: multiple updates

In this scenario, data for a particular natural key is updated in both the driving and referential tables.

To implement SCD Type 2 for this scenario, a two-step process is followed.

Step 1: Separate capture of updates

When an update occurs in the driving table, it is captured separately and marked with an appropriate version number, signifying the change. Similarly, when a change happens in any of the referential tables, it is also captured independently, retaining its own version number.

Step 2: union of updates

To obtain a comprehensive view of all the updates made to a particular record, a union of the updates from both the driving and referential tables is performed. This union ensures that all relevant changes are accounted for, and the historical chain of updates remains intact.

Scenario 2: simultaneous updates

In certain exceptional cases, both the driving and referential tables might receive updates for the same natural key simultaneously. To maintain the integrity of historical data, the proposal is to create a historical chain of updates.

In this scenario, the driving table's update is treated as the primary update, and the relevant referential table updates are linked to it in a chronological order. This approach preserves the historical context and provides a comprehensive view of all changes made to the record.

Consider a scenario with the following updates:

-- Apply some updates to "employee" table.
update employee set first_name = 'Steven2',  dt = '2023-07-13' where employee_id = '100';
update employee set first_name = 'Steven3' , dt = '2023-07-17' where employee_id = '100';

-- Apply some updates to "employee_address" table.
update employee_address set city = 'Lille',     dt = '2023-07-13'                where employee_id = '102';
update employee_address set city = 'NewCastle', dt = '2023-07-14'                where employee_id = '104';
update employee_address set street_address = '456 Baker Street', dt ='2023-07-16' where employee_id = '100';

-- Apply some updates to "employee_information" table.
update employee_information set salary = 7000,  dt = '2023-07-17' where employee_id = '108';
update employee_information set salary = 8000,  dt = '2023-07-16' where employee_id = '102';
update employee_information set is_fte = False, dt = '2023-07-15' where employee_id = '104';

Understand the implementation

Here is the high-level implementation logic for this use case:

  • Gather the changes (updates and inserts) that have arrived in all three tables.
    • Read the commit versions stored in target_employee table for employee, employee_address and employee_information tables and get the greatest value.
    • Get the changes that have a commit version greater than the calculated value in the previous step. Do it for all of the three tables.
    • Create a change dataframe for each source table.
  • Create a dataframe capturing referential changes from employee_address and employee_information tables.
    • read the entire driving table employee to get all natural keys.
    • Perform a join with change dataframe that captures updates for records of employee_address and employee_information tables.
    • Create a new column changed_commit which will hold a value, if an update has occurred for that natural key in either address or information table.
    • Filter out the records with natural key for which an update has happened using changed_commit column.
  • Create a dataframe capturing driving table changes.
    • read the updates of the driving table and join it with changes from the address and information tables.
  • Unite the driving and referential table changes to get all the changes.
  • Apply the merge logic.

Choose methods to track SCD type 2

The solution describes two methods of tracking SCD type 2 for the target entity.

Method 1: Capture the latest updates only

In this method, if multiple updates are coming from sources for each record, only the latest updates are captured. To get all the updates, the following logic is applied:

spark.sql("""
  select *
    from (select *,
                row_number() over (partition by employee_id
                                       order by changed_commit desc) as rank
            from changes_employee_target)
   order by
         employee_id,
         changed_commit desc
""")

And then, the latest updates are captured using the following query:

spark.sql("""
  select *
    from employee_all_updates
   where rank=1
""")

Consider the scenario, where for a certain natural key, an update has happened in the employee_address table but not in other tables. While capturing the changes, the columns corresponding to other tables will show as null that is not a right representation of the update. The record will look like this:

employee_id first_name last_name email                   city     region street_address   country       salary is_fte is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from                   valid_to                     changed_commit rank
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
100         Steven3    King      steven.king@contoso.com New York NY     456 Baker Street United States null   null   null      null            2023-07-17                  2023-07-13                 null                    4                      null                4                       1          2023-07-17 9999-12-31 4              1

To solve the above issue, the coalesce function is used to capture values of columns in the target_employee table that have not changed. This approach represents the more accurate view of the record.

Note that the above approach does not cover the edge case where the actual value being updated for a column is Null.

After coalescing, the record looks like this:

employee_id first_name last_name email                     city     region street_address   country       salary   is_fte   is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from                   valid_to                     _commit_version
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
100         Steven3    King      steven.king@contoso.com New York NY     456 Baker Street United States **5000** **True** **False** 2020-01-15      2023-07-17                  2023-07-13                 2023-01-27              4                      1                   4                       1          2023-07-17 9999-12-31 4

The values for columns is_fte, is_remote, and salary are retained as these column values have not changed.

Below, only the latest update has been considered. The date of job run ("2023-08-03") is used to derive valid_from and valid_to columns.

employee_id first_name last_name email                     city     region street_address   country       salary is_fte is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from valid_to
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
100         Steven     King      steven.king@contoso.com New York NY     123 Main St      United States 5000   True   False     2020-01-15      2023-01-27                  2023-01-27                 2023-01-27              1                      1                   1                       0          2023-01-27 2023-08-11
100         Steven3    King      steven.king@contoso.com New York NY     456 Baker Street United States 5000   True   False     2020-01-15      2023-07-17                  2023-07-13                 2023-01-27              4                      1                   4                       1          2023-08-11 9999-12-31

Method 2: Historical chaining of records

In this method, all the updates that happen for a certain id are captured and a chain of history is maintained for those updates.

The lead function takes the next start date for the same natural key, ordered by date (dt). The row_number function ranks the record in the order of modification in the source table.

  lead(greatest(employee_create_update_date, address_create_update_date, info_create_update_date))
        over (partition by employee_id
                  order by employee_create_update_date) as valid_to,
   ---take the greatest value of dates
  row_number()
        over (partition by employee_id
                  order by employee_create_update_date) rn
  • Store all the changed records in a table called changes_history.
  • Update valid_flag=1 and valid_to= <infinite_end_date> for highest ranking record.
  • Proceed to close the current record in target table using update sql operation.
update target_employee as t
   set valid_to = (
         select first(valid_to) --- rn=1 stored in changes_history corresponds to original record in target table . Use the calculated valid_to to hence update original record
           from changes_history as c
          where c.employee_id = t.employee_id
            and c.rn = 1),
       valid_flag = 0
 where employee_id IN (
          select employee_id
            from changes_history
           where rn = 1)
---latest records now become stale records
   and valid_flag = 1;
  • Insert the records into target table having rank > 1.

Below is an example of how the historical chain of updates is maintained for a natural key. Here, all the prior updates are considered to maintain a chain of history. The valid _from and valid_to columns consider the dt column in the source that indicates when the row was added or updated (in this case 2023-07-13, 2023-07-17).

id                               employee_id first_name last_name email                   city     region street_address   country       salary is_fte is_remote employment_date employee_create_update_date address_create_update_date info_create_update_date address_commit_version info_commit_version employee_commit_version valid_flag valid_from valid_to
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5e64f88f256b598f6c8a801b8d560285 100         Steven2    King      steven.king@contoso.com New York NY     456 Baker Street United States 5000   True   False     2020-01-15      2023-07-13                  2023-07-13                 2023-01-27              4                      1                   3                       0          2023-07-13 2023-07-17
08f663dec02e58679ba6e6af5cf67882 100         Steven3    King      steven.king@contoso.com New York NY     456 Baker Street United States 5000   True   False     2020-01-15      2023-07-17                  2023-07-13                 2023-01-27              4                      1                   4                       1          2023-07-17 9999-12-31
014830e89472a8ac527a5c6b2c1e5fe5 100         Steven     King      steven.king@contoso.com New York NY     123 Main St      United States 5000   True   False     2020-01-15      2023-01-27                  2023-01-27                 2023-01-27              1                      1                   1                       0          2023-01-27 2023-07-13

For more information