Datastage:Slowly Changing Dimension (SCD) Stage ~ Midhun`s Home

Sunday, April 28, 2013

Datastage:Slowly Changing Dimension (SCD) Stage

There are a number of ways to manage slowly changing dimensions using DataStage. The choice of which method to use depends partly on the type of dimension you are populating.

Types of SCD

Type 1 - no history, records are simply overwritten
Type 2 - changes trigger an update that expire the existing record and an insert to add the new record
Type 1 and Type 2 - fields are classified as either type1 or type 2

DataStage offers specialized stages as well as the more generic stages. Regardless of the implementation, the basic algorithm for adding new records to a dimension is as follows:


for each new record
  get business key
  check if business key exists in dimension
  if so then
    check if record has changed
    if so then
      if (type 1 change) then update dimension record
      if (type 2 change) then expire dimension record and add new dimension record
    end if
  else
    add new dimension record
  end if
next record

In this tutorial, I will be discussing the SCD stage. I've got a very simple employee table with the following structure.

The natural key for the table is Name. I will be using this table as the source for the employee dimension. The employee dimension structure is shown below.

SCD Stage

The SCD stage doesn't have what I would call an intuitive interface. It takes 2 inputs (a primary link, a reference link) and 2 output links . The primary link is connected to the incoming data while the reference link is connected to the dimension.

Open the SCD stage and there is a type of wizard called "Fast Path". I found this to be the simplest way of editing the stage.

Step 1 - Define the output

Define the output link. The output link is used to propagate fields to fact tables or to other stages including other SCD stages.

Step 2 - Define the lookup fields

On this page, we define the business key, type 1 and type 2 fields, current row indicator and the effective and expiry date fields. Note that if there are any type 2 fields, there must also be a current row indicator.

Step 3 - Define surrogate key source

This will be used to populate the EMP_ID field for new records.

Step 4 - Define the dimension fields

This page is where we define how the dimension fields are populated. Note that END_DATE and CURRENT_ROW_IND have 2 deriviations - one for adding records and one for expiring records. Note that these derivations must be either literals or functions. Also note that Job Parameters cannot be used here.

Step 5 - Define output fields

The final page defines what fields we want to send to the output. We have the choice of fields from the Primary and Reference link.

For my money, the SCD stage is not really great value. It basically combines a lookup and a transform stage but it does it in a very clumsy way. Also, it does not give you the ability to process deleted records. It's designed to process incremental loads so if you have a full load and you need to detect deleted records, this stage will not do it for you.

Midhun`s Home

Sunday, April 28, 2013