Sunday, April 28, 2013

Datastage:Slowly Changing Dimension (SCD) Stage

There are a number of ways to manage slowly changing dimensions using DataStage. The choice of which method to use depends partly on the type of dimension you are populating.

Types of SCD

  • Type 1 - no history, records are simply overwritten
  • Type 2 - changes trigger an update that expire the existing record and an insert to add the new record
  • Type 1 and Type 2 - fields are classified as either type1 or type 2
DataStage offers specialized stages as well as the more generic stages. Regardless of the implementation, the basic algorithm for adding new records to a dimension is as follows:
for each new record get business key check if business key exists in dimension if so then check if record has changed if so then if (type 1 change) then update dimension record if (type 2 change) then expire dimension record and add new dimension record end if else add new dimension record end if next record
In this tutorial, I will be discussing the SCD stage. I've got a very simple employee table with the following structure.

The natural key for the table is Name. I will be using this table as the source for the employee dimension. The employee dimension structure is shown below.

SCD Stage

The SCD stage doesn't have what I would call an intuitive interface. It takes 2 inputs (a primary link, a reference link) and 2 output links . The primary link is connected to the incoming data while the reference link is connected to the dimension.

Open the SCD stage and there is a type of wizard called "Fast Path". I found this to be the simplest way of editing the stage.

Step 1 - Define the output

Define the output link. The output link is used to propagate fields to fact tables or to other stages including other SCD stages.

Step 2 - Define the lookup fields

On this page, we define the business key, type 1 and type 2 fields, current row indicator and the effective and expiry date fields. Note that if there are any type 2 fields, there must also be a current row indicator.

Step 3 - Define surrogate key source

This will be used to populate the EMP_ID field for new records.

Step 4 - Define the dimension fields

This page is where we define how the dimension fields are populated. Note that END_DATE and CURRENT_ROW_IND have 2 deriviations - one for adding records and one for expiring records. Note that these derivations must be either literals or functions. Also note that Job Parameters cannot be used here.

Step 5 - Define output fields

The final page defines what fields we want to send to the output. We have the choice of fields from the Primary and Reference link.


For my money, the SCD stage is not really great value. It basically combines a lookup and a transform stage but it does it in a very clumsy way. Also, it does not give you the ability to process deleted records. It's designed to process incremental loads so if you have a full load and you need to detect deleted records, this stage will not do it for you.

0 comments:

Post a Comment

Please Post your Comments..!