There are a number of ways to manage slowly changing dimensions using
DataStage. The choice of which method to use depends partly on the type
of dimension you are populating.
The natural key for the table is Name. I will be using this table as the source for the employee dimension. The employee dimension structure is shown below.
Open the SCD stage and there is a type of wizard called "Fast Path". I found this to be the simplest way of editing the stage.
For my money, the SCD stage is not really great value. It basically combines a lookup and a transform stage but it does it in a very clumsy way. Also, it does not give you the ability to process deleted records. It's designed to process incremental loads so if you have a full load and you need to detect deleted records, this stage will not do it for you.
Types of SCD
- Type 1 - no history, records are simply overwritten
- Type 2 - changes trigger an update that expire the existing record and an insert to add the new record
- Type 1 and Type 2 - fields are classified as either type1 or type 2
for each new record
get business key
check if business key exists in dimension
if so then
check if record has changed
if so then
if (type 1 change) then update dimension record
if (type 2 change) then expire dimension record and add new dimension record
end if
else
add new dimension record
end if
next record
In this tutorial, I will be discussing the SCD stage. I've got a very simple employee table with the following structure.The natural key for the table is Name. I will be using this table as the source for the employee dimension. The employee dimension structure is shown below.
SCD Stage
The SCD stage doesn't have what I would call an intuitive interface. It takes 2 inputs (a primary link, a reference link) and 2 output links . The primary link is connected to the incoming data while the reference link is connected to the dimension.Open the SCD stage and there is a type of wizard called "Fast Path". I found this to be the simplest way of editing the stage.
Step 1 - Define the output
Define the output link. The output link is used to propagate fields to fact tables or to other stages including other SCD stages.Step 2 - Define the lookup fields
On this page, we define the business key, type 1 and type 2 fields, current row indicator and the effective and expiry date fields. Note that if there are any type 2 fields, there must also be a current row indicator.Step 3 - Define surrogate key source
This will be used to populate the EMP_ID field for new records.Step 4 - Define the dimension fields
This page is where we define how the dimension fields are populated. Note that END_DATE and CURRENT_ROW_IND have 2 deriviations - one for adding records and one for expiring records. Note that these derivations must be either literals or functions. Also note that Job Parameters cannot be used here.Step 5 - Define output fields
The final page defines what fields we want to send to the output. We have the choice of fields from the Primary and Reference link.For my money, the SCD stage is not really great value. It basically combines a lookup and a transform stage but it does it in a very clumsy way. Also, it does not give you the ability to process deleted records. It's designed to process incremental loads so if you have a full load and you need to detect deleted records, this stage will not do it for you.
0 comments:
Post a Comment
Please Post your Comments..!