Managing Dimensional Data Changes: A Refresher to Slowly Changing Dimensions (SCDs) and their Types
Imagine you are working with a retail company that sells products online. One of the dimensions you are tracking is “Product”, and you want to keep track of the changes in product attributes over time.
Let's say you have a product called “Blue T-shirt”. Initially, the attributes of the products are:
+--------------+--------+----------+-------------------------------------+
| Name | Price | Category | Description |
+--------------+--------+----------+-------------------------------------+
| Blue T-shirt | 560/- | clothing | premium clothing for premium people |
+--------------+--------+----------+-------------------------------------+
Now the price of the Blue T-shirt increased from 560/- to 750/-. Here’s how the SCD types handle this change.
- SCD Type-1
Overwrite the existing record with the new data. After the change, the attributes in the record look like this
+--------------+--------+----------+-------------------------------------+
| Name | Price | Category | Description |
+--------------+--------+----------+-------------------------------------+
| Blue T-shirt | 750/- | clothing | premium clothing for premium people |
+--------------+--------+----------+-------------------------------------+
This is useful when you only need the most current data and historical changes are unimportant. Good fit when the attributes do not need any tracking.
- SCD Type-2
Maintains a historical record of changes by creating new rows for each change, with a start and end date for each data version. After the price change, you would have two rows in the table.
+--------------+--------+----------+-------------------------------------+------------+------------+
| Name | Price | Category | Description | start_date | end_date |
+--------------+--------+----------+-------------------------------------+------------+------------+
| Blue T-shirt | 560/- | clothing | premium clothing for premium people | 20-08-2023 | 22-08-2023 |
| Blue T-shirt | 750/- | clothing | premium clothing for premium people | 22-08-2023 | nil |
+--------------+--------+----------+-------------------------------------+------------+------------+
This type is useful when tracking historical changes for the attributes.
- SCD Type-3
Maintains limited historical changes. After the price change, you would update the existing row with a new attribute to store the old price.
+--------------+---------------+-----------+----------+-------------------------------------+
| Name | Current Price | Old Price | Category | Description |
+--------------+---------------+-----------+----------+-------------------------------------+
| Blue T-shirt | 560/- | 750/- | clothing | premium clothing for premium people |
+--------------+---------------+-----------+----------+-------------------------------------+
This type is useful when you want to track historical changes but don’t need a full historical record.
The story does not end here. Now that you have the data inserted, how can we efficiently retrieve the latest record and previous version records?
- SCD Type-1
Retrieving the latest record is straightforward as you do not have any versions stored in the table.
Store the changes in an Audit Log to retrieve the historical data.
SELECT *
FROM ProductDimension
WHERE ProductID = [desired_product_id]
ORDER BY StartDate DESC
LIMIT 1;
- SCD Type-2
Retrieving the latest record requires filtering by start_date as current_date or the end_date as nil.
SELECT *
FROM ProductDimension
WHERE ProductID = [desired_product_id]
AND EndDate = 'nil' or EndDate = '9999-12-31'; -- or use MAX(EndDate) if "future date" approach
Retrieving previous records
SELECT *
FROM ProductDimension
WHERE ProductID = [desired_product_id]
AND EndDate != '9999-12-31'
ORDER BY EndDate DESC;
- SCD Type-3
This is the same as Type-1 when retrieving the latest data. But, when you want to retrieve the old versions of the data, you can leverage them by using the old_price.
One important point to note here is Type-3 is less comprehensive than Type-2.
Retrieving the latest record
SELECT *
FROM ProductDimension
WHERE ProductID = [desired_product_id]
-- Retrieve the latest version by checking version columns or date columns
ORDER BY [appropriate_column] DESC
LIMIT 1;
Retrieving the previous version
SELECT *
FROM ProductDimension
WHERE ProductID = [desired_product_id]
-- Retrieve previous versions based on version columns or date columns
ORDER BY [appropriate_column] DESC;