Modelling SCD Type 6 (1+2+3) Slowly Changing Dimensions using dbt and Looker

Mark Rittman

Like many of our clients we use Hubspot to track our sales opportunities, and use Stitch and dbt to extract and then transform data on these deals for analysis in Looker.

Stitch extracts data from Hubspot every 30 minutes and our Looker dashboards show the current state of our pipeline based on the each deal’s current funnel stagel; similarly our Looker explore allows users to slice-and-dice this pipeline as-of now by channel, technology type, resource profile and deal source.


current_sprint_only.pngcurrent_sprint_only.png

Together these give us useful insights into where our new business comes from today, what skills our team will need for deals just about to close and whether we’re hitting targets today for enquiries, open deals and closed business.

But there’s also a category of sales metrics that require us to look back at the history of each deal over time to see whether we’re moving deals through the sales process fast enough, or to measure our sales metrics and customer counts today against those same metrics at earlier points in the year.

Hubspot as you’d expect tracks the various stages that each deal goes through, as shown in the screenshot below of one of our deals that went from enquiry through to proposal, then PoC and finally to closed won business.


hubspot2.pnghubspot2.png

In data warehouse data modelling terminology this is what’s called a “slowly changing dimension” (SCD), with our Looker model currently tracking just the current state of the deals dimension (SCD Type 1) and our need being to track the history of changes to that dimension (making it then a SCD Type 2 dimension).

Creating a history of these changes requires us to somehow track when a change happens to the details of a deal, for example changing its status from open enquiry to closed business, and then write these changes to a “history” table for this dimension with valid-from and valid-to dates that allow us to view the status of the deal at any particular point in-time.

Our preferred data modelling and transformation tool is dbt (“Data Build Tool”) and it has a built-in feature for SCD2-style versioning of changes to a table called Snapshots. This feature takes a SQL SELECT statement and the names of columns we want to use as the unique key and updated_at columns and then uses these to create a changed data capture process, automatically building for us a snapshot-versioned history of our deals table that looks like this:


scd2.pngscd2.png

It’s then relatively straightforward to add this snapshot table back into Looker as another deals view that joins to the rest of our model on the deal_id, for example, but then you’ve got two distinct LookML views within Looker that contain details on our sales deals, one of which points to a table that holds the current state of each deal and the other that points to our history table, making it tricky to answer questions such as “for each deal and it’s current deal state, what’s the average number of days its taken that deal to progress through its stages”.


two_deals.pngtwo_deals.png

A neater solution would be to create what’s termed a “Type 6” slowly changing dimension, with the “6” based on the fact that it combines features of type 1 (track current state), type 2 (track historic state) and type 3 (track previous state) slowly changing dimensions in a single table (i.e. 1+2+3 = 6).

Referencing the Kimball University website article on Type 6 slowly changing dimensions:

“Type 6 builds on the type 2 technique by also embedding current attributes in the dimension so that fact rows can be filtered or grouped by either the type 2 value in effect when the measurement occurred or the attribute’s current value. The type 6 moniker was suggested by an HP engineer in 2000 because it’s a type 2 row with a type 3 column that’s overwritten as a type 1; both 2 + 3 + 1 and 2 x 3 x 1 equal 6. With this approach, the current attributes are updated on all prior type 2 rows associated with a particular durable key, as illustrated by the following sample rows:”


type-621.pngtype-621.png

In dbt terms this means setting up our snapshot in the same way as before, using the batched_at Stitch extract metadata field to select the most recent row from the Hubspot deal extract table, like this:

/*
  This snapshot table will live in:
    analytics.snapshots.deals_snapshot
*/

{% snapshot deals_snapshot %}

    {{
        config(
          target_schema='snapshots',
          strategy='check',
          unique_key='deal_id',
          check_cols='all'
        )
    }}

with deals as (

        select *  from (
          select *,
           MAX(_sdc_batched_at) OVER 
            (PARTITION BY dealid 
             ORDER BY _sdc_batched_at 
             RANGE BETWEEN UNBOUNDED PRECEDING 
             AND UNBOUNDED FOLLOWING) latest_sdc_batched_at
           from 
             {{ source('hubspot', 'deals') }})
           where latest_sdc_batched_at = _sdc_batched_at
),

new_deal as (

    select
      properties.dealname.value AS dealname,
      properties.hubspot_owner_id.value AS hubspot_owner_id,
      properties.dealstage.value AS dealstage,
      properties.dealstage.value as dealstage_id,
      properties.dealstage.timestamp as dealstage_ts,
      properties.pipeline.value AS pipeline,
      properties.closedate.value AS closedate,
      properties.createdate.value AS createdate,
      ...
      dealid AS deal_id,
      _sdc_batched_at

    from deals

)

select * from new_deal

{% endsnapshot %}

Then we create a dbt model that reads from this snapshot and uses the snapshot row contents as-is for our SCD2 column values, then uses the LAG() OVER() analytic function to get our SCD3 previous value columns and the LAST_VALUE() OVER () analytic function to get our current value columns, like this:

scd_deals as (

    select

-- SCD2 "historical values" attributes

    *,

-- SCD3 "previous value" attributes


      lag(dealstage_ts) over 
    (partition by deal_id 
     order by dbt_updated_at) as previous_dealstage_ts,
      lag(stage_displayorder) over 
    (partition by deal_id 
     order by dbt_updated_at) previous_stage_displayorder,
      lag(stage_label) over 
    (partition by deal_id 
     order by dbt_updated_at) previous_stage_label,

-- SCD1 "current" value attributes

      last_value(dealstage_ts) over 
       (partition by deal_id 
        order by dbt_updated_at 
        ROWS BETWEEN UNBOUNDED PRECEDING 
        AND UNBOUNDED FOLLOWING) as current_dealstage_ts,
      last_value(dealname) over (partition by deal_id 
        order by dbt_updated_at 
        ROWS BETWEEN UNBOUNDED PRECEDING 
        AND UNBOUNDED FOLLOWING) as current_dealname,
      last_value(dealstage_id) over (partition by deal_id 
        order by dbt_updated_at 
        ROWS BETWEEN UNBOUNDED PRECEDING 
        AND UNBOUNDED FOLLOWING) as current_dealstage_id,
      last_value(stage_label) over (partition by deal_id 
        order by dbt_updated_at 
        ROWS BETWEEN UNBOUNDED PRECEDING 
        AND UNBOUNDED FOLLOWING) as current_stage_label,

-- and "original" value attributes

      first_value(amount) over (partition by deal_id 
        order by dbt_updated_at 
        ROWS BETWEEN UNBOUNDED PRECEDING 
        AND UNBOUNDED FOLLOWING) as original_amount

    from joining

),

date_difference as (

    select *,
      timestamp_diff(dealstage_ts,previous_dealstage_ts, day) 
    as days_between_stage,
      timestamp_diff(current_timestamp(),original_createddate,day) 
    as days_since_deal_created,
      current_amount - original_amount 
    as amount_diff_since_deal_created
    from scd_deals

)

select * except (pipelineid,_sdc_batched_at) from date_difference

Right at the end then you’ll see a final CTE that uses the current and previous deal stage timestamps to calculate days between funnel stages for each deal and stage in that deal’s progress through the funnel, and a final metric that measures days since the deal was first recorded.

Now we can map just this single SCD6 table into our Looker model and have just the one view in our Looker explore for deals information, like this:


singletable.pngsingletable.png

And then using this new SCD6 table and Looker view we can create queries that bring together data on a deal now and across all time, like this:


current_and_historic_1.pngcurrent_and_historic_1.png

or we can compare the stage our deal is at now with its previous deal stage, like this:


current_and_previous.pngcurrent_and_previous.png

If one of our users just selects the regular dimension fields for a deal, the query uses our “current value” columns along with GROUP BY and SUM_DISTINCT to return a single row for each deal and the correct measure numbers, without the user having to specifically ask for the current version of each deal.


current_state.pngcurrent_state.png

But if they want to see the deal history, they just select the historic value columns alongside the regular deal details to see them listed out versioned courtesy of our dbt snapshot table.


historic.pnghistoric.png

For more details of what we can do for you with dbt, check the new dbt Solutions page on our website.