Rittman Analytics is now a UK Consulting Partner for dbt ("Data Build Tool")

We’re very pleased to announce that Rittman Analytics is now an official Consulting Partner for dbt, working with our clients, the community and the wider dbt ecosystem to get the most out of this open-source analytics framework and accompanying commercial dbtCloud service.

You can read more about the role of Consulting Partners on the new Ecosystem page on the getdbt.com website, and while you’re there, check out comments from real-world users of dbt including one of ours, an excerpt from our recent blog post on How Rittman Analytics does Analytics.

In that blog post we talked about how Rittman Analytics uses dbt as part of our modular “extract, load and transform” (ELT) approach when preparing data for use with Looker:

[Image: Rittman Analytics modular ELT architecture]

Since launching the company last year we’ve built up a fair bit of experience implementing dbt and dbtCloud on several client projects, in areas such as:

  • transforming raw data from event-tracking (Segment, mParticle), data pipeline (Stitch, Fivetran) and batch-loaded sources, covering SaaS, telco, web application and other source systems

  • using Snowflake Data Warehouse, Postgres and Google BigQuery as sources/targets

  • integrating with GitHub and AWS CodeCommit for local/remote branch-based git development

  • creating automated CI/CD dbt build-test pipelines using dbtCloud and AWS CodeBuild/CodePipeline

If you’re new to dbt or interested in how we use dbt and dbtCloud both internally and on client projects, check out our three recent blog posts on this topic, along with the conversations we’ve had in the past with Tristan Handy, CEO and co-founder of Fishtown Analytics (the primary sponsor of dbt), on our Drill to Detail Podcast.

Contact us now or email info@rittmananalytics.com if you’d like help or advice on your dbt implementation; we’re UK-based, with clients in London, northern Europe and the USA.

Continuous Integration and Automated Build Testing with dbtCloud

If you read my recent blog post on how Rittman Analytics built our operational analytics platform running on Looker, Stitch, Google BigQuery and dbt, you’ll know that we built out the underlying data infrastructure using a modular Extract, Load and Transform (ELT) design pattern, like this:

[Image: Rittman Analytics modular ELT architecture]

We version-control our dbt development environment using git and GitHub.com, and we do all new development beyond simple bug fixes on git feature branches, giving us a development process for dbt that looked like this:

[Image: original dbt feature-branch development process]

1. We’d start by cloning the dbt git repo’s master branch from github.com to the developer’s workstation, which would also have dbt installed locally along with the Google Cloud SDK so that they could connect to our development BigQuery dataset.

[Image: cloning the dbt repo to the developer’s workstation]
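
In practice that initial setup comes down to a handful of commands; a sketch, assuming the repo URL and matching the dbt version shown in the logs below:

# clone the project repo's master branch (URL assumed for illustration)
git clone https://github.com/rittmananalytics/ra_dw.git
cd ra_dw

# install dbt locally, and authenticate the Google Cloud SDK
# so dbt can reach the development BigQuery dataset
pip install dbt==0.13.1
gcloud auth application-default login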

2. They’d then create a new local git branch for the feature using the git CLI or a tool such as GitHub Desktop.

[Image: creating a new feature branch]
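
From the git CLI this is a single command (branch name hypothetical):

# create and switch to a new local feature branch
git checkout -b feature/harvest-models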

3. They’d develop the new feature locally using their install of dbt, committing changes to the feature branch in their local git repo after checking that all dbt tests ran successfully:

locals-imac:ra_dw markrittman$ dbt run --models harvest_time_entries harvest_invoices --target dev
Running with dbt=0.13.1
Found 50 models, 8 tests, 0 archives, 0 analyses, 109 macros, 0 operations, 0 seed files, 35 sources

20:53:11 | Concurrency: 1 threads (target='dev')
20:53:11 | 
20:53:11 | 1 of 2 START table model ra_data_warehouse_dbt_dev.harvest_invoices.. [RUN]
20:53:13 | 1 of 2 OK created table model ra_data_warehouse_dbt_dev.harvest_invoices [OK in 1.80s]
20:53:13 | 2 of 2 START table model ra_data_warehouse_dbt_dev.harvest_time_entries [RUN]
20:53:15 | 2 of 2 OK created table model ra_data_warehouse_dbt_dev.harvest_time_entries [OK in 1.70s]
20:53:15 | 
20:53:15 | Finished running 2 table models in 5.26s.

Completed successfully

Done. PASS=2 ERROR=0 SKIP=0 TOTAL=2
locals-imac:ra_dw markrittman$ dbt test --models harvest_time_entries harvest_invoices --target dev
Running with dbt=0.13.1
Found 50 models, 8 tests, 0 archives, 0 analyses, 109 macros, 0 operations, 0 seed files, 35 sources

20:53:37 | Concurrency: 1 threads (target='dev')
20:53:37 | 
20:53:37 | 1 of 4 START test not_null_harvest_invoices_id....................... [RUN]
20:53:39 | 1 of 4 PASS not_null_harvest_invoices_id............................. [PASS in 1.50s]
20:53:39 | 2 of 4 START test not_null_harvest_time_entries_id................... [RUN]
20:53:40 | 2 of 4 PASS not_null_harvest_time_entries_id......................... [PASS in 0.89s]
20:53:40 | 3 of 4 START test unique_harvest_invoices_id......................... [RUN]
20:53:41 | 3 of 4 PASS unique_harvest_invoices_id............................... [PASS in 1.08s]
20:53:41 | 4 of 4 START test unique_harvest_time_entries_id..................... [RUN]
20:53:42 | 4 of 4 PASS unique_harvest_time_entries_id........................... [PASS in 0.83s]
20:53:42 | 
20:53:42 | Finished running 4 tests in 5.27s.

Completed successfully

4. At this stage, all dbt transformations were being deployed only to our development BigQuery dataset.

5. When the feature was ready for deployment to our production BigQuery dataset, the developer would push the changes in their local branch to the remote git repo, creating that branch remotely if it didn’t already exist.

[Image: committing and pushing the feature branch]
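
The CLI equivalent would be along these lines (branch name hypothetical, continuing the earlier sketch):

# commit local changes and push the feature branch to github.com,
# creating the remote branch if it doesn't already exist
git add .
git commit -m "Add Harvest invoice and time entry models"
git push -u origin feature/harvest-models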

6. Then they’d create a pull request using GitHub Desktop and the GitHub web UI, summarising the changes and new features added by the development branch.

[Image: creating the pull request in GitHub]

7. I’d then review the PR, work out whether the changes were safe to merge into the master branch, and accept the pull request. Overnight, our dbtCloud service would then clone the repo’s master branch and attempt to deploy the new set of transformations to our production BigQuery dataset.

And sometimes, if that feature branch hadn’t been properly build-tested, the scheduled overnight deployment would fail.

[Image: failed overnight dbtCloud deployment]

So, having noticed the new Build on Pull Request feature that comes with paid versions of dbtCloud, we upgraded from the free version to the $100/month basic paid version and added an automated “continuous integration” build test to our feature-branch development process, which now looks like this:

[Image: revised development process with automated build testing]

To set this automated build test feature up, we first linked our GitHub account to dbtCloud and then created a new dbtCloud job that triggers when a user submits a pull request.

[Image: dbtCloud job triggered by pull requests]

8. Now, when I come to review the pull request for this feature branch, there’s an automated test added by dbtCloud that checks to see whether this new version of my dbt package deploys without errors.

[Image: automated dbtCloud check on the pull request]

9. Looking at the details behind this automated build test, I can see that dbtCloud creates a dataset and runs through a full package deployment and test cycle, avoiding any risk of breaking our production dbt environment should the test deployment fail.

[Image: dbtCloud test build details]

10. Checking back on the status of the test build in GitHub, I can see that the test completed with no issues, so I’m safe to merge the changes in this feature branch into our master branch, ready for dbtCloud to pick up these changes overnight and deploy them as scheduled into production.

[Image: passing build-test status in GitHub]

Docs on this recent dbtCloud feature are online here; you’ll need at least the basic $100/month paid version of dbtCloud for the feature to appear in the web UI. It also depends on GitHub as the git hosting service, so it’s not available as an option if you’re using GitLab, Bitbucket or AWS CodeCommit.

Customer Journey and Lifetime Value Analytics using Looker, Google BigQuery, Stitch and dbt

One of the most common analytics use cases I come across on client projects is understanding the “customer journey”: the series of stages prospects and customers typically go through, from becoming aware of a company’s products and services through consideration, conversion, up-sell and repurchase and eventually to churn.

Customers typically interact with multiple digital and offline channels over their overall lifetime, and if we take the model of a consulting services business such as Rittman Analytics, a typical customer journey and use of those touch-points might look like the diagram below.

[Image: typical customer journey and touchpoints for a consulting services business]

Each customer and prospect interaction with those touch-points sends signals about their buying intentions, product and service preferences and readiness to move through to the next stage in the purchase funnel.

[Image: customer journey signals through the purchase funnel]

Creating visualizations such as these in Looker typically involves two steps of data transformation, either through a set of Looker (persistent) derived tables or, as we’ve done for our own operational analytics platform, through two dbt (Data Build Tool) models sequenced together as a transformation graph.
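
That sequencing doesn’t need to be declared anywhere: whenever one dbt model selects from another via the ref() function, dbt orders the two correctly in its transformation graph. A minimal sketch (model names assumed for illustration):

-- models/customer_event_sequences.sql (name assumed)
-- selecting via ref() makes dbt build customer_events first
SELECT customer_id, event_type, event_ts
FROM {{ ref('customer_events') }}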

To start, we select a common set of columns from each digital touchpoint channel as brought in by our project data pipeline tool, Stitch: HubSpot for inbound and outbound marketing communications together with sales activity and deal closes/losses; Harvest for billable and non-billable client days; Segment for website visits; and so on. For each source we turn each type of customer activity it records into an event type, event details and a monetary value (revenue, cost etc.) that we can then use to measure the overall lifetime value of each client later on.

[Image: customer events dbt model]

Note the expressions at the start of the model definition that define a number of recency measures (days since last billable day, last contact day and so on), along with another set used for defining customer cohorts and bucketing activity into months and weeks since each customer’s first engagement date.

{{
    config(
        materialized='table',
        partition_by='DATE(event_ts)'
    )
}}
SELECT
    *,
    {{ dbt_utils.datediff('last_billable_day_ts', current_timestamp(), 'day')}} AS days_since_last_billable_day,
    {{ dbt_utils.datediff('last_incoming_email_ts', current_timestamp(), 'day')}} AS days_since_last_incoming_email,
    {{ dbt_utils.datediff('last_outgoing_email_ts', current_timestamp(), 'day')}} AS days_since_last_outgoing_email,
    {{ dbt_utils.datediff('last_site_visit_day_ts', current_timestamp(), 'month')}} AS months_since_last_site_visit_day,
    {{ dbt_utils.datediff('last_site_visit_day_ts', current_timestamp(), 'week')}} AS weeks_since_last_site_visit_day
FROM
  (SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_ts) AS event_seq,
      MIN(CASE WHEN event_type = 'Billable Day' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_billable_day_ts,
      MAX(CASE WHEN event_type = 'Billable Day' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_billable_day_ts,
      MIN(CASE WHEN event_type = 'Client Invoiced' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_invoice_day_ts,
      MAX(CASE WHEN event_type = 'Client Invoiced' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_invoice_day_ts,
      MAX(CASE WHEN event_type = 'Site Visited' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_site_visit_day_ts,
      MAX(CASE WHEN event_type = 'Incoming Email' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_incoming_email_ts,
      MAX(CASE WHEN event_type = 'Outgoing Email' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_outgoing_email_ts,
      MIN(event_ts)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_contact_ts,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type = 'Billable Day' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},MONTH) AS months_since_first_billable_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type = 'Billable Day' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},WEEK) AS weeks_since_first_billable_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type like '%Email%' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},MONTH) AS months_since_first_contact_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type like '%Email%' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},WEEK) AS weeks_since_first_contact_day,
      MAX(CASE WHEN event_type = 'Billable Day' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS billable_client,
      MAX(CASE WHEN event_type LIKE '%Sales%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS sales_prospect,
      MAX(CASE WHEN event_type LIKE '%Site Visited%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS site_visitor,
      MAX(CASE WHEN event_details LIKE '%Blog%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS blog_reader,
      MAX(CASE WHEN event_type LIKE '%Podcast%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS podcast_reader
  FROM
  -- sales opportunity stages
      (SELECT
            deals.lastmodifieddate AS event_ts,
          customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          deals.dealname AS event_details,
          deals.dealstage AS event_type,
          AVG(deals.amount) AS event_value
      FROM
          -- note: the elided ref() targets in this model are reconstructed from their table aliases; actual model names may differ
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('deals') }} AS deals
          ON customer_master.hubspot_company_id = deals.associatedcompanyids
      LEFT JOIN
          {{ ref('owners') }} AS owners
          ON deals.hubspot_owner_id = CAST(owners.ownerid AS STRING)
      WHERE
          deals.lastmodifieddate IS NOT null
      {{ dbt_utils.group_by(n=5) }}
      UNION ALL
  -- consulting days
      SELECT
            time_entries.spent_date AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          projects.name AS event_details,
          CASE WHEN time_entries.billable THEN 'Billable Day' ELSE 'Non-Billable Day' END AS event_type,
          time_entries.hours * time_entries.billable_rate AS event_value
      FROM
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('projects') }} AS projects
          ON customer_master.harvest_customer_id = projects.client_id
      LEFT JOIN
          {{ ref('time_entries') }} AS time_entries
          ON time_entries.project_id = projects.id
      WHERE
          time_entries.spent_date IS NOT null
      -- reconstructed: event_value isn't aggregated here, so group by all six columns
      {{ dbt_utils.group_by(n=6) }}
  UNION ALL
  -- incoming and outgoing emails
      SELECT
            communications.communication_timestamp AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          communications.communications_subject AS event_details,
          CASE WHEN communications.communication_type = 'INCOMING_EMAIL' THEN 'Incoming Email'
               WHEN communications.communication_type = 'EMAIL' THEN 'Outgoing Email'
               ELSE communications.communications_subject
          END AS event_type,
          1 AS event_value
      FROM
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('communications') }} AS communications
          ON customer_master.hubspot_company_id = communications.hubspot_company_id
      WHERE
          communications.communication_timestamp IS NOT null
      {{ dbt_utils.group_by(n=5) }}
  UNION ALL
  -- client invoices
      SELECT
            invoices.issue_date AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          invoices.subject AS event_details,
          'Client Invoiced' AS event_type,
          SUM(invoices.amount) AS event_value
      FROM
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('invoices') }} AS invoices
          ON customer_master.harvest_customer_id = invoices.client_id
      WHERE
          invoices.issue_date IS NOT null
      {{ dbt_utils.group_by(n=5) }}
  UNION ALL
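  -- client credit notes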
      SELECT
         invoices.issue_date AS event_ts,
         customer_master.customer_id AS customer_id,
         customer_master.customer_name AS customer_name,
         invoice_line_items.description AS event_details,
         'Client Credited' AS event_type,
          COALESCE(SUM(invoice_line_items.amount ), 0) AS event_value
      FROM
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('invoices') }} AS invoices
          ON customer_master.harvest_customer_id = invoices.client_id
      LEFT JOIN
          {{ ref('invoice_line_items') }} AS invoice_line_items
          ON invoices.id = invoice_line_items.invoice_id
      {{ dbt_utils.group_by(n=5) }}
      HAVING
         (COALESCE(SUM(invoice_line_items.amount ), 0) < 0)
  UNION ALL
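  -- website page views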
      SELECT 
              pageviews.timestamp AS event_ts,
              customer_master.customer_id  AS customer_id,
              customer_master.customer_name AS customer_name,
              pageviews.page_subcategory AS event_details,
              'Site Visited' AS event_type,
              sum(1) as event_value
      FROM 
          {{ ref('customer_master') }}  AS customer_master
     LEFT JOIN 
          {{ ref('pageviews') }} AS pageviews 
          ON customer_master.customer_name = pageviews.network
      {{ dbt_utils.group_by(n=5) }}
  UNION ALL
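  -- invoice payments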
      SELECT
          *
      FROM
          (SELECT
              invoices.paid_at AS event_ts,
          customer_master.customer_id AS customer_id,
              customer_master.customer_name AS customer_name,
              invoices.subject AS event_details,
              CASE WHEN invoices.paid_at <= invoices.due_date THEN 'Client Paid' ELSE 'Client Paid Late' END AS event_type,
              SUM(invoices.amount) AS event_value
          FROM
              {{ ref('customer_master') }} AS customer_master
          LEFT JOIN
              {{ ref('invoices') }} AS invoices
              ON customer_master.harvest_customer_id = invoices.client_id
          WHERE
            invoices.paid_at IS NOT null
          {{ dbt_utils.group_by(n=5) }}
          )
      )
  WHERE
      customer_name NOT IN ('Rittman Analytics', 'MJR Analytics')
  )
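
One thing the post doesn’t show is the customer_window_over() macro these expressions rely on. A minimal sketch of what such a macro might look like, assuming it simply wraps a per-customer window clause (the project’s actual implementation may differ):

{% macro customer_window_over(partition_col, order_col, direction) %}
    OVER (PARTITION BY {{ partition_col }} ORDER BY {{ order_col }} {{ direction }})
{% endmacro %}

With the default window frame this yields running per-customer values, which is also what the sessionization logic in the second model later in this post relies on when it sums is_new_session into user_session_id.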

If we deploy this model and then query the resulting table, filtering on a single customer and ordering rows by timestamp, we can see the history of all of that customer’s interactions with our sales and delivery channels.

[Image: a single customer’s event history]
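
A query along these lines produces that view (the deployed table and customer name here are hypothetical):

SELECT event_ts, event_type, event_details, event_value
FROM ra_data_warehouse.customer_events -- deployed model table (name assumed)
WHERE customer_name = 'Example Client' -- hypothetical customer
ORDER BY event_ts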

You can also plot all of these activity types in a combined area and bar chart, for example, to show billable days growing from zero to a level maintained over several months, alongside events such as the client being invoiced, incoming and outgoing communications, and the client paying in full at the end.

[Image: combined area and bar chart of customer activity]

We can also restrict the event types reported on to just billable days, then use Looker’s dimension group and timeframes feature to split the last set of weeks into days and week numbers, showing client utilisation over that period.

[Image: client utilisation by day and week]

Of course we can also take those monetary values assigned to events and use them to calculate the overall revenue contribution for this customer, factoring in costs incurred around preparing sales proposals, answering support questions and investing non-billable days in getting a project over the line.

[Image: overall customer revenue contribution]

Our second dbt model takes this event data and pivots each customer’s events into a series of separate columns, one for the first event in sequence, another for the second, another for the third and so on.

[Image: customer events pivoted into sequence columns]

Again, code for this dbt model is available in our project git repo.

{{
    config(
        materialized='table'
    )
}}
WITH event_type_seq_final AS (
    SELECT
        customer_id,
        user_session_id AS event_type_seq,
        event_type,
        event_ts,
        is_new_session
    FROM
        (SELECT
            customer_id,
            event_type,
            last_event,
            event_ts,
            is_new_session,
            SUM(is_new_session) OVER (ORDER BY customer_id ASC, event_ts ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
            SUM(is_new_session) {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS user_session_id
        FROM
            (SELECT
                *,
                CASE WHEN event_type != LAG(event_type, 1) OVER (PARTITION BY customer_id ORDER BY event_ts ASC) OR last_event IS NULL THEN 1
                ELSE 0
                END AS is_new_session
            FROM
                (SELECT
                    customer_id,
                    customer_name,
                    CASE WHEN event_type LIKE '%Email%' THEN 'Presales'
                    WHEN event_type LIKE '%Bill%' THEN 'Delivery' ELSE event_type END AS event_type,
                    event_ts,
                    LAG(CASE WHEN event_type LIKE '%Email%' THEN 'Presales' WHEN event_type LIKE '%Bill%' THEN 'Delivery' ELSE event_type END, 1) OVER (PARTITION BY customer_id ORDER BY event_ts ASC) AS last_event
                FROM
                    {{ ref('customer_events') }} -- the events model built earlier (name assumed)
                ORDER BY
                    customer_id ASC,
                    event_ts ASC,
                    event_type ASC
                ) last
             ORDER BY
                customer_id ASC,
                event_ts ASC,
                event_type ASC
           ) final
        ORDER BY
            customer_id ASC,
            is_new_session DESC,
            event_ts ASC
        )
    WHERE
        is_new_session = 1
    {  }
    ORDER BY
        customer_id,
        user_session_id,
        event_ts
)
SELECT
    customer_id,
    MAX(event_type_1) AS event_type_1,
    MAX(event_type_2) AS event_type_2,
    MAX(event_type_3) AS event_type_3,
    MAX(event_type_4) AS event_type_4,
    MAX(event_type_5) AS event_type_5,
    MAX(event_type_6) AS event_type_6,
    MAX(event_type_7) AS event_type_7,
    MAX(event_type_8) AS event_type_8,
    MAX(event_type_9) AS event_type_9,
    MAX(event_type_10) AS event_type_10
FROM
    (SELECT
        customer_id,
        CASE WHEN event_type_seq = 1 THEN event_type END AS event_type_1,
        CASE WHEN event_type_seq = 2 THEN event_type END AS event_type_2,
        CASE WHEN event_type_seq = 3 THEN event_type END AS event_type_3,
        CASE WHEN event_type_seq = 4 THEN event_type END AS event_type_4,
        CASE WHEN event_type_seq = 5 THEN event_type END AS event_type_5,
        CASE WHEN event_type_seq = 6 THEN event_type END AS event_type_6,
        CASE WHEN event_type_seq = 7 THEN event_type END AS event_type_7,
        CASE WHEN event_type_seq = 8 THEN event_type END AS event_type_8,
        CASE WHEN event_type_seq = 9 THEN event_type END AS event_type_9,
        CASE WHEN event_type_seq = 10 THEN event_type END AS event_type_10
    FROM
        event_type_seq_final
    )
{{ dbt_utils.group_by(n=1) }}

Pivoting each customer’s events in this way then makes it possible to visualize the first n events for a given set of customers using Looker’s Sankey custom visualization, as shown in the final screenshot below.

[Image: Sankey visualization of customer event sequences]
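
Behind that visualization is a simple aggregate over the pivoted table, counting customers along each observed path of first, second and third events; a sketch (table name assumed):

SELECT event_type_1, event_type_2, event_type_3, COUNT(*) AS customers
FROM ra_data_warehouse.customer_event_sequences -- pivoted model (name assumed)
GROUP BY 1, 2, 3
ORDER BY customers DESC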

More details on our own internal use of Looker, Stitch, dbt and Google BigQuery to create a modern operational analytics platform can be found in this earlier blog post, and if you’re interested in how we might help you get started with Looker, check out our Services page.