Rittman Analytics is now a Segment Certified Implementation Partner

In our recent blog posts on How Rittman Analytics does Analytics: Modern BI Stack Operational Analytics and Customer Journey and Lifetime Value Analytics we talked about using services from our partners at Stitch, Fivetran and Segment to connect and integrate data ready for analysis with Looker.

So far we’ve worked with joint Segment/Looker/Rittman Analytics customers such as Let’s Do This to build customer and product analytics platforms on top of Segment’s rich, event-level behavioural datasets. Now we’re pleased to announce that Rittman Analytics is a Certified Implementation Partner for Segment, giving us the ability to implement Segment Connections along with Segment Personas (Customer Data Platform) and Segment Protocols (Data Quality) on client analytics and wider digital marketing / digital transformation projects.

segpartner.png

Look out for more information on how we’re using Segment technology together with modern, modular cloud analytics services from Looker, Stitch, Fivetran, Snowflake and Google Cloud Platform over the coming weeks, and if we can help with any of your Segment implementation questions just get in touch at info@rittmananalytics.com.

Continuous Integration and Automated Build Testing with dbtCloud

If you read my recent blog post on how Rittman Analytics built our operational analytics platform running on Looker, Stitch, Google BigQuery and dbt, you’ll know that we built out the underlying data infrastructure using a modular Extract, Load and Transform (ELT) design pattern, like this:

ra_analytics_architecture (1).png

We version-control our dbt development environment using git and Github.com, and do all new development beyond simple bug fixes as git feature branches, giving us a development process for dbt that looked like this:

original2.png

1. We’d start by cloning the master branch of our dbt git repo from github.com to the developer’s workstation, which would also have dbt installed locally along with the Google Cloud SDK so that they could connect to our development BigQuery dataset.

clone.png

2. Create a new, local git branch for the new feature using the git CLI or a tool such as Github Desktop.

newnrahc.png

3. Develop the new feature locally using the developer’s install of dbt, committing any changes to the feature branch in that developer’s local git repo after checking that all dbt tests have run successfully.

locals-imac:ra_dw markrittman$ dbt run --models harvest_time_entries harvest_invoices --target dev
Running with dbt=0.13.1
Found 50 models, 8 tests, 0 archives, 0 analyses, 109 macros, 0 operations, 0 seed files, 35 sources

20:53:11 | Concurrency: 1 threads (target='dev')
20:53:11 | 
20:53:11 | 1 of 2 START table model ra_data_warehouse_dbt_dev.harvest_invoices.. [RUN]
20:53:13 | 1 of 2 OK created table model ra_data_warehouse_dbt_dev.harvest_invoices [OK in 1.80s]
20:53:13 | 2 of 2 START table model ra_data_warehouse_dbt_dev.harvest_time_entries [RUN]
20:53:15 | 2 of 2 OK created table model ra_data_warehouse_dbt_dev.harvest_time_entries [OK in 1.70s]
20:53:15 | 
20:53:15 | Finished running 2 table models in 5.26s.

Completed successfully

Done. PASS=2 ERROR=0 SKIP=0 TOTAL=2
locals-imac:ra_dw markrittman$ dbt test --models harvest_time_entries harvest_invoices --target dev
Running with dbt=0.13.1
Found 50 models, 8 tests, 0 archives, 0 analyses, 109 macros, 0 operations, 0 seed files, 35 sources

20:53:37 | Concurrency: 1 threads (target='dev')
20:53:37 | 
20:53:37 | 1 of 4 START test not_null_harvest_invoices_id....................... [RUN]
20:53:39 | 1 of 4 PASS not_null_harvest_invoices_id............................. [PASS in 1.50s]
20:53:39 | 2 of 4 START test not_null_harvest_time_entries_id................... [RUN]
20:53:40 | 2 of 4 PASS not_null_harvest_time_entries_id......................... [PASS in 0.89s]
20:53:40 | 3 of 4 START test unique_harvest_invoices_id......................... [RUN]
20:53:41 | 3 of 4 PASS unique_harvest_invoices_id............................... [PASS in 1.08s]
20:53:41 | 4 of 4 START test unique_harvest_time_entries_id..................... [RUN]
20:53:42 | 4 of 4 PASS unique_harvest_time_entries_id........................... [PASS in 0.83s]
20:53:42 | 
20:53:42 | Finished running 4 tests in 5.27s.

Completed successfully

4. All dbt transformations at this stage are being deployed to our development BigQuery dataset.

5. When the feature was ready for deployment to our production BigQuery dataset, the developer would push the changes in their local branch to the remote git repo, creating that branch if it didn’t already exist.

commit.png

6. Then they’d create a pull request using Github Desktop and the Github web UI, summarising the changes and new features added by the development branch.

create_pr.png

7. I’d then review the PR, check whether the changes were safe to merge into the master branch and accept the pull request; overnight, our dbtCloud service would then clone the master branch of that git repo and attempt to deploy the new set of transformations to our production BigQuery dataset.

And sometimes, if that feature branch hadn’t been properly build-tested, the scheduled overnight deployment would fail.

fail.png

So having noticed the new Build on Pull Request feature that comes with paid versions of dbtCloud, we’ve upgraded from the free version to the $100/month basic paid version and added an automated “continuous integration” build test to our feature branch development process so that it now looks like this:

DIAG2.png

To set this automated build test feature up, we first linked our Github account to dbtCloud and then created a new dbtCloud job that triggers when a user submits a pull request.

buildtest.png

8. Now, when I come to review the pull request for this feature branch, there’s an automated test added by dbtCloud that checks to see whether this new version of my dbt package deploys without errors.

buildstep.png

9. Looking at the details behind this automated build test, I can see that dbtCloud has created a separate dataset and run through a full package deployment and test cycle, avoiding any risk of breaking our production dbt environment should that test deployment fail.

pass.png

10. Checking back on the status of the test build in Github, I can see that this test completed with no issues, and I’m therefore safe to merge the changes in this feature branch into our development master branch, ready for dbtCloud to pick up these changes overnight and deploy them as scheduled into production.

testpass.png

Docs on this recent dbtCloud feature are online here, and you’ll need at least the basic $100/month paid version of dbtCloud for this feature to become available in the web UI. It’s also dependent on Github as the git hosting service, so not available as an option if you’re using Gitlab, BitBucket or CodeCommit.

Drill to Detail Ep.66 'ETL, Incorta and the Death of the Star Schema' with Special Guest Matthew Halliday

Mark Rittman is joined by Matthew Halliday from Incorta in this latest episode of the Drill to Detail Podcast to talk about the challenge of ETL and analytics on complex relational OLTP data models, previous attempts to solve these problems with products such as Oracle Essbase and Oracle E-Business Suite Extensions for Oracle Endeca and how those experiences, and others, led to his current role as co-founder and VP of Products at Incorta.

Customer Journey and Lifetime Value Analytics using Looker, Google BigQuery, Stitch and dbt

One of the most common analytics use-cases I come across on client projects is to understand the “customer journey”: the series of stages prospects and customers typically go through, from becoming aware of a company’s products and services through to consideration, conversion, up-sell and repurchase and, eventually, churn.

Customers typically interact with multiple digital and offline channels over their overall lifetime, and if we take the model of a consulting services business such as Rittman Analytics, a typical customer journey and use of those touch-points might look like the diagram below.

1_hMO8fEmwA6A3McyUfhWMZA.png

Each customer and prospect interaction with those touch-points sends signals about their buying intentions, product and service preferences and readiness to move through to the next stage in the purchase funnel.

1_V36-h1V-t7MDWZBG-Eqoaw.png

Creating visualizations such as these in Looker typically involves two steps of data transformation, either through a set of Looker (persistent) derived tables or, as we’ve done for our own operational analytics platform, through two dbt (Data Build Tool) models sequenced together as a transformation graph.

To start, we select a common set of columns from each digital touchpoint channel as brought in by our project data pipeline tool, Stitch: Hubspot for inbound and outbound marketing communications together with sales activity and deal closes/losses; Harvest for billable and non-billable client days; Segment for website visits and so on. For each source we turn each type of customer activity it records into an event type, event details and a monetary value (revenue, cost etc.) that we can then use to measure the overall lifetime value of each client later on.

1_e_DGbbE9s3R9i7GCfyY3wg.png

Note the expressions at the start of the model definition that define a number of recency measures (days since last billable day, last contact day and so on), along with another set that are used for defining customer cohorts and bucketing activity into months and weeks since the first engagement date.

{{
    config(
        materialized='table',
        partition_by='DATE(event_ts)'
    )
}}
SELECT
    *,
    {{ dbt_utils.datediff('last_billable_day_ts', current_timestamp(), 'day')}} AS days_since_last_billable_day,
    {{ dbt_utils.datediff('last_incoming_email_ts', current_timestamp(), 'day')}} AS days_since_last_incoming_email,
    {{ dbt_utils.datediff('last_outgoing_email_ts', current_timestamp(), 'day')}} AS days_since_last_outgoing_email,
    {{ dbt_utils.datediff('last_site_visit_day_ts', current_timestamp(), 'month')}} AS months_since_last_site_visit_day,
    {{ dbt_utils.datediff('last_site_visit_day_ts', current_timestamp(), 'week')}} AS weeks_since_last_site_visit_day
FROM
  (SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_ts) AS event_seq,
      MIN(CASE WHEN event_type = 'Billable Day' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_billable_day_ts,
      MAX(CASE WHEN event_type = 'Billable Day' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_billable_day_ts,
      MIN(CASE WHEN event_type = 'Client Invoiced' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_invoice_day_ts,
      MAX(CASE WHEN event_type = 'Client Invoiced' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_invoice_day_ts,
      MAX(CASE WHEN event_type = 'Site Visited' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_site_visit_day_ts,
      MAX(CASE WHEN event_type = 'Incoming Email' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_incoming_email_ts,
      MAX(CASE WHEN event_type = 'Outgoing Email' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_outgoing_email_ts,
      MIN(event_ts)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_contact_ts,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type = 'Billable Day' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},MONTH) AS months_since_first_billable_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type = 'Billable Day' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},WEEK) AS weeks_since_first_billable_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type like '%Email%' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},MONTH) AS months_since_first_contact_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type like '%Email%' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},WEEK) AS weeks_since_first_contact_day,
      MAX(CASE WHEN event_type = 'Billable Day' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS billable_client,
      MAX(CASE WHEN event_type LIKE '%Sales%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS sales_prospect,
      MAX(CASE WHEN event_type LIKE '%Site Visited%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS site_visitor,
      MAX(CASE WHEN event_details LIKE '%Blog%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS blog_reader,
      MAX(CASE WHEN event_type LIKE '%Podcast%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS podcast_reader
  FROM
  -- sales opportunity stages
      (SELECT
            deals.lastmodifieddate AS event_ts,
          customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          deals.dealname AS event_details,
          deals.dealstage AS event_type,
          AVG(deals.amount) AS event_value
      FROM
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('deals') }} AS deals
          ON customer_master.hubspot_company_id = deals.associatedcompanyids
      LEFT JOIN
          {{ ref('hubspot_owners') }} AS owners
          ON deals.hubspot_owner_id = CAST(owners.ownerid AS STRING)
      WHERE
          deals.lastmodifieddate IS NOT null
      {{ dbt_utils.group_by(n=5) }}
      UNION ALL
  -- consulting days
      SELECT
            time_entries.spent_date AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          projects.name AS event_details,
          CASE WHEN time_entries.billable THEN 'Billable Day' ELSE 'Non-Billable Day' END AS event_type,
          time_entries.hours * time_entries.billable_rate AS event_value
      FROM
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('harvest_projects') }} AS projects
          ON customer_master.harvest_customer_id = projects.client_id
      LEFT JOIN
          {{ ref('harvest_time_entries') }} AS time_entries
          ON time_entries.project_id = projects.id
      WHERE
          time_entries.spent_date IS NOT null
      {{ dbt_utils.group_by(n=6) }}
  UNION ALL
  -- incoming and outgoing emails
      SELECT
            communications.communication_timestamp AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          communications.communications_subject AS event_details,
          CASE WHEN communications.communication_type = 'INCOMING_EMAIL' THEN 'Incoming Email'
               WHEN communications.communication_type = 'EMAIL' THEN 'Outgoing Email'
               ELSE communications.communications_subject
          END AS event_type,
          1 AS event_value
      FROM
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('communications') }} AS communications
          ON customer_master.hubspot_company_id = communications.hubspot_company_id
      WHERE
          communications.communication_timestamp IS NOT null
      {{ dbt_utils.group_by(n=5) }}
  UNION ALL
  -- client invoices
      SELECT
            invoices.issue_date AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          invoices.subject AS event_details,
          'Client Invoiced' AS event_type,
          SUM(invoices.amount) AS event_value
      FROM
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('harvest_invoices') }} AS invoices
          ON customer_master.harvest_customer_id = invoices.client_id
      WHERE
          invoices.issue_date IS NOT null
      {{ dbt_utils.group_by(n=5) }}
  UNION ALL
  -- client credit notes
      SELECT
         invoices.issue_date AS event_ts,
         customer_master.customer_id AS customer_id,
         customer_master.customer_name AS customer_name,
         invoice_line_items.description AS event_details,
         'Client Credited' AS event_type,
          COALESCE(SUM(invoice_line_items.amount ), 0) AS event_value
      FROM
          {{ ref('customer_master') }} AS customer_master
      LEFT JOIN
          {{ ref('harvest_invoices') }} AS invoices
          ON customer_master.harvest_customer_id = invoices.client_id
      LEFT JOIN
          {{ ref('harvest_invoice_line_items') }} AS invoice_line_items
          ON invoices.id = invoice_line_items.invoice_id
      {{ dbt_utils.group_by(n=5) }}
      HAVING
         (COALESCE(SUM(invoice_line_items.amount ), 0) < 0)
  UNION ALL
  -- website page visits
      SELECT 
              pageviews.timestamp AS event_ts,
              customer_master.customer_id  AS customer_id,
              customer_master.customer_name AS customer_name,
              pageviews.page_subcategory AS event_details,
              'Site Visited' AS event_type,
              sum(1) as event_value
      FROM 
          {{ ref('customer_master') }}  AS customer_master
     LEFT JOIN 
          {{ ref('pageviews') }} AS pageviews 
          ON customer_master.customer_name = pageviews.network
      {{ dbt_utils.group_by(n=5) }}
  UNION ALL
  -- invoice payments
      SELECT
          *
      FROM
          (SELECT
              invoices.paid_at AS event_ts,
          customer_master.customer_id AS customer_id,
              customer_master.customer_name AS customer_name,
              invoices.subject AS event_details,
              CASE WHEN invoices.paid_at <= invoices.due_date THEN 'Client Paid' ELSE 'Client Paid Late' END AS event_type,
              SUM(invoices.amount) AS event_value
          FROM
              {{ ref('customer_master') }} AS customer_master
          LEFT JOIN {{ ref('harvest_invoices') }}  AS invoices
              ON customer_master.harvest_customer_id = invoices.client_id
          WHERE
            invoices.paid_at IS NOT null
          {{ dbt_utils.group_by(n=5) }}
          )
      )
  WHERE
      customer_name NOT IN ('Rittman Analytics', 'MJR Analytics')
  )
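
The customer_window_over() references in this model call a small custom Jinja macro in our dbt package that renders the repeated window clause once rather than having it typed out on every line. The macro itself isn’t reproduced in this post, but a minimal sketch of how such a macro could look — with parameter names assumed from the way it’s called above, and a frame clause that may differ from our real implementation — would be something like this:

{# Hypothetical sketch of the customer_window_over() macro referenced above:
   it renders a window clause partitioned by customer, ordered by event
   timestamp and running up to the current row. #}
{% macro customer_window_over(partition_col, order_col, direction) %}
  OVER (PARTITION BY {{ partition_col }}
        ORDER BY {{ order_col }} {{ direction }}
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
{% endmacro %}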

Deploying this model and then querying the resulting table, filtering on just a single customer and ordering rows by timestamp, you can see the history of all interactions by that customer with our sales and delivery channels.

1_KLvPEmYDL8kfH-u4brfj-g.png
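
For reference, the view in that screenshot comes from nothing more elaborate than a filter and sort over the deployed table; a query along these lines (the dataset name matches our dev target from earlier, and the customer name is a placeholder) is all that’s needed:

-- Show every recorded touchpoint for a single customer in timestamp order;
-- the customer name below is a placeholder, not a real client
SELECT
    event_ts,
    event_type,
    event_details,
    event_value
FROM
    ra_data_warehouse_dbt_dev.customer_events
WHERE
    customer_name = 'Example Client Ltd'
ORDER BY
    event_ts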

You can also plot all of these activity types into a combined area and bar chart, for example to show billable days initially growing from zero to a level that we maintain over several months, with events such as the client being invoiced, incoming and outgoing communications and the client actually paying in full at the end.

1_CpYJkIbFnWL2PO7Rc5RtNQ.png

We can also restrict the event type reported on to billable days, and then use Looker’s dimension group and timespans feature to split the last set of weeks into days and week numbers to show client utilisation over that period.

1_oCOLcQ-CJlq6ogPxO9jQRg.png

Of course we can also take those monetary values assigned to events and use them to calculate the overall revenue contribution for this customer, factoring in costs incurred around preparing sales proposals, answering support questions and investing non-billable days in getting a project over the line.

1_mHGgPI5mHPs_repqB3Nj1Q.png
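
Because every event in the model carries a monetary value, a first-cut revenue-contribution figure for a customer is just an aggregation over that same table; a sketch of such a query (again using a placeholder customer name, and only a subset of the event types) might look like this:

-- Rough revenue-contribution summary for one customer: invoiced amounts,
-- credits issued and the count of non-billable days invested in the account
SELECT
    customer_name,
    SUM(CASE WHEN event_type = 'Client Invoiced' THEN event_value ELSE 0 END) AS total_invoiced,
    SUM(CASE WHEN event_type = 'Client Credited' THEN event_value ELSE 0 END) AS total_credited,
    COUNT(CASE WHEN event_type = 'Non-Billable Day' THEN 1 END) AS non_billable_days
FROM
    ra_data_warehouse_dbt_dev.customer_events
WHERE
    customer_name = 'Example Client Ltd'
GROUP BY
    customer_name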

Our second dbt model takes this event data and pivots each customer’s events into a series of separate columns, one for the first event in sequence, another for the second, another for the third and so on.

1_2ASOQfs1GYgHM3F36ozjRw.png

Again, code for this dbt model is available in our project git repo.

{{
    config(
        materialized='table'
    )
}}
WITH event_type_seq_final AS (
    SELECT
        customer_id,
        user_session_id AS event_type_seq,
        event_type,
        event_ts,
        is_new_session
    FROM
        (SELECT
            customer_id,
            event_type,
            last_event,
            event_ts,
            is_new_session,
            SUM(is_new_session) OVER (ORDER BY customer_id ASC, event_ts ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
            SUM(is_new_session) {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS user_session_id
        FROM
            (SELECT
                *,
                CASE WHEN event_type != LAG(event_type, 1) OVER (PARTITION BY customer_id ORDER BY event_ts ASC) OR last_event IS NULL THEN 1
                ELSE 0
                END AS is_new_session
            FROM
                (SELECT
                    customer_id,
                    customer_name,
                    CASE WHEN event_type LIKE '%Email%' THEN 'Presales'
                    WHEN event_type LIKE '%Bill%' THEN 'Delivery' ELSE event_type END AS event_type,
                    event_ts,
                    LAG(CASE WHEN event_type LIKE '%Email%' THEN 'Presales' WHEN event_type LIKE '%Bill%' THEN 'Delivery' ELSE event_type END, 1) OVER (PARTITION BY customer_id ORDER BY event_ts ASC) AS last_event
                FROM
                    {{ ref('customer_events') }}
                ORDER BY
                    customer_id ASC,
                    event_ts ASC,
                    event_type ASC
                ) last
             ORDER BY
                customer_id ASC,
                event_ts ASC,
                event_type ASC
           ) final
        ORDER BY
            customer_id ASC,
            is_new_session DESC,
            event_ts ASC
        )
    WHERE
        is_new_session = 1
    {{ dbt_utils.group_by(n=5) }}
    ORDER BY
        customer_id,
        user_session_id,
        event_ts
)
SELECT
    customer_id,
    MAX(event_type_1) AS event_type_1,
    MAX(event_type_2) AS event_type_2,
    MAX(event_type_3) AS event_type_3,
    MAX(event_type_4) AS event_type_4,
    MAX(event_type_5) AS event_type_5,
    MAX(event_type_6) AS event_type_6,
    MAX(event_type_7) AS event_type_7,
    MAX(event_type_8) AS event_type_8,
    MAX(event_type_9) AS event_type_9,
    MAX(event_type_10) AS event_type_10
FROM
    (SELECT
        customer_id,
        CASE WHEN event_type_seq = 1 THEN event_type END AS event_type_1,
        CASE WHEN event_type_seq = 2 THEN event_type END AS event_type_2,
        CASE WHEN event_type_seq = 3 THEN event_type END AS event_type_3,
        CASE WHEN event_type_seq = 4 THEN event_type END AS event_type_4,
        CASE WHEN event_type_seq = 5 THEN event_type END AS event_type_5,
        CASE WHEN event_type_seq = 6 THEN event_type END AS event_type_6,
        CASE WHEN event_type_seq = 7 THEN event_type END AS event_type_7,
        CASE WHEN event_type_seq = 8 THEN event_type END AS event_type_8,
        CASE WHEN event_type_seq = 9 THEN event_type END AS event_type_9,
        CASE WHEN event_type_seq = 10 THEN event_type END AS event_type_10
    FROM
        event_type_seq_final
    )
{{ dbt_utils.group_by(n=1) }}

Pivoting each customer’s events in this way then makes it possible to visualize the first n events for a given set of customers using Looker’s Sankey custom visualization, as shown in the final screenshot below.

Untitled 6.png
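
The explore behind that Sankey chart is essentially counting customers by their ordered combination of first events, something you could approximate directly against the pivoted table with a query such as the one below (the table name here is an assumption; substitute whatever your pivoted model is called):

-- Count customers by their first three event types to feed a Sankey-style
-- path visualization; customer_journey is an assumed name for the pivoted model
SELECT
    event_type_1,
    event_type_2,
    event_type_3,
    COUNT(*) AS customers
FROM
    ra_data_warehouse_dbt_dev.customer_journey
GROUP BY
    event_type_1,
    event_type_2,
    event_type_3
ORDER BY
    customers DESC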

More details on our own internal use of Looker, Stitch, dbt and Google BigQuery to create a modern operational analytics platform can be found in this earlier blog post, and if you’re interested in how we might help you get started with Looker, check out our Services page.

How Rittman Analytics does Analytics: Modern BI Stack Operational Analytics using Looker, Stitch, dbt and Google BigQuery

As well as creating modern BI stack analytics solutions for clients such as Florence, Let’s Do This and Colourpop using technologies from our partners at Looker, Stitch, Segment, Fivetran and Snowflake, it’s probably no surprise that we’ve used those same technologies and project experience to build out our own internal analytics platform, covering data and operational analytics use-cases such as:

  • Showing high-level KPIs that track progress towards our four main business objectives

  • Providing actionable context for these high-level KPIs and the ability to drill-into the detail behind them

  • Enabling ad-hoc querying and analysis of our complete dataset by technical and non-technical users

  • Providing a trusted, integrated analysis-ready data platform for other internal projects

overview.png

I thought it might be of interest to go through how we’ve applied the tools and techniques we use for client projects to deliver an analytics solution for our own internal use, sharing some of our thinking around how we defined our underlying KPI framework, the tools and design patterns we used when building out the data management part of the platform, and some of the front-end components and data visualizations we’ve put together using the Looker BI tool.

Defining your KPI Framework

When creating any sort of analytics solution, the first things you need to understand are: what are your business objectives, what do you need to measure to understand your progress towards those objectives, and how do you then define success?

Key Performance Indicators (KPIs) and measures within our analytics platform are based around what's needed to measure our progress towards four business objectives we set ourselves for FY2019:

  1. Increase Billing Revenue

  2. Increase Profitability

  3. Increase customer retention, and

  4. Increase our operational and delivery efficiency

In practice, we “certify” data around these four primary KPIs and also a number of secondary measures, with data in the platform normally up-to-date as of the previous business day and with historical data going back to when we started the business.

kpi.png

Other metrics, measures and dimension attributes are provided alongside these certified KPIs and measures and are considered “best endeavours” in terms of our certifying their accuracy; users are then of course able to create their own calculations for use in reports and analyses outside of this formal curation.

Platform Architecture

At a high-level our internal analytics platform extracts data on a regular basis from the SaaS-based services we use to run our business operations, joins that data together so that each dataset uses the same definition of clients, products and other reference data, then makes that combined dataset available for analysis as a set of business KPIs and an ad-hoc query environment.

ra_analytics_architecture.png

The design of our internal analytics platform was based around a modular BI technology stack, an ELT (“Extract, Load and Transform”) data warehouse loading pattern and the modern software development approach that we use on all of our client analytics development engagements.

Data Pipeline and Source Connectivity

We’re digital-first, paperless and run our business operations using cloud-hosted software services such as:

  • Harvest for timesheets, project invoicing and client-rechargeable expenses

  • Harvest Forecast for project resource planning and revenue forecasts

  • Xero for accounting and payroll

  • Hubspot CRM for sales and contact management

  • Docusign for contract management

  • Jira for project management and issue tracking

  • BambooHR for human resources and recruitment

On the delivery side, Rittman Analytics partners with four different data pipeline-as-a-Service software vendors, each of which targets a different segment of the market and that segment’s particular use-cases:

  • Stitch, who focus on small to mid-market startup customers with data engineers who value service extensibility and their land-and-expand pricing model ($100/month for up to five sources, including Hubspot and Xero certified and Harvest community-supported connectors)

  • Fivetran, aimed more at mid-market to enterprise customers and the data analyst personas, with a more turnkey solution and increased levels of service and coverage of enterprise data sources

  • Segment, more of an "enterprise service bus" for event-level digital data streams, connecting multiple event producers with downstream event consumers and offering additional features around customer journey mapping, personas and schema validation

  • Supermetrics, aimed at growth hackers and digital marketers with a focus on simple self-service setup, comprehensive data dictionaries and an extensive catalog of advertising, marketing and social data sources, delivered through a Google Sheets add-in or, more recently, a direct connector into Google BigQuery through the GCP Marketplace.

We chose to use Stitch's service for our data pipeline for this internal project due to their out-of-the-box ability to source data from Harvest and Harvest Forecast; functionally both Fivetran and Stitch would meet our needs other than support for these two particular sources, and we've used Fivetran extensively on other internal and client projects in the past.

stitch3.png

We also use Segment’s event tracking service to track visitor activity on our website and land those raw tracking events in our BigQuery project, and we intend to make use of Segment’s Personas product in due course to build out 360-degree views of our clients, site visitors and prospects based on their interactions across all of our various digital touchpoints (web, social media, inbound and outbound lead generation, chat etc).

Data Transformation, Data Integration and Orchestration

We transform and integrate raw data that Stitch syncs from each of our SaaS application sources into an integrated, query-orientated data warehouse dataset using the open-source dbt toolkit.

Once we'd decided to work with a data pipeline-as-a-service such as Stitch together with a SQL-based data management platform like Google BigQuery, the decision to transform and integrate our data via a series of SQL SELECT statements was the obvious next design choice; by using dbt and version-controlling our scripts in a Github repository we increased our productivity when developing these transformations, adhered to modern software design principles and avoided cut-and-paste scripting in favour of templated, maintainable code.

Screenshot_2019-05-07_at_19.44.16.png

Transformation steps in dbt are basically SQL SELECTs, with references to models replacing hard-coded table names and macros providing reusable, often cross-database transformation components, giving three main benefits over hand-written, author-specific code:

  • Increased analyst productivity through leveraging libraries of best-practice transformation components

  • More maintainable loading processes with concise logic that “doesn’t repeat yourself” (DRY)

  • Portable code that can be deployed on BigQuery today or Snowflake tomorrow if required

For example, the code snippet below uses the dbt_utils.datediff macro and ref() references to other models while deduplicating Harvest Forecast assignment rows, keeping only the most recently-synced version of each record.

SELECT
    *
FROM (
    SELECT
        harvest_user_id,
        a.id,
        allocation,
        project_id,
        start_date,
        end_date,
        {{ dbt_utils.datediff('start_date', 'end_date', 'day') }} + 1 AS forecast_days,
        a._sdc_sequence,
        -- Stitch appends an increasing _sdc_sequence value to each synced row,
        -- so the highest value per assignment id identifies the latest version
        MAX(a._sdc_sequence) OVER (PARTITION BY a.id ORDER BY a._sdc_sequence
            RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS latest_sdc_sequence
    FROM
        {{ ref('harvest_forecast_assignments') }} a
    INNER JOIN
        {{ ref('harvest_forecast_people') }} p
        ON person_id = p.id
    )
WHERE
    -- keep only the latest synced version of each forecast assignment
    _sdc_sequence = latest_sdc_sequence

In addition to the open-source dbt toolkit we also use the commercial dbtCloud service from our friends at Fishtown Analytics, mainly so that we can execute our transformation graph in the cloud, like this:

dbt_cloud.png

dbtCloud also hosts the data dictionary automatically generated from the metadata in our dbt package, and diagrams the transformation graph and data lineage we’ve defined through the dependencies we created between each of the models in the package.

Screenshot_2019-05-07_at_13.00.00.png

As well as standardising each data source and mapping each source's set of customer references into a single, master customer lookup table that allows us to analyze operations from sales prospect through project delivery to invoicing and payment, we also turn every customer touchpoint into events recorded against a single view of each customer, like this:

Screenshot_2019-05-07_at_20.12.37.png
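
The customer_master model that the event models join against is, in essence, a cross-reference of each source system’s customer key onto a single surrogate customer_id. The sketch below shows one way such a model could be put together, assuming the mapping is maintained as a dbt seed file (the seed name is hypothetical, and our actual implementation may differ):

{# Illustrative sketch only: map each SaaS source's customer identifier onto
   our own surrogate customer_id so that every downstream model can join on it. #}
{{
    config(
        materialized='table'
    )
}}
SELECT
    customer_id,          -- surrogate key used across the warehouse
    customer_name,
    hubspot_company_id,   -- join key for Hubspot deals and communications
    harvest_customer_id   -- join key for Harvest projects and invoices
FROM
    {{ ref('customer_master_seed') }}   -- hypothetical hand-maintained seed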

We then assign a financial revenue or cost value to each event in the customer journey, calculate metrics such as time between interactions (to measure engagement) and value delivered to the client, and use that information to better understand and segment our clients so as to provide the advice and service they’re looking for at this particular stage in their lifecycle and relationship with our team.

event_model.png

Analytics and Data Visualization

We then use Looker as our primary analytics tool over the data prepared by dbt and stored in Google BigQuery. Everyone’s homepage in Looker is set by default to show those four main operational KPIs with a comparison to target and the last six months’ trend history, providing context which, combined with Looker’s ability to go deep into the detail behind those headline numbers, means users can decide and act on what they’re seeing (note that I’ve altered and obfuscated our actual numbers in the next few screenshots).

dashboard2.png

We've also created a single business operations-wide Looker explore that enables users to analyze and explore all aspects of our sales, delivery and financial relationship with clients, suppliers, partners and prospects.

In the example below we’re again using the event model, applying sequencing to those events bucketed by months since first billable day to help us understand how the demand decay curve looks for each cohort of clients we’ve taken on over the first year we’ve been operating.

decay.png
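
Under the covers that decay-curve explore is aggregating billable-day events by cohort and by the months_since_first_billable_day column calculated in the events model; expressed as a plain SQL sketch against the deployed table it would look something like this:

-- Billable days per client cohort (month of first billable day) and months
-- since that first billable day, the basis of the demand decay curve above
SELECT
    DATE_TRUNC(DATE(first_billable_day_ts), MONTH) AS cohort_month,
    months_since_first_billable_day,
    COUNT(*) AS billable_days
FROM
    ra_data_warehouse_dbt_dev.customer_events
WHERE
    event_type = 'Billable Day'
GROUP BY
    cohort_month,
    months_since_first_billable_day
ORDER BY
    cohort_month,
    months_since_first_billable_day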

Embedded Analytics using Powered by Looker

We also link fields in our Looker explore back to the online services that can provide more background detail on a metric displayed on the screen, in the example below linking back to Harvest to see details of the project displayed in the explore data visualisation.

drilltoapp.png

Finally, for team members who are more likely to be working within our knowledge-sharing and internal company portal site, we use Powered by Looker embedded analytics within Notion, connecting our internal analytics metrics and insights directly into the productivity and workflow tools used by the team, bringing data and analytics to the problems they’re looking to solve rather than the other way around.

notion.png

So, hopefully this look behind the scenes at how we use these modern analytics stack technologies for our own analytics needs has been useful; feel free to leave comments or contact me at mark.rittman@rittmananalytics.com if any of this would be applicable to your business. In the meantime, the dbt and Looker repos used to build out our platform are available as public git repos on Github.

Feel free to fork, improve on what we’ve done, submit PRs or just see a bit more of the detail behind the examples in this post.

Supermetrics, Google BigQuery and Data Pipelines for Digital Marketers

Chances are that around the corner from the analytics project you’re working on, with SaaS data piped into your data warehouse using a service from Stitch or Fivetran, there’s a marketing analyst using Google Sheets and an add-in from Supermetrics to analyze their channel advertising spend.

Supermetrics are the biggest analytics tool vendor you’ve never heard of, and that’s because their products are aimed squarely at marketers, not data engineers or data analysts; within that particular market they’re massive. What they sell is easy-to-setup connectivity to all the major advertising and social networks directly from within Google Sheets, in the form of the Supermetrics Google Sheets Add-In, a product we’ve used in the past when looking to understand how well we place in Google search results for keywords relevant to our business.

googlesheets.png

Supermetrics do two things well: they provide connectivity to all the major ad networks that marketers use (and there are lots of them beyond the basic Adwords, Facebook Ads and Snapchat Ads that we’re most familiar with), and they focus exclusively on the non-technical marketing user rather than data engineers and data analysts, providing, in addition to their Google Sheets add-in, a connector for Google Data Studio, for example.

But although you can use Google Sheets as federated sources for Google BigQuery tables, and theoretically you could then use the scheduled data refresh feature in Supermetrics’ add-in to regularly download new data and thereby create a sort-of data pipeline into your warehouse, in practice this probably requires a bit more knowledge of how BigQuery works than the average marketer is likely to have.
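
For anyone who does want to go down that route, the BigQuery side typically involves defining an external (federated) table over the Google Sheet that the Supermetrics add-in refreshes on a schedule. The DDL below is a sketch of that approach; the dataset, table, column names and spreadsheet URL are all placeholders, and depending on your setup you may need to create the table through the BigQuery web UI or bq command-line tool rather than DDL:

-- Sketch: expose a Supermetrics-refreshed Google Sheet as a BigQuery external table
CREATE EXTERNAL TABLE marketing.adwords_spend_sheet (
    report_date DATE,
    campaign STRING,
    impressions INT64,
    clicks INT64,
    cost FLOAT64
)
OPTIONS (
    format = 'GOOGLE_SHEETS',
    uris = ['https://docs.google.com/spreadsheets/d/your-sheet-id'],  -- placeholder URL
    skip_leading_rows = 1
);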

So I was particularly interested to see that among the new BigQuery features announced at Google Cloud Next’19 last week were a number of Supermetrics-provided data connectors for BigQuery in the Google Cloud Marketplace, providing data pipeline-as-a-service connectivity from ad network sources such as Bing Ads, LinkedIn Ads, AdRoll and others, with data fed directly into Google BigQuery using the BigQuery Data Transfer Service.

supermetricspage.png

Some of these sources are also covered by new connectors from Fivetran in this same Google Cloud Marketplace, so what’s the difference between a data pipeline provided by Supermetrics and one from Fivetran or Stitch? Setting up a Supermetrics data pipeline using Google BigQuery’s Data Transfer Service is pretty straightforward and not that different from setting up a data pipeline using Fivetran or Google’s own underlying data transfer technology, as shown in the screenshots below:

steps.png

It’s when you look at how Supermetrics land data into BigQuery, though, that you see their difference in approach; instead of presenting you with several tables of data mapped to the structure of the API calls for each service, the Supermetrics BigQuery tables are typically denormalized (flattened into one big table of data) and come with extensive and verbose column descriptions and metadata, making it much easier for marketers to know what each column is used for and the precise definition of all the metrics and dimension attributes.

metadata.png

Pricing is per data source with a 14-day free trial at the start; to me it looks like Supermetrics are looking to position their pipeline-as-a-service between Stitch’s and Fivetran’s in terms of pricing, but with the service aimed at marketers, not data engineers. Details of pricing and pricing metrics are on the Supermetrics support page and full details of the service are on the Supermetrics for BigQuery product homepage.

Join us at Looker JOIN 2019, London on April 9th - The Brewery, 52 Chiswell Street

Looker JOIN London takes place next week and Rittman Analytics will be at the event as one of the Silver Sponsors - come and visit us at Stand 10, right at the front of the main hall just as you walk in!

JOIN London is an exciting opportunity to meet with Looker experts, data mavericks and industry thought leaders, all on hand to share proven insights on how to become more data-driven. Get ready for thought-provoking presentations, informative discussions and interactive sessions, adapted from the best of Looker’s JOIN conferences that originally ran in San Francisco, USA.

We’ll be there all day and keen to hear about your data initiatives, project needs or just to have a chat about Looker, analytics and driving value from data. Registration is free and still open, check-out the event details page for more information.

Released Today: The M.O.M.E. Project - Multiple OLAP Machine Emulator

On this day last year we broke the news on Oracle Discoverer Cloud Service; this year we go one better with the announcement of a new open-source software and homebrew hardware project now hitting its 1.0.0 release milestone: M.O.M.E. - Multiple OLAP Machine Emulator, sponsored by Rittman Analytics.

From the project homepage on Github:

console.png

“MOME is a framework for emulating multi-dimensional Online Analytical Processing ("MOLAP") engines, 4GL programming languages and the servers and OLAP applications that ran on those engines.

MOME's purpose is to preserve decades of multi-dimensional MOLAP engine and software history. As analytical processing technology continues to rush forward, MOME prevents this important "vintage" technology from being lost and forgotten. This is achieved by documenting the MOLAP engine, its multi-dimensional storage format and 4GL query languages, along with the software and sample databases for those engines such as Microsoft OLAP Services 7.0 AdventureWorks.

Over time, MOME (originally stood for Multiple OLAP Machine Emulator) absorbed the sister-project MESS (Multi-dimensional Engine Super System), so MOME now documents a wide variety of (mostly vintage) MOLAP, Relational OLAP ("ROLAP") and now Hybrid OLAP "HOLAP" engines in addition to the MOLAP engines and desktop applications that were its initial focus.”

MJR Analytics is now Rittman Analytics ... and an Update on our First Six Months

As of last Friday our company name has changed from MJR Analytics to Rittman Analytics. Nothing else has changed in terms of ownership or the services we provide, but we wanted to take advantage of our ability now to put our name and reputation front and centre of our branding and website. All emails to our old @mjr-analytics.com domain will automatically redirect, and any links pointing to https://www.mjr-analytics.com will automatically redirect to http://rittmananalytics.com.

Other than changing our name, since our launch last September we've worked with a number of great customers together with our vendor partners, with five of those projects written up as case studies and customer quotes in the Customers section of our website:

  • Colourpop : eCommerce and customer analytics using Shopify, Redshift and Looker

  • Let's Do This : eCommerce/conversion analytics using Segment, Redshift and Looker

  • Florence : Operational analytics, retention and RFM segmentation using Looker and PostgreSQL

  • INTO University Partnerships : Digital Analytics Strategy for MS PowerBI and Heap Analytics

  • Qubit : Data Engineering and customer/personalization analytics using Google BigQuery and Looker

We also co-chaired and helped organize the recent Looker London Dev Meetup together with Jon Palmer from GoCardless, and presented a short session on Improving your User Engagement with Looker’s new System Activity Model - look out for the next Meetup in June later this year!

Looker is our primary, strategic vendor partner and since our launch last year we’ve also partnered with selected vendors that provide features complementary to Looker:

  • Segment for integrating event-level data sources, single customer view and data quality

  • Fivetran for no-ops data pipelines

  • Qubit for consumer behavioural, segmentation and personalization analytics

  • Snowflake for no-ops cloud data warehousing-as-a-service

  • Oracle for enterprise-level analytics and ERP/CRM integration

We’re also now recruiting for our first permanent consulting roles to work alongside myself and our team of experienced contract developers and data scientists, and we're looking forward to being one of the sponsors for the upcoming Looker JOIN 2019 event in London - see some of you there if not sooner!

Drill to Detail Podcast Returns with Ep.60 'A Deeper Look Into Looker' With Special Guest Lloyd Tabb
DTDv3 iTunes Logo with Border.jpeg

One of our CEO and founder Mark Rittman’s side projects is the Drill to Detail Podcast, a podcast series where Mark interviews a special guest each episode to talk about their work, their products and their take on the analytics, big data and data management industry.

Drill to Detail is now on its 60th episode and starts its third season with Mark joined by Lloyd Tabb, Founder, CTO and Chairman of Looker, to talk about the foundational story of Looker and LookML, query latency and semantic models, analytic engines and code IDEs, analytics developer workflows and the rise of cloud elastically-scalable databases, packaged applications and embedded analytics, and why learning (and loving) are the long-term keys to analytics and business success.
