Posts tagged Looker
HubSpot Data Actions, Harvest Analytical Workflows and Looker Data Platform

At Rittman Analytics we use a number of SaaS applications to run our back office operations including HubSpot CRM for sales tracking, Harvest for timesheets and resource planning, Jira for project management and Xero for invoicing and accounting. Back in May 2019 I blogged about how we used services from Stitch, Fivetran and dbt to extract, transform and combine data from those services into a data warehouse running on Google BigQuery and since then we’ve built-out a number of finance, operations and customer success dashboards that provide our back office team with valuable insights into the performance of our business.

dashboards.png

But Looker is more than just an ad-hoc query and dashboarding tool, it’s a full data platform that integrates with SaaS applications such as Harvest, HubSpot, Xero and Salesforce at multiple levels:

  • At a data-source level, querying and analyzing data from those application sources

  • Within the Looker explore UI using features such as links, data actions and the Action Hub

  • Within business workflows, providing analytical insights and trusted metrics through the Looker API

Adding Links to HubSpot Deal and Company Pages as Explore Menu Items

One example of this is how we’ve added menu options in the Looker explore to provide direct links into HubSpot for viewing the details of a deal or a company within that application, and for our sales manager to request an update on a deal where there’s been no contact from either side for a while.

hubspot1.png

This type of simple integration is done through the “link” property in LookML that you can add to dimension and measure definitions, as I’ve done in the example below for our HubSpot opportunity name dimension.

dimension: opportunity_name {
    type: string
    sql: $.dealname ;;
    link: {
      label: "View Deal in Hubspot"
      url: "https://app.hubspot.com/contacts/4402794/deal/{  }/"
      icon_url: "http://app.hubspot.com/favicon.ico"
    }
}

Then when a user clicks on a field in an explore or a dashboard tile they’ll be able to click-through to HubSpot to view all of the context of the deal; if like us you use Google Apps authentication for both Looker and our SaaS applications you’ll be taken straight into HubSpot without having to login first.

link.png

Creating Looker Data Actions to Update HubSpot Deal Properties

A more advanced form of integration between Looker and HubSpot that we’ve setup is the ability for users to update details of a HubSpot deal directly from within Looker, a good example of how you can make your Looker content more “actionable”.

data_actions.gif

This type of integration involves another type of LookML dimension and measure property called an “action”. I first talked about actions in a blog post titled “Using Looker Data Actions to Make Monzo Spend Analysis More Interactive … and Actionable” back in 2018 where I used the feature to call services such as Google Maps, Google My Business and via Zapier, Google Tasks; in the case of our HubSpot integration we’ve again use Zapier for the webhook service that our data actions call, setting up a three stage Zapier “Zap” that takes the details of the deal and the values that need changing from the webhook payload, then searches for the deal in HubSpot that needs updating and then applies those changes to the deal record.

dealstage.png

Adding Data Action Menu Items to Request HubSpot Deal Updates via GMail

Another example of this type of data action integration we’ve setup is between Looker, HubSpot and GMail, giving us the ability to send an email to request a status update from the team directly from the Looker dashboard.

dbm.png

Leveraging the Looker API to Provide KPIs for Harvest Workflows Automated using Zapier

As well as providing hooks into HubSpot for managing the deal process directly from within Looker, we also use Looker to inform and provide trusted business key performance indicators (KPIs) for our automated workflows. An example of this is when we’ve closed a new deal and the salesperson or operations team then sets up a new project in Harvest to record our time and expenses against, and something we want to keep a close eye on is situations where a salesperson has closed the deal but done so by agreeing a daily rate below the rate we’d normally charge.

To keep an eye on this type of deal being agreed and to make sure the salesperson involved knows that the discount will be coming out of their commission, we’ve setup another Zapier workflow that detects new Harvest projects being created, runs a look within Looker that returns the average rate charged across all projects over the past three months and sends a message to the salesperson concerned if the deal rate is below this average.

revenuebot.png

In the future we can change how this average rate is calculated, introduce other data and analytics into the calculation and by using Looker we’re making sure these analytic insights are governed and delivered “as-a-service” to the requesting business process. If you’re looking to use Looker Data Platform to integrate, inform and help orchestrate your business processes then drop us an email and we’d be happy to share our experiences so far.

Xero Financial Reporting in Looker using G-Accon for Xero, BigQuery and dbt

We use Xero as our accounting package at Rittman Analytics and thought it’d be useful to bring in some of our key financial and performance metrics from Xero’s Profit and Loss report (Income Statement for our US readers) into our Looker environment, so that we can monitor gross and net revenue trends over time, check expenses and staff costs are within budget and keep an eye on our most important financial KPI, Net Margin %.

Although Xero is a supported data source for Stitch, Fivetran and most other data pipelines-as-a-service, trying to reconstruct a P&L report from all of the Xero API tables whilst applying manual journals, categorising accounts into revenue or costs and coming-up with the same numbers as Xero’s own reports is fairly tricky; another approach that we’ve used with success is to use a Google Sheets add-in such as G-Accon for Xero to schedule exports of Xero’s P&L report into a Google Sheet and then access the results using a Google BigQuery external table.

g-accon.png

When you’ve set the Xero report export up within Google Sheets and G-Accon for Xero, the next step is to setup a federated table in Google BigQuery that maps to the Google Sheets sheet that you’ve just setup. Remember to share the Google Sheets file with the GCP service account that your Looker connection uses when accessing Google BigQuery data otherwise you’ll get a permissions issue when trying to query the Sheets data from Looker.

bq.png

Our next step was then to use dbt (“Data Build Tool”) to transform Xero’s account-level P&L numbers as landed into our base BigQuery federated tables into aggregated and derived revenue, cost of sales and overheads KPIs along with net and gross profit, margins and so on.

dbt.png

Now our financial numbers and P&L account details are in Looker as additional views within our main operational analytics LookML model we can start by reproducing the original Profit & Loss report using Looker’s new beta-release Table-Next visualization, giving us subtotals for the account categories as you’d normally expect to see in a P&L-style financial report (numbers obfuscated, in-case you were wondering).

pandl.png

Then to accompany the detail-level P&L report we then created a financial performance dashboard, some of the contents coming from the derived and aggregated financial KPIs table we created beforehand in dbt, some of the numbers (discretionary expense tracking over time for example) coming from the base P&L report numbers. Note how we’ve used Looker’s recently-added trellis charting feature to break expenses spending out by the discretionary spend categories we’re most interested in, I think this works well for this type of analysis.

dashboard.png

Whether you go down the path of full Xero data integration into your warehouse using a service such as Stitch or Fivetran or if you choose the report-based export approach we used here, really comes down to whether you need granular access to every bit of your accounting data or some high-level (and processed) numbers as we did in this particular example.

We’ve done both on projects for customers and ourselves in the past, so if you’re interested in bringing in Xero or other financial accounting data into your Looker analytics platform, send us an email or call our office on +44 1273 234690 and we’ll be happy to talk through the various options with you.

New Features in Looker 6.16 : Conditional Alerts (Beta), Content Curation (Beta) and LookML IDE Folders

Looker release 6.16 is rolling-out to customers right now and comes with three new features that you might find interesting:

  • Conditional Alerts (beta),

  • Content Creation (beta), and

  • Folders with the LookML IDE

Note that to enable beta features such as conditional alerts and content creation boards in Looker you’ll need to log in as an admin to specifically enable them before they’re available for use, and as experimental features they are not fully developed and may significantly change or be completely removed in future releases.

Conditional Alerts (Beta)

You can create a form of “alert” in Looker right now by scheduling a look to be sent to you only when a report returns rows, doesn’t return rows or returns a different set of rows than last time, but this is a pretty obscure feature that most end-users won’t have heard of. The new Conditional Alerts (beta) feature in Looker 6.16 makes creating alerts much simpler and is enabled from the Labs section of the Admin page in Looker.

Untitled 3.png

Once you’ve enabled this beta feature, you’ll then see “alert” icons in the top right-hand corner of every dashboard item, like this:

Untitled 4.png

Click on the bell icon to define the alert. In the example below I’ve configured the alert to send me an email if net margin has dropped below 20% at the end of the previous day, with the email going out at 9am next day to give me a heads-up that action needs to be taken - but also saving me the time of scanning the dashboard each day just to the KPI is within target.

Untitled 5.png

Alerts created by users are now listed alongside scheduled looks on the Looker Admin page together with a history of previous alert executions.

Untitled 6.png

Content Curation (Beta)

Boards are a new type of content object within Looker 6.16 and give you the ability to “pin” dashboards and looks relating to a particular topic or department together in a single place. Like conditional alerts, the new Content Creation / Boards (Beta) feature has to be enabled by an admin before it becomes available to end-users, but once you’ve done so then a new Boards (beta) menu item appears in the Browse page menu.

Untitled 8.png

Click on the Pin a Dashboard or Look button to bring up a dialog listing all of the dashboards and looks in spaces you’re able to access, like this:

Untitled 9.png

You can then add more dashboards or looks, and create sections within each board to further organize your content, as I’ve done here with our internal analytics board.

Untitled 7.png

IDE Folders

The final new feature I wanted to highlight isn’t a beta, but it does need to be enabled for each individual LookML project before use. To switch on the new folders feature for your LookML project first select Project Settings from the drop-down menu in the LookML developer IDE:

Untitled 3a.png

Then, check the Enable Folders checkbox just above the Save Project Settings button.

Untitled 3b.png

Now, press the “+” button to see a new Create Folder option.

Untitled 3d.png

Then once you’ve given your folder a name, you can move existing view, model and other LookML object files into the folder you’ve just created. Note how the various LookML files have different icons now for views, models etc.

Untitled 3e.png

Release notes for Looker 6.16 are on the Looker website, and contact us now if you’d like to find out how we can help you get moving with your Looker implementation.

News on the Second London Looker Developer Meetup, 10th July 2019 at GoCardless, London

The second London Looker Developer Meetup is less than two weeks away now, running from 6pm - 9pm on Wednesday July 10th 2019 at the GoCardless offices in Goswell Road, London. Co-hosted by Mark Rittman and Jon Palmer from GoCardless, join us for an evening of chat, networking, presentations and Q&As around the Looker platform and the following agenda:

  • 6:00 - 6:30pm: Registration, Food & Networking

  • 6:30 - 6:40pm: Welcome and Introduction - Jon Palmer, Head of BI at GoCardless and Mark Rittman, CEO of Rittman Analytics

  • 6:40 - 7:00pm: “Getting Started with Looker” - Baha Sahin, Onfido

  • 7:00 - 7:20pm: “Balancing Liquidity in an On-Demand Staffing Marketplace” - Charles Armitage, Florence

  • 7:20 - 7:35pm: Update from Looker - Dave Hughes and Zara Wells, Customer Success at Looker

  • 7:35 - 8:00pm: Data Analytics Panel

  • 8:00 - 9:00pm: Networking and close

We’re particularly looking-forward to Charles Armitage’s presentation as Florence were Rittman Analytics’ first ever UK client when we launched the company back last year, and we’ve continued to work with Charles and the team since our original engagement right through to their recent successful Series A funding round.

The First London Looker meetup ended-up oversubscribed so make sure you register now if you’re planning on attending, and I’ll be joined by the Rittman Analytics team on the night if you’d like to chat with one of us about your Looker implementation - drop me an email at mark.rittman@rittmananalytics.com if you’d like to schedule a chat in-advance.

Continuous Integration and Automated Build Testing with dbtCloud

If you read my recent blog post on how Rittman Analytics built our operational analytics platform running on Looker, Stitch, Google BigQuery and dbt, you’ll know that we built-out the underlying data infrastructure using a modular Extract, Transform and Load (ELT) design pattern, like this:

ra_analytics_architecture (1).png

We version-control our dbt development environment using git and Github.com, and do all new development beyond simple bug fixes as git feature branches, giving us a development process for dbt that looked like this:

original2.png

1. We’d start by cloning dbt git repo master branch from github.com to the developer’s workstation, which also would have dbt installed locally along with the Google Cloud SDK so that they can connect to our development BigQuery dataset.

clone.png

2. Create a new, local git branch for the new feature using the git CLI or a tool such as Github desktop

newnrahc.png

3. Develop the new feature locally using the developer’s install of dbt, committing any changes to the feature branch in that developer’s local git repo after checking that all dbt tests have run successfully.

locals-imac:ra_dw markrittman$ dbt run --models harvest_time_entries harvest_invoices --target dev
Running with dbt=0.13.1
Found 50 models, 8 tests, 0 archives, 0 analyses, 109 macros, 0 operations, 0 seed files, 35 sources

20:53:11 | Concurrency: 1 threads (target='dev')
20:53:11 | 
20:53:11 | 1 of 2 START table model ra_data_warehouse_dbt_dev.harvest_invoices.. [RUN]
20:53:13 | 1 of 2 OK created table model ra_data_warehouse_dbt_dev.harvest_invoices [OK in 1.80s]
20:53:13 | 2 of 2 START table model ra_data_warehouse_dbt_dev.harvest_time_entries [RUN]
20:53:15 | 2 of 2 OK created table model ra_data_warehouse_dbt_dev.harvest_time_entries [OK in 1.70s]
20:53:15 | 
20:53:15 | Finished running 2 table models in 5.26s.

Completed successfully

Done. PASS=2 ERROR=0 SKIP=0 TOTAL=2
locals-imac:ra_dw markrittman$ dbt test --models harvest_time_entries harvest_invoices --target dev
Running with dbt=0.13.1
Found 50 models, 8 tests, 0 archives, 0 analyses, 109 macros, 0 operations, 0 seed files, 35 sources

20:53:37 | Concurrency: 1 threads (target='dev')
20:53:37 | 
20:53:37 | 1 of 4 START test not_null_harvest_invoices_id....................... [RUN]
20:53:39 | 1 of 4 PASS not_null_harvest_invoices_id............................. [PASS in 1.50s]
20:53:39 | 2 of 4 START test not_null_harvest_time_entries_id................... [RUN]
20:53:40 | 2 of 4 PASS not_null_harvest_time_entries_id......................... [PASS in 0.89s]
20:53:40 | 3 of 4 START test unique_harvest_invoices_id......................... [RUN]
20:53:41 | 3 of 4 PASS unique_harvest_invoices_id............................... [PASS in 1.08s]
20:53:41 | 4 of 4 START test unique_harvest_time_entries_id..................... [RUN]
20:53:42 | 4 of 4 PASS unique_harvest_time_entries_id........................... [PASS in 0.83s]
20:53:42 | 
20:53:42 | Finished running 4 tests in 5.27s.

Completed successfully

4. All dbt transformations at this stage are being deployed to our development BigQuery dataset.

5. When the feature was then ready for deployment to our production BigQuery dataset, the developer would then push the changes in their local branch to the remote git repo, creating that branch if it didn’t already exist.

commit.png

6. Then they’d create a pull request using Github Desktop and the Github web UI summarising the changes and new features added by the development branch.

create_pr.png

7. I’d then review the PR, try and work out if the changes were safe to merge into the master branch and then accept the pull reques, and then overnight our dbtCloud service would clone that development git repo master branch and attempt to deploy the new set of transformations to our production BigQuery dataset.

And sometimes, if for whatever reason that feature branch hadn’t been properly build-tested, that scheduled overnight deployment would then fail.

fail.png

So having noticed the new Build on Pull Request feature that comes with paid versions of dbtCloud, we’ve upgraded from the free version to the $100/month basic paid version and added an automated “continuous integration” build test to our feature branch development process so that it now looks like this:

DIAG2.png

To set this automated build test feature up, we first linked our Github account to dbtCloud and then created a new dbtCloud job that triggers when a user submits a pull request.

buildtest.png

8. Now, when I come to review the pull request for this feature branch, there’s an automated test added by dbtCloud that checks to see whether this new version of my dbt package deploys without errors.

buildstep.png

9. Looking at the details behind this automated build test, I can see that dbtCloud has created a dataset and runs through a full package deployment and test cycle, avoiding any risk of breaking our production dbt environment should that test deployment fail.

pass.png

10. Checking back at the status of the test build in Github, I can see that this test completed with no issues, and I’m therefore safe to merge the changes in this feature branch into our development master branch, ready for dbtCloud to pick-up these changes overnight and deploy them as scheduled into production.

testpass.png

Docs on this recent dbtCloud feature are online here, and you’ll need at least the basic $100/month paid version of dbtCloud for this feature to become available in the web UI. It’s also dependent on Github as the git hosting service, so not available as an option if you’re using Gitlab, BitBucket or CodeCommit.

Customer Journey and Lifetime Value Analytics using Looker, Google BigQuery, Stitch and dbt

One of the most common analytics use-cases I come across on client projects is to understand the “customer journey”; the series of stages prospects and customers typically go through from becoming aware of a company’s products and services through to consideration, conversion, up-sell and repurchase and eventually, to churn.

Customers typically interact with multiple digital and offline channels over their overall lifetime, and if we take the model of a consulting services business such as Rittman Analytics, a typical customer journey and use of those touch-points might look like the diagram below.

1_hMO8fEmwA6A3McyUfhWMZA.png

Each customer and prospect interaction with those touch-points sends signals about their buying intentions, product and service preferences and readiness to move through to the next stage in the purchase funnel.

1_V36-h1V-t7MDWZBG-Eqoaw.png

Creating visualization such as these in Looker typically involves two steps of data transformation either through a set of Looker (persistent) derived tables, or as we’ve done for our own operational analytics platform, through two dbt (Data Build Tool) models sequenced together as a transformation graph.

To start, we select a common set of columns from each digital touchpoint channel as brought in by our project data pipeline tool, Stitch; Hubspot for inbound and outbound marketing communctions together with sales activity and closes/loses; Harvest for billable and non-billable client days; Segment for website visits and so on. For each source we turn each type of customer activity they record into a type of event, details of the event and a monetary value (revenue, cost etc) that we can then use to measure the overall lifetime value of each client later on. 

1_e_DGbbE9s3R9i7GCfyY3wg.png

Note the expressions at the start of the model definition that define a number of recency measures (days since last billable day, last contact day and so on) along with another set that are used for defining customer cohorts and bucketing activity into months and weeks since those first engagement date.

{{
    config(
        materialized='table',
        partition_by='DATE(event_ts)'
    )
}}
SELECT
    *,
    {{ dbt_utils.datediff('last_billable_day_ts', current_timestamp(), 'day')}} AS days_since_last_billable_day,
    {{ dbt_utils.datediff('last_incoming_email_ts', current_timestamp(), 'day')}} AS days_since_last_incoming_email,
    {{ dbt_utils.datediff('last_outgoing_email_ts', current_timestamp(), 'day')}} AS days_since_last_outgoing_email,
    {{ dbt_utils.datediff('last_site_visit_day_ts', current_timestamp(), 'month')}} AS months_since_last_site_visit_day,
    {{ dbt_utils.datediff('last_site_visit_day_ts', current_timestamp(), 'week')}} AS weeks_since_last_site_visit_day
FROM
  (SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_ts) AS event_seq,
      MIN(CASE WHEN event_type = 'Billable Day' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_billable_day_ts,
      MAX(CASE WHEN event_type = 'Billable Day' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_billable_day_ts,
      MIN(CASE WHEN event_type = 'Client Invoiced' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_invoice_day_ts,
      MAX(CASE WHEN event_type = 'Client Invoiced' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_invoice_day_ts,
      MAX(CASE WHEN event_type = 'Site Visited' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_site_visit_day_ts,
      MAX(CASE WHEN event_type = 'Incoming Email' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_incoming_email_ts,
      MAX(CASE WHEN event_type = 'Outgoing Email' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_outgoing_email_ts,
      MIN(event_ts)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_contact_ts,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type = 'Billable Day' THEN date(event_ts) END)          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},MONTH) AS months_since_first_billable_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type = 'Billable Day' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},WEEK) AS weeks_since_first_billable_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type like '%Email%' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},MONTH) AS months_since_first_contact_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type like '%Email%' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},WEEK) AS weeks_since_first_contact_day,
      MAX(CASE WHEN event_type = 'Billable Day' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS billable_client,
      MAX(CASE WHEN event_type LIKE '%Sales%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS sales_prospect,
      MAX(CASE WHEN event_type LIKE '%Site Visited%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS site_visitor,
      MAX(CASE WHEN event_details LIKE '%Blog%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS blog_reader,
      MAX(CASE WHEN event_type LIKE '%Podcast%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS podcast_reader
  FROM
  -- sales opportunity stages
      (SELECT
            deals.lastmodifieddate AS event_ts,
          customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          deals.dealname AS event_details,
          deals.dealstage AS event_type,
          AVG(deals.amount) AS event_value
      FROM
          {  } AS customer_master
      LEFT JOIN
          {  } AS deals
          ON customer_master.hubspot_company_id = deals.associatedcompanyids
      LEFT JOIN
          {  } AS owners
          ON deals.hubspot_owner_id = CAST(owners.ownerid AS STRING)
      WHERE
          deals.lastmodifieddate IS NOT null
      {  }
      UNION ALL
  -- consulting days
      SELECT
            time_entries.spent_date AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          projects.name AS event_details,
          CASE WHEN time_entries.billable THEN 'Billable Day' ELSE 'Non-Billable Day' END AS event_type,
          time_entries.hours * time_entries.billable_rate AS event_value
      FROM
          {  } AS customer_master
      LEFT JOIN
          {  } AS projects
          ON customer_master.harvest_customer_id = projects.client_id
      LEFT JOIN
          {  } AS time_entries
          ON time_entries.project_id = projects.id
      WHERE
          time_entries.spent_date IS NOT null
      {  }
  UNION ALL
  -- incoming and outgoing emails
      SELECT
            communications.communication_timestamp AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          communications.communications_subject AS event_details,
          CASE WHEN communications.communication_type = 'INCOMING_EMAIL' THEN 'Incoming Email'
               WHEN communications.communication_type = 'EMAIL' THEN 'Outgoing Email'
               ELSE communications.communications_subject
          END AS event_type,
          1 AS event_value
      FROM
          {  } AS customer_master
      LEFT JOIN
          {  } AS communications
          ON customer_master.hubspot_company_id = communications.hubspot_company_id
      WHERE
          communications.communication_timestamp IS NOT null
      {  }
  UNION ALL
  -- sales opportunity stages
      SELECT
            invoices.issue_date AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          invoices.subject AS event_details,
          'Client Invoiced' AS event_type,
          SUM(invoices.amount) AS event_value
      FROM
          {  } AS customer_master
      LEFT JOIN
          {  } AS invoices
          ON customer_master.harvest_customer_id = invoices.client_id
      WHERE
          invoices.issue_date IS NOT null
      {  }
  UNION ALL
      SELECT
         invoices.issue_date AS event_ts,
         customer_master.customer_id AS customer_id,
         customer_master.customer_name AS customer_name,
         invoice_line_items.description AS event_details,
         'Client Credited' AS event_type,
          COALESCE(SUM(invoice_line_items.amount ), 0) AS event_value
      FROM
          {  } AS customer_master
      LEFT JOIN
          {  } AS invoices
          ON customer_master.harvest_customer_id = invoices.client_id
      LEFT JOIN
          {  } AS invoice_line_items
          ON invoices.id = invoice_line_items.invoice_id
      {  }
      HAVING
         (COALESCE(SUM(invoice_line_items.amount ), 0) < 0)
  UNION ALL
      SELECT 
              pageviews.timestamp AS event_ts,
              customer_master.customer_id  AS customer_id,
              customer_master.customer_name AS customer_name,
              pageviews.page_subcategory AS event_details,
              'Site Visited' AS event_type,
              sum(1) as event_value
      FROM 
          {{ ref('customer_master') }}  AS customer_master
     LEFT JOIN 
          {{ ref('pageviews') }} AS pageviews 
          ON customer_master.customer_name = pageviews.network
      {{ dbt_utils.group_by(n=5) }}
  UNION ALL
      SELECT
          *
      FROM
          (SELECT
              invoices.paid_at AS event_ts,
          customer_master.customer_id AS customer_id,
              customer_master.customer_name AS customer_name,
              invoices.subject AS event_details,
              CASE WHEN invoices.paid_at <= invoices.due_date THEN 'Client Paid' ELSE 'Client Paid Late' END AS event_type,
              SUM(invoices.amount) AS event_value
          FROM
              {  } AS customer_master
          LEFT JOIN {  }  AS invoices
              ON customer_master.harvest_customer_id = invoices.client_id
          WHERE
            invoices.paid_at IS NOT null
          {  }
          )
      )
  WHERE
      customer_name NOT IN ('Rittman Analytics', 'MJR Analytics')
  )

Deploying this model and then querying the resulting table, filtering on just a single customer and ordering rows by timestamp, you can see the history of all interactions by that customer with our sales and delivery channels.

1_KLvPEmYDL8kfH-u4brfj-g.png

You can also plot all of these activity types into a combined area and bar chart, for example, to show billable days initially growing from zero to a level that we maintain over several months, and with events such as client being invoiced, incoming and outgoing communications and the client actually paying in-full at the end.

1_CpYJkIbFnWL2PO7Rc5RtNQ.png

We can also do things such as restrict the event type reported on to billable days, and then use Looker’s dimension group and timespans feature to split the last set of weeks into days and week numbers to then show client utilisation over that period.

1_oCOLcQ-CJlq6ogPxO9jQRg.png

Of course we can also take those monetary values assigned to events and use them to calculate the overall revenue contribution for this customer, factoring in costs incurred around preparing sales proposals, answering support questions and investing non-billable days in getting a project over the line.

1_mHGgPI5mHPs_repqB3Nj1Q.png

Our second dbt model takes this event data and pivots each customer’s events into a series of separate columns, one for the first event in sequence, another for the second, another for the third and so on.

1_2ASOQfs1GYgHM3F36ozjRw.png

Again, code for this dbt model is available in our project git repo.

{{
    config(
        materialized='table'
    )
}}
WITH event_type_seq_final AS (
    SELECT
        customer_id,
        user_session_id AS event_type_seq,
        event_type,
        event_ts,
        is_new_session
    FROM
        (SELECT
            customer_id,
            event_type,
            last_event,
            event_ts,
            is_new_session,
            SUM(is_new_session) OVER (ORDER BY customer_id ASC, event_ts ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
            SUM(is_new_session) {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS user_session_id
        FROM
            (SELECT
                *,
                CASE WHEN event_type != LAG(event_type, 1) OVER (PARTITION BY customer_id ORDER BY event_ts ASC) OR last_event IS NULL THEN 1
                ELSE 0
                END AS is_new_session
            FROM
                (SELECT
                    customer_id,
                    customer_name,
                    CASE WHEN event_type LIKE '%Email%' THEN 'Presales'
                    WHEN event_type LIKE '%Bill%' THEN 'Delivery' ELSE event_type END AS event_type,
                    event_ts,
                    LAG(CASE WHEN event_type LIKE '%Email%' THEN 'Presales' WHEN event_type LIKE '%Bill%' THEN 'Delivery' ELSE event_type END, 1) OVER (PARTITION BY customer_id ORDER BY event_ts ASC) AS last_event
                FROM
                    {  }
                ORDER BY
                    customer_id ASC,
                    event_ts ASC,
                    event_type ASC
                ) last
             ORDER BY
                customer_id ASC,
                event_ts ASC,
                event_type ASC
           ) final
        ORDER BY
            customer_id ASC,
            is_new_session DESC,
            event_ts ASC
        )
    WHERE
        is_new_session = 1
    {  }
    ORDER BY
        customer_id,
        user_session_id,
        event_ts
)
SELECT
    customer_id,
    MAX(event_type_1) AS event_type_1,
    MAX(event_type_2) AS event_type_2,
    MAX(event_type_3) AS event_type_3,
    MAX(event_type_4) AS event_type_4,
    MAX(event_type_5) AS event_type_5,
    MAX(event_type_6) AS event_type_6,
    MAX(event_type_7) AS event_type_7,
    MAX(event_type_8) AS event_type_8,
    MAX(event_type_9) AS event_type_9,
    MAX(event_type_10) AS event_type_10
FROM
    (SELECT
        customer_id,
        CASE WHEN event_type_seq = 1 THEN event_type END AS event_type_1,
        CASE WHEN event_type_seq = 2 THEN event_type END AS event_type_2,
        CASE WHEN event_type_seq = 3 THEN event_type END AS event_type_3,
        CASE WHEN event_type_seq = 4 THEN event_type END AS event_type_4,
        CASE WHEN event_type_seq = 5 THEN event_type END AS event_type_5,
        CASE WHEN event_type_seq = 6 THEN event_type END AS event_type_6,
        CASE WHEN event_type_seq = 7 THEN event_type END AS event_type_7,
        CASE WHEN event_type_seq = 8 THEN event_type END AS event_type_8,
        CASE WHEN event_type_seq = 9 THEN event_type END AS event_type_9,
        CASE WHEN event_type_seq = 10 THEN event_type END AS event_type_10
    FROM
        event_type_seq_final
    )
{  }

Pivoting each customer’s events in this way makes it possible then to visualize the first n events for a given set of customers using Looker’s Sankey custom visualization, as shown in the final screenshot below.

Untitled 6.png

More details on our own internal use of Looker, Stitch, dbt and Google BigQuery to create a modern operational analytics platform can be found in this earlier blog post, and if you’re interested in how we might help you get started with Looker, check out our Services page.

How Rittman Analytics does Analytics: Modern BI Stack Operational Analytics using Looker, Stitch, dbt and Google BigQuery

As well as creating modern BI stack analytics solutions for clients such as Florence, Let’s Do This and Colourpop using technologies from our partners at Looker, Stitch, Segment, Fivetran and Snowflake, it’s probably no surprise that we’ve used those same technologies and project experience to build-out our own internal analytics platform, covering data and operational analytics use-cases such as

  • Showing high-level KPIs that show progress towards our four main business objectives

  • Providing actionable context for these high-level KPIs and the ability to drill-into the detail behind them

  • Enabling ad-hoc querying and analysis of our complete dataset by technical and non-technical users

  • Providing a trusted, integrated analysis-ready data platform for other internal projects

overview.png

I thought it might be of interest to go through how we’ve used the tools and techniques we use for client projects but in this case to deliver an analytics solution for our own internal use, sharing some of our thinking around how we defined our underlying KPI framework, the tools and design patterns we used when building out the data management part of the platform, and some of the front-end components and data visualizations we’ve put together using the Looker BI tool.

Defining your KPI Framework

When creating any sort of analytics solution the first thing you need to understand is - what are your business objectives; what do you need to measure to understand your progress towards those objectives; and how do you then define success?

Key Performance Indicators (KPIs) and measures within our analytics platform are based around what's needed to measure our progress towards four business objectives we set ourselves for FY2019:

  1. Increase Billing Revenue

  2. Increase Profitability

  3. Increase customer retention, and

  4. Increase our operational and delivery efficiency

In-practice, we “certify” data around these four primary KPIs and also a number of secondary measures, with data in the platform normally up-to-date as of the previous business day and with historical data going back to when we started the business.

kpi.png

Other metrics, measures and dimension attributes are provided alongside these certified KPIs and measures and are considered “best endeavours” in terms of our certifying their accuracy, and users then of course able to create their own calculations for use in reports and analyses outside of this formal curation.

Platform Architecture

At a high-level our internal analytics platform extracts data on a regular basis from the SaaS-based services we use to run our business operations, joins that data together so that each dataset uses the same definition of clients, products and other reference data, then makes that combined dataset available for analysis as a set of business KPIs and an ad-hoc query environment.

ra_analytics_architecture.png

The design of our internal analytics platform was based around a modular BI technology stack, ELT ("Extract, Land and Transform") data warehouse loading pattern and modern software development approach that we use on all of our client analytics development engagements.

Data Pipeline and Source Connectivity

We’re digital-first, paperless and run our business operations using cloud-hosted software services such as:

  • Harvest for timesheets, project invoicing and client-rechargeable expenses

  • Harvest Forecast for project resource planning and revenue forecasts

  • Xero for accounting and payroll

  • Hubspot CRM for sales and contact management

  • Docusign for contract management

  • Jira for project management and issue tracking

  • BambooHR for human resources and recruitment

On the delivery side, Rittman Analytics partners with four different data pipeline-as-a-Service software vendors, each of which targets a different segment of the market and that segment’s particular use-cases:

  • Stitch, who focus on small to mid-market startup customers with data engineers who value service extensibility and their land-and-expand pricing model ($100/month for up to five sources, including Hubspot and Xero certified and Harvest community-supported connectors)

  • Fivetran, aimed more at mid-market to enterprise customers and the data analyst personas, with a more turnkey solution and increased levels of service and coverage of enterprise data sources

  • Segment, more of an "enterprise service bus" for event-level digital data streams, connecting multiple event producers with downstream event consumers and offering additional features around customer journey mapping, personas and schema validation

  • Supermetrics, aimed at growth hackers and digital marketers with a focus on simple self-service setup, comprehensive data dictionaries and a comprehensive catalog of advertising, marketing and social data sources delivered through a Google Sheets add-in or more recently, a direct connector into Google BigQuery through the GCP Marketplace.

We chose to use Stitch's service for our data pipeline for this internal project due to their out-of-the-box ability to source data from Harvest and Harvest Forecast; functionally both Fivetran and Stitch would meet our needs other than support for these two particular sources, and we've used Fivetran extensively on other internal and client projects in the past.

stitch3.png

We also use Segment's event tracking service to track visitor activity on our website and land those raw tracking events also in our BigQuery project, and intend to make use of Segment's Personas product in due course to enable us to build out 360-degree views of our clients, site visitors and prospects based on their interactions across all of our various digital touchpoints (web, social media, inbound and outbound lead generation, chat etc)

Data Transformation, Data Integration and Orchestration

We transform and integrate raw data that Stitch syncs from each of our SaaS application sources into an integrated, query-orientated data warehouse dataset using the open-source dbt toolkit.

Once we'd decided to work with a data pipeline-as-a-service such as Stitch together with a SQL-based data management platform like Google BigQuery, the decision to transform and integrate our data via a series of SQL SELECT was the obvious next design choice; by using dbt and version-controlling our scripts in a Github repository we increased our productivity when developing these transformation, adhered to modern software design principles and avoided cut-and-paste scripting in-favour of templated, maintainable code.

Screenshot_2019-05-07_at_19.44.16.png

Transformation steps in dbt are basically SQL SELECTs with references to metadata models replacing hard-coded table names and use of macros as reusable, often cross-database transformation components giving three main benefits over hand-written, author-specific code:

  • Increased analyst productivity through leveraging libraries of best-practice transformation components

  • More maintainable loading processes with concise logic that “doesn’t repeat yourself” (DRY)

  • Portable code that can be deployed on BigQuery today or Snowflake tomorrow if required

For example, as shown in the example code snippet below.

SELECT
        *
    FROM (
        SELECT
            harvest_user_id,
            a.id,
            allocation,
            project_id,
            start_date,
            end_date,
            {{ dbt_utils.datediff(start_date, end_date, 'day')}} +1 AS forecast_days,
            a._sdc_sequence ,
            MAX(a._sdc_sequence) OVER (PARTITION BY a.id ORDER BY a._sdc_sequence 
                RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS latest_sdc_sequence
        FROM
            {{ ref('harvest_forecast_assignments') }} a
        INNER JOIN
            {{ ref('harvest_forecast_people') }} p
            ON person_id = p.id
        )
    WHERE
        _sdc_sequence = latest_sdc_sequence
      )

In addition to the open-source dbt toolkit we also use the commercial dbtCloud service from our friends at Fishtown Analytics, mainly in order to then be able to execute our transformation graph in the cloud, like this:

dbt_cloud.png

dbtCloud also hosts the data dictionary automatically generated from the metadata in our dbt package, and diagrams the transformation graph and data lineage we’ve defined through the dependencies we created between each of the models in the package.

Screenshot_2019-05-07_at_13.00.00.png

As well as standardising each data source and mapping each source's set of customer references into a single, master customer lookup table that allows us to analyze operations from sales prospect through project delivery to invoicing and payment, we also turn every customer touchpoint into events recorded against a single view of each customer, like this:

Screenshot_2019-05-07_at_20.12.37.png

We then assign a financial revenue or cost value to each event in the customer journey, calculate metrics such as time between interactions (to measure engagement), value delivered to the client and use that information to better understand and segment our clients so as to provide the advice and service they're looking for at this particular stage in their lifecycle and relationship with our team.

event_model.png

Analytics and Data Visualization

We then use Looker as our primary analytics tool over the data prepared by dbt and stored in Google BigQuery. Everyone's homepage in Looker is set by default to show those four main operational KPIs with comparison to target and with the last six months’ trend history, providing the context when combined with Looker’s ability to go deep into the detail behind those headline numbers so that users can decide and action on what they’re seeing (note that I’ve altered and obfuscated our actual numbers in the next few screenshots)

dashboard2.png

We've also created a single business operations-wide Looker explore that enables users to analyze and explore all aspects of our sales, delivery and financial relationship with clients, suppliers, partners and prospects.

In the example below we've again using the event model and applying sequencing to those events bucketed by months since first billable day, to help us understand how the the demand decay curve looks for each cohort of clients we've taken on over the first year we've been operating.

decay.png

Embedded Analytics using Powered by Looker

We also link fields in our Looker explore back to the online services that can provide more background detail on a metric displayed on the screen, in the example below linking back to Harvest to see details of the project displayed in the explore data visualisation.

drilltoapp.png

Finally, for team members who are more likely to be working within our knowledge-sharing and internal company portal site, we use Powered by Looker embedded analyics within Notion, connecting our internal analytics metrics and insights directly into the productivity and workflow tools used by the team, bringing data and analytics to the problems they’re looking to solve rather than the other way around.

notion.png

So, hopefully this look behind the scenes at how we use these modern analytics stack technologies for our own analytics needs has been useful, and feel free to leave comments or contact me at mark.rittman@rittmananalytics.com if any of this would be applicable to your business. In the meantime, the dbt and Looker git repos used to build-out our platform are available as public git repos on Github:

Feel free to fork, improve on what we’ve done, submit PRs or just see a bit more of the detail behind the examples in this post.

Looker London Meetup on Thursday, February 7th 2019 - Registration Now Open!
meetup2.png

If you’re in London on Thursday 7th February 2019 and a developer or end-user working with the Looker Data Platform, we’re very pleased to be launching together with Jon Palmer from Go Cardless the first ever Looker London Dev Meetup, sponsored by the good folks at Looker and hosted at Go Cardless’ HQ at 65 Goswell Road, London EC1V 7EN.

The aim is to create a space for showcasing and discussing what data people in London are building in and around the Looker ecosystem, and for this first meetup we have three great customer and partner speakers sharing their insights on Looker development and project implementations:

  • "Scale securely with the Looker API" Dylan Baker, Consultant

  • "Internal and External Multi-tenancy with Looker" - Ian Thompson, King

  • "Increase End-User Engagement with Looker 6 System Activity Model" - Mark Rittman, MJR Analytics

followed by a panel session where the speakers are joined by one of our guests from Looker:

  • “Data Analytics Panel - Hear from Industry experts and Looker users from GoCardless, King, MJR analytics and Looker.”

Registration and attendance is free but spaces are limited, so make sure you register soon if you’d like to come along - the meetup starts at at 5.30pm with drinks, food and introductions from Jon, full agenda is on the meetup page and we’re aiming to wrap-up around 9pm with closing comments from me and a short survey on future meetup content.

Event-Level Digital Analytics using Google Analytics, Fivetran, BigQuery and Looker

A few weeks ago I posted a blog on using Mixpanel, Looker and Google BigQuery to analyze listener data from the Drill to Detail website, with Mixpanel tracking individual episode play events by visitors to the site and Fivetran replicating that event data over to BigQuery for analysis using Looker.

Mixpanel is great but like most digital businesses using Google Analytics (GA) to analyze and measure activity on their website, we actually use Google Tag Manager and GA events to track visitor activity on the mjr-analytics.com website. Whilst a full site license for Google Analytics 360 is way beyond our budget right now, the free version of GA together with GTM provides us with an unsampled, event-level stream of visitor events and together with goals and simple A/B testing tools a way to start optimizing the content on our site. 

Couple all of this with Fivetran and Looker and we can start to do some interesting analytics on all this data to not only count sessions, visitors and pages viewed but also start looking at visitor retention and the routes those visitors take through the various features on our website.

Events in Google Analytics are a way of recording interactions visitors to your site make with items on a page such as selecting an item on a navigation menu, clicking on a download link or seeing a special offer or product recommendation. On the mjr-analytics.com website we’ve setup events, for example, to record how far down the page a visitor scrolls so we can see how many people actually scroll past the hero image on our front page and look at the content on Looker services, Oracle services and so on.

Untitled 2.001.png

Another event records clicks on the specific hero image a visitor clicks on with another recording what navigation menu item was clicked on, in each case storing the image name or menu item clicked on as the event label.

Untitled 2.002.png

Whilst you can record events being triggered by adding custom Javascript to your website’s HTML page definitions, we use GTM’s graphical point-and-click interface to set-up our events and the visitor actions that trigger them.

Untitled 2.005.png

Enabling GTM’s debug mode for our site shows us which events then trigger when visiting our site, with the screenshot below showing a typical visitor journey through our homepage and the events that are triggered as they scroll-through and interact with various items on the page.

Untitled 2.003.png

Data on events being triggered along with event label and event action values are stored in a Javascript data layer that can also receive values sent over, for example, by an eCommerce application so that clicks on a product catalog item can record the price on offer as well as what product was clicked on. 

Looking at the Data Layer view in Google Tag Manager you can see the event entries recorded for this visitor to the site page, but note also how nothing personally identifiable is recorded except for a unique client ID set at the device/browser level that can then be used to analyze new vs. returning visitors and track visitor retention and engagement over time.

Untitled 2.004.png

As we then did with the Mixpanel example and the Drill to Detail website, we then use Fivetran to replicate all of the data layer events over to Google BigQuery, storing this granular visitor behavioural data in a BigQuery table and dataset. As you can see from the screenshot below we use Fivetran extensively within MJR Analytics; replicating invoices and all our other accounting data from Xero into BigQuery along with timesheet data from Harvest, CRM data from Hubspot, commits and PRs from Github, tasks outstanding and completed from Asana and search rankings and keywords from Google Search Console all into a Google BigQuery project whose datasets we then combine and analyze together using Looker.

Untitled 2.006.png

For the event data coming across from GA we use Fivetran to replicate the relevant event tracking dimensions and metrics into BigQuery every five minutes, giving us near real-time numbers on what’s popular and what’s engaging on our company website.

Untitled 2.007.png

As the event data from GA arrives as table rows of individual events being triggered, I then sessionize those individual events into rows of visitor sessions, using BigQuery SQL to pivot the incoming data and create individual columns for the first event, second event, first page view and first product item clicked on, for example, within an individual visitor session. The actual SQL used for the view is too long really to include here but to get the idea, here’s a condensed version with repeating parts removed for clarity.

SELECT
  global_session_id,
  visitor_session_id,
  visitor_id,
  visitor_session_start_date_time,
  MAX(visitor_session_length_mins) AS visitor_session_length_mins,
  visitor_first_event_week_cohort,
  city,
  session_events_count,
  ...
  event_1_category,
  event_1_label,
  ...
  page_view_1,
  page_view_2,
  ...
FROM (
  SELECT
    e.global_session_id,
    e.visitor_session_id,
    e.visitor_id,
    e.visitor_session_start_date_time,
    e.visitor_session_length_mins,
    e.visitor_first_event_week_cohort,
    e.city,
    MAX(e.visitor_session_event_seq_num) OVER (PARTITION BY e.global_session_id) AS session_events_count,
    (
    SELECT
      COUNT(*)
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      global_session_id = e.global_session_id
      AND event_category = 'Page View') AS page_view_count,
    SELECT
      COUNT(*)
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      global_session_id = e.global_session_id
      AND event_category = 'Nav Bar Clicks') AS nav_bar_click_count,
    ...,
    e1.event_label AS event_1_label,
    e2.event_category AS event_2_category,
    e2.event_label AS event_2_label,
    ...
    pv.event_label AS page_view_1,
    pv2.event_label AS page_view_2,
    ...,
    igc.event_label AS index_gallery_first_click
  FROM
    `aerial-vehicle-148023.dw_staging.ga_events` e
  LEFT OUTER JOIN (
    SELECT
      global_session_id,
      event_label,
      ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) page_view_seq_num
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      event_category = 'Page View'
    ORDER BY
      3) pv
  ON
    e.global_session_id = pv.global_session_id
    AND pv.page_view_seq_num = 1
  LEFT OUTER JOIN (
    SELECT
      global_session_id,
      event_label,
      ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) page_view_seq_num
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      event_category = 'Page View'
    ORDER BY
      3) pv2
  ON
    e.global_session_id = pv2.global_session_id
    AND pv2.page_view_seq_num = 2
  ...
  JOIN
    `aerial-vehicle-148023.dw_staging.ga_events` e1
  ON
    e.global_session_id = e1.global_session_id
    AND e1.visitor_event_seq_num = 1
  LEFT OUTER JOIN
    `aerial-vehicle-148023.dw_staging.ga_events` e2
  ON
    e.global_session_id = e2.global_session_id
    AND e2.visitor_event_seq_num = 2
  ...
GROUP BY
  1,
  2,
  ...,
  46
ORDER BY
  1

Creating a view or materialized table from this BigQuery SQL then gives me one record per session with event totals, page view and event paths along with session and client IDs that are perfect then for bringing into Looker for further analysis.

Untitled 2.008.png

Then finally it’s just a case of bringing that table of data into Looker as a LookML view and using one of Looker’s custom visualization blocks to display visitor event paths as a Sankey diagram, for example.

Untitled 2.009.png

Or we can aggregate and analyze counts of events from sessions to see in graphical form which offers, download items and navigation menu items are most popular over a given period.

Untitled 2.010.png

If you’d like to understand more about how MJR Analytics can help you better understand what visitors are looking at and interacting with on your website then drop me an email at mark.rittman@mjr-analytics.com; or even better, if you’re in Copenhagen on November 22nd and have the evening free, come along to the free meetup we’re hosting together with our partners CIMA and Looker and you’ll see me demo this all live, in person and show you how it works.

Media Analytics for the Drill to Detail Podcast using Looker, Fivetran and Mixpanel

Most readers of the MJR Analytics blog will no doubt be aware of the Drill to Detail Podcast on the big data analytics industry that we present and host, with the most recent episode covering the Looker JOIN 2018 event in San Francisco with Tristan Handy and Stewart Bryson as special guests. Things have been a bit busy over the past month with the launch of MJR Analytics and then the Looker and Oracle conferences but we try and upload a new episode every few weeks, with the next episode — fingers crossed — having a rather special guest if we can get our schedules aligned.

Listeners to the show typically either use services such as Apple’s iTunes Podcast Directory to download episodes and play them typically on their iOS mobile devices or they can visit the show’s website at https://www.drilltodetail.com where they can download or play episodes through an embedded audio player and click through the links in the show notes that accompany each episode.

Apple provides statistics at episode level to podcast owners via their new Podcast Analytics service but you’re limited to just the charts and fixed ways that the app allows you to analyze and explore your dataset. If you can find a way to scrape or download this data and load it into a database service such as Google BigQuery though, then you could instead use a tool such as Looker to analyze and visualize your listener data in more interesting ways that are also very applicable to customer and product analytics for any digital or eCommerce site, as I’ll explain as we go through the examples.

To start, the Looker visualization below shows for each episode of the show how many in-total of that episode were listened to by users of Apple’s Podcasts Directory service on their iPhones, iPads and Mac OS devices, with the single most listened-to episode of all time within a given month being our 50th Episode Special that featured Stewart Bryson and Alex Gorbachev as our two special guests.

1_ilhCgPXXaqpil4tKSkJd2g.png

Apple also provide data on the how much of each episode was completed by the listener, so you can get a sense of how long episodes should last and how engaged your audience was with individual episodes. In the Looker visualization below you can see that typically our listeners get through around half to two-thirds of each episode but were particularly engaged with Wes McKinney episode on Apache Arrow, Python Pandas and his blog post comparing Arrow with ORC and Parquet that was generating a lot of discussion in the open-source and big data community at the time.

1_mzVle2MGHq9KlDgpwB7aQQ.png

So far the way I’ve analysed downloads and listener numbers for the podcast is not too dissimilar to how businesses used to only measures sales per month but didn’t really think about how many of their customers and visitors coming onto their site each month were new or in-fact returning customers looking to buy more. 

These days knowing how many of your customers are new or returning is the first thing you’d want to understand as an eCommerce manager because it’s cheaper to sell to an existing customer than to go out and recruit a new one, and one of your main priorities is to try and turn those first-time buyers into loyal, repeat purchasers who then become your most valuable customers.

My focus on episodes rather than listeners was in-reality more due to Apple not providing episode download and listening data at the granularity of an individual user or device ID, but some time ago I also set up Mixpanel, a Javascript event-tracking service similar to Heap and Google Analytics’s custom events, to track when visitors played episodes of the show on the site through their browser as shown in the screenshots below.

1_9-dnYMTQqFJ1E-pfBnmNDg.png

Data is still only logged at unique device level as there’s no registration facility on the site or other means to tie downloads to particular users, but now we have access to additional details of each download such as where the user was located and what type of browser and OS they were using.

Mixpanel stores the events it tracks in a database service it hosts, but you can use a managed data replication service such as the one from our partners at Fivetran, shown in the screenshots below, to copy that data into a warehousing platform such as BigQuery as I’ve done and shown below.

1_NmcCG2ET75sX-M4htrOO7Q.png

Going back to Looker I now create two new LookML views against the single table of Mixpanel tracked events that Fivetran replicated into BigQuery for me, the first of which is keyed on the unique distinct_id columns and contains dimensions and measures for the listener’s device used for accessing the site and playing-back episodes of the show.

view: unique_devices {
  derived_table: {
    sql: SELECT
  distinct_id,
  device,
  mp_country_code,
  os, region, screen_height, screen_width
FROM
  `aerial-vehicle-148023.mixpanel.event`
WHERE
  name IN ('Clicked Nav Link',
    'Podcast Episode Played')
group by 1,2,3,4,5,6,7 ;;
  }
  dimension: distinct_id {
    primary_key: yes
    label: "Device ID"
  }
  dimension: device {
    sql: case when ${TABLE}.os in ('Windows','Mac OS X','Linux') then 'Desktop'
              when ${TABLE}.os = 'Chrome OS' then 'Chromebook'
              else device end;;
  }
  dimension: os {}
  dimension: region {}
  dimension: screen_height {}
  dimension: screen_width {}
  dimension: mp_country_code {
    label: "Country"
  }
  measure: count_unique {
    type: count_distinct
    sql: ${TABLE}.distinct_id ;;
  }
measure: count {
    type: count
  }
}

The second view is keyed on event_id and contains the dimensions and measures used for analyzing and counting playback events by those listeners’ devices.

-- page events view definition
view: page_events {
  derived_table: {
    sql: SELECT
  REPLACE(JSON_EXTRACT(e.properties,
      '$.Episode'),'"','') AS episode,
      REPLACE(JSON_EXTRACT(e.properties,
      '$.url'),'"','') AS nav_url,
      episode_number, 
      distinct_id,
      current_url, 
      event_id, 
      initial_referrer, 
      initial_referring_domain, 
      name as event_type, 
      referrer, 
      referring_domain, 
      search_engine, 
      time
FROM
  `aerial-vehicle-148023.mixpanel.event` e
left outer join `aerial-vehicle-148023.dw_staging.episode_list` l
on REPLACE(JSON_EXTRACT(e.properties,
      '$.Episode'),'"','') = l.episode
WHERE
  e.name IN ('Clicked Nav Link',
    'Podcast Episode Played') ;;
  }
dimension_group: event {
    type: time
    timeframes: [
      date,
      week,
      month,
      hour_of_day,
      day_of_week,
      day_of_month
    ]
    sql: ${TABLE}.time ;;
  }
  dimension: event_type {}
dimension: event_id {
    primary_key: yes
    hidden: yes
  }
  dimension: distinct_id {
    hidden: yes
  }
  dimension: episode {
    sql: replace(${TABLE}.episode,'Drill to Detail','') ;;
  }
dimension: episode_number {
    sql: lpad(cast(${TABLE}.episode_number as string),2,'0') ;;
  }
dimension: nav_url {}
  dimension: referring_domain {}
  dimension: search_engine {}
  measure: event_count {
    type: count
  }
  measure: unique_device_count {
    type: count_distinct
    sql: ${TABLE}.distinct_id ;;
  }
measure: unique_devices {
    type: count_distinct
        sql: ${TABLE}.distinct_id ;;
  }
measure: episode_num {
    type: average
    sql: ${TABLE}.episode_number;;
  }
}

Then, a new explore is defined within the LookML project to join these two views and make them available for analysis in the Looker Explore page.

-- explore definition
explore: page_events  {
  join: unique_devices {
    relationship: one_to_many
    sql_on: ${page_events.distinct_id} = ${unique_devices.distinct_id} ;;
  }
}

Creating a look similar to the one a moment ago but using episode play events recorded by Mixpanel as the measure, I can see that whilst most episodes had the majority of downloads in the month they were published, the episode featuring Jonathan Palmer from King Games on how they used Looker to design and optimize games such as Candy Crush Saga had downloads well after the initial month it became available, telling me that content on this type of topic has perhaps a longer “shelf-life” than others we’ve covered.

1_-Ow4UumPL1f-PWYrEu2kQQ.png

If I use the “merge results” feature introduced with Looker 5 to combine results from two or more queries, I can combine the download numbers per episode per month per unique device ID from both Apple and from Mixpanel’s tracking of plays on the website to give me a combined listener count for each episode, and you can see from the screenshot below that the Episode 50 Special is still the most downloaded episode and the later one with Jonathan Palmer does seem to still be listened to for a lot longer than most other episodes.

1_BLGORMX1o0bXwXrJk0nvBQ.png

Except it’s not. What the chart above actually shows is that Episode 50 was the most downloaded in any one month, but if you take each episode and chart total downloads overall it’s actually the Christmas and New Year Special featuring Christian Berg that overall had more unique device downloads.

1_tsGaA0vUlcjDVgvdGlSokw.png

Putting the above two merged data looks into a dashboard along with two others showing the split by iTunes and website downloads over time and those same numbers expressed as share of all downloads over time, you can get a sense of for how long each episode was downloaded after first publication, the split for each episode by download platform and how listener numbers for each method of download have changed over time. You can see next chart how downloads via iTunes have a close correlation with new episodes being published with listener numbers dropping over the summer when the show took a break; in contrast, downloads directly from the website held-up even over the summer break when no new episodes were published.

1_Zhmt6XPbKfHoAhy0wpsj7A.png

Taking just the device-level data for website downloads coming from Mixpanel and creating a dashboard from the referrer and device details attributes it confirms as I’d have expected that most of those downloads come from listeners using Windows desktop PCs or Android phones, given iTunes’ focus on MacOS and iOS devices, but I can also see that most plays coming from search traffic come from Google and occasionally from Bing, and that Christian Berg’s listeners were unusually-spread amongst lots of different countries, not just the usual UK, USA, Germany and so on.

But what none of these dashboards tell me is how good I am at retaining listeners after they’ve downloaded their first ever episode; although I suspect many listeners come for one particular episode I’d be interested to know how many devices we see each month are returning or new and we can of course extend the scope of website activity to include clicks on navigation links as another indicator of engagement and retention.

To do this we need to know, for every visitor unique device ID, what sequence number in the total set of episode plays for that device this particular play represents; if the sequence number for a given play event was 1 then we’d classify the visitor as new, whereas if the sequence number was greater than 1 then we’d classify the visitor as returning. Sequencing these device playback events also then makes it possible to analyze play histories for individual podcast listener devices to understand how many episodes a typical listener plays and whether a particular episode triggers more repeat listenings of the show than others, to understand the frequency of listening and whether returning visitors are coming back to play the same episode or taking the opportunity to discover and play other episodes from the site — exactly the sort of analysis that online retailers do when looking to turn casual browsers and one-off purchasers into loyal, repeat buyers from their site.

To do this I create an SQL query in Looker’s SQL Runner utility that returns the episodes played for each visitor distinct_id and adds a ROW_NUMBER window function to provide a sequence number for the episode plays sorted in order of time

To add this query to my Looker project as a derived table I then select Get Derived Table LookML from the menu in the top right-hand side of the page, like this:

SELECT 
 distinct_id, 
 event_id, 
 time,
 episode_number,
 row_number() over (partition by distinct_id order by time) as play_order_seq
FROM
  `aerial-vehicle-148023.mixpanel.event` e
LEFT OUTER JOIN `aerial-vehicle-148023.dw_staging.episode_list` l
ON REPLACE(JSON_EXTRACT(e.properties,
      '$.Episode'),'"','') = l.episode
WHERE
  e.name = 'Podcast Episode Played'
1_jxWHFDG70QuKg8eLgcgkOg.png

and then choose the option to add it to your project when the Get Derived Table LookML dialog is shown.

1_9DraJHEWp0IpUSzZ1i6i0w.png

Then, after setting all of the derived table columns as hidden and the event_id column as the primary key for this view, I add one new dimension to the view that returns “new” or “returning” depending on whether the play order sequence number for the play event is 1 or any other number and also set the view_label for this view to the same name as the main page events view, so that the play order sequence number appears alongside the other play event fields in the main explore view.

view: website_play_sequence {
  view_label: "Page Events"
  derived_table: {
    sql: SELECT
       distinct_id,
       event_id,
       time,
       row_number() over (partition by distinct_id order by time) as play_order_seq
FROM
        `aerial-vehicle-148023.mixpanel.event` 
WHERE
        name = 'Podcast Episode Played'
       ;;
  }
dimension: distinct_id {
    hidden: yes
    type: string
    sql: ${TABLE}.distinct_id ;;
  }
dimension: event_id {
    hidden: yes
    primary_key: yes
    sql: ${TABLE}.event_id ;;
  }
dimension_group: time {
    hidden: yes
    type: time
    sql: ${TABLE}.time ;;
  }
dimension: play_order_seq {
    type: number
    sql: ${TABLE}.play_order_seq ;;
  }
dimension: new_or_returning {
    type: string
    sql: case when ${TABLE}.play_order_seq = 1 then 'new' else 'returning' end ;;
  }
}

I then join this new derived table to the page events view in the existing model explore on the event_id column common to both of them, like this:

explore: page_events  {
  join: unique_devices {
    relationship: one_to_many
    sql_on: ${page_events.distinct_id} = ${unique_devices.distinct_id} ;;
  }
  join: website_play_sequence {
    relationship: one_to_many
    sql_on: ${page_events.event_id} = ${website_play_sequence.event_id} ;;
  }

Now I can create a look that shows episode play events over time broken down by new or returning device, and I can see that about two-thirds of listens each month come from returning devices.

1_mdxRGHXwcEc8oZkmNL1DQg.png

If we pivot on the episode play sequence number and then turn the episode numbers themselves into a measure, we can then show for each individual device the sequence of episodes they’ve downloaded.

1_RfWkCwlRnmufmY6YZ9PuGA.png

So far we’ve looked at devices playing podcast episodes on the site as one, homogenous group, working on the assumption that listeners visiting the site for the first time now are no different from ones who found us a year or more ago, and the site itself hasn’t changed over that period of time. 

But in reality we may have improved the usability of the site or made it easier for visitors to find and discover other episodes, and so I’m keen to understand whether the groups, or “cohorts” of users who joined us in more recent months are engaged for longer and churn at a slower rate than visitors who found the site before those changes were made.

I can return the month that each visitor’s device first played a podcast episode on the site using another SQL query, and whilst I’m pulling this together I might as well calculate how long each device has been accessing episodes on the site along with totals for episode play events and distinct episodes listened to for each visitor device.

SELECT distinct_id, 
       date_trunc(date(min(time)),MONTH) as first_visit_at, 
       timestamp_diff(max(time),min(time),day)+1 as listener_days,
       count(*) as total_plays,
       count(distinct JSON_EXTRACT(properties,'$.Episode')) as total_episodes
FROM   mixpanel.event
WHERE  name = 'Podcast Episode Played'
GROUP BY 1

I then use the feature in SQL Runner that generates a derived table view definition from an SQL query and then fine-tune the dimension definitions and convert the listener_days, total_plays and total_episode columns into a set of appropriate measures.

view: visitor_cohort {
  view_label: "Unique Devices"
  derived_table: {
    sql: SELECT distinct_id,
       date_trunc(date(min(time)),MONTH) as first_visit_at,
       timestamp_diff(max(time),min(time),day)+1 as listener_days,
       count(*) as total_plays,
       count(distinct JSON_EXTRACT(properties,'$.Episode')) as total_episodes
FROM   mixpanel.event
WHERE  name = 'Podcast Episode Played'
GROUP BY 1
 ;;
  }
  
  dimension: distinct_id {
    primary_key: yes
    hidden: yes
    type: string
    sql: ${TABLE}.distinct_id ;;
  }
  
  dimension: first_visit_at {
    type: date
    label: "Listener Cohort"
    sql: ${TABLE}.first_visit_at ;;
  }
  
  measure: listener_days {
    type: average
    sql: ${TABLE}.listener_days ;;
  }
  
  measure: avg_episode_plays {
    type: average
    sql: ${TABLE}.total_plays ;;
  }
  
  measure: total_episode_plays {
    type: sum
    sql: ${TABLE}.total_plays ;;
  }
  
  measure: avg_episodes {
    type: average
    sql: ${TABLE}.total_episodes ;;
  }  
}

Then I join this second derived table into the model explore, this time joining to the unique_devices view on their common distinct_id column as the additional columns I’m now adding in apply to visitor devices, not episode play events as in the previous example.

explore: page_events  {
  join: unique_devices {
    relationship: one_to_many
    sql_on: ${page_events.distinct_id} = ${unique_devices.distinct_id} ;;
  }
  join: website_play_sequence {
    relationship: one_to_many
    sql_on: ${page_events.event_id} = ${website_play_sequence.event_id} ;;
  }
  join: visitor_cohort {
    relationship: one_to_one
    sql_on: ${unique_devices.distinct_id} = ${visitor_cohort.distinct_id} ;;
  }
}

I can now create looks such as the one below, where I chart the size of each cohort and overlay the average number of episode each visitor device goes on to download to give me a combined view of how many new listeners we recruit each month and whether each group over time listens to more, or less episodes than the previous ones.

1_Q7Rf49igY6Mg6fzKBV9aGw.png

If I now add two more window function calculations to the SQL query used by the derived table that provides the sequence number for each episode play by a device, like this:

SELECT
       distinct_id,
       event_id,
       time,
       row_number() over (partition by distinct_id order by time) as play_order_seq,
       date_diff(date(time),date(first_value(time) over (partition by distinct_id order by time)),MONTH) as months_since_first_play_at,
       date_diff(date(time),date(lag(time,1) over (partition by distinct_id order by time)),DAY) as days_since_last_play
FROM
        `aerial-vehicle-148023.mixpanel.event`
WHERE
        name = 'Podcast Episode Played'

I can now create a heat-map for each cohort showing how their engagement maintains or falls-off for the six months after their first play of a podcast episode on the site.

1_ipWdMIM607Rkmxh2OOMFFQ.png

If I switch the two dimensions around and change the visualization type to a line chart using a logarithmic scale, you can now see how engagement falls-off (or “decays”) for each of my listener cohorts over the six months since they first play an episode on the site.

1_GfyCbXM4O1oOZWkq37ewyA.png

Or, finally, I could take the Days Since Last Episode Play measure and use it to show how time between episode plays increases for each cohort, with big increases in time elapsed being a strong indicator of that cohort losing interest in the podcast and “churning”.

1_35JXcKGt5uEuweMBUDbvjw.png

So whilst we’ve been analyzing listeners to a podcast and in-detail, only a small and perhaps skewed subset of overall listeners, the techniques we’ve used including cohorting visitors based on when they first download or purchase and then tracking their engagement and overall lifetime value are ones we’ve used recently when working on Looker projects for clients such as Colourpop in the US and Florence back home in the UK

Drop us an email at info@mjr-analytics.com or contact us on +44 7866 568246 if you’re looking to do something similar with your customer or product data, and in the meantime you can check out past episodes of the Drill to Detail podcast a the new home for the show, co-located with this blog on the MJR Analytics site at https://www.mjr-analytics.com/podcast.