Posts tagged Bigquery
Customer Journey and Lifetime Value Analytics using Looker, Google BigQuery, Stitch and dbt

One of the most common analytics use-cases I come across on client projects is to understand the “customer journey”; the series of stages prospects and customers typically go through from becoming aware of a company’s products and services through to consideration, conversion, up-sell and repurchase and eventually, to churn.

Customers typically interact with multiple digital and offline channels over their overall lifetime, and if we take the model of a consulting services business such as Rittman Analytics, a typical customer journey and use of those touch-points might look like the diagram below.


Each customer and prospect interaction with those touch-points sends signals about their buying intentions, product and service preferences and readiness to move through to the next stage in the purchase funnel.


Creating visualization such as these in Looker typically involves two steps of data transformation either through a set of Looker (persistent) derived tables, or as we’ve done for our own operational analytics platform, through two dbt (Data Build Tool) models sequenced together as a transformation graph.

To start, we select a common set of columns from each digital touchpoint channel as brought in by our project data pipeline tool, Stitch; Hubspot for inbound and outbound marketing communctions together with sales activity and closes/loses; Harvest for billable and non-billable client days; Segment for website visits and so on. For each source we turn each type of customer activity they record into a type of event, details of the event and a monetary value (revenue, cost etc) that we can then use to measure the overall lifetime value of each client later on. 


Note the expressions at the start of the model definition that define a number of recency measures (days since last billable day, last contact day and so on) along with another set that are used for defining customer cohorts and bucketing activity into months and weeks since those first engagement date.

    {{ dbt_utils.datediff('last_billable_day_ts', current_timestamp(), 'day')}} AS days_since_last_billable_day,
    {{ dbt_utils.datediff('last_incoming_email_ts', current_timestamp(), 'day')}} AS days_since_last_incoming_email,
    {{ dbt_utils.datediff('last_outgoing_email_ts', current_timestamp(), 'day')}} AS days_since_last_outgoing_email,
    {{ dbt_utils.datediff('last_site_visit_day_ts', current_timestamp(), 'month')}} AS months_since_last_site_visit_day,
    {{ dbt_utils.datediff('last_site_visit_day_ts', current_timestamp(), 'week')}} AS weeks_since_last_site_visit_day
      ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_ts) AS event_seq,
      MIN(CASE WHEN event_type = 'Billable Day' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_billable_day_ts,
      MAX(CASE WHEN event_type = 'Billable Day' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_billable_day_ts,
      MIN(CASE WHEN event_type = 'Client Invoiced' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_invoice_day_ts,
      MAX(CASE WHEN event_type = 'Client Invoiced' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_invoice_day_ts,
      MAX(CASE WHEN event_type = 'Site Visited' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_site_visit_day_ts,
      MAX(CASE WHEN event_type = 'Incoming Email' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_incoming_email_ts,
      MAX(CASE WHEN event_type = 'Outgoing Email' THEN event_ts END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS last_outgoing_email_ts,
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS first_contact_ts,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type = 'Billable Day' THEN date(event_ts) END)          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},MONTH) AS months_since_first_billable_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type = 'Billable Day' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},WEEK) AS weeks_since_first_billable_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type like '%Email%' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},MONTH) AS months_since_first_contact_day,
      DATE_DIFF(date(event_ts),MIN(CASE WHEN event_type like '%Email%' THEN date(event_ts) END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }},WEEK) AS weeks_since_first_contact_day,
      MAX(CASE WHEN event_type = 'Billable Day' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS billable_client,
      MAX(CASE WHEN event_type LIKE '%Sales%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS sales_prospect,
      MAX(CASE WHEN event_type LIKE '%Site Visited%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS site_visitor,
      MAX(CASE WHEN event_details LIKE '%Blog%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS blog_reader,
      MAX(CASE WHEN event_type LIKE '%Podcast%' THEN true ELSE false END)
          {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS podcast_reader
  -- sales opportunity stages
            deals.lastmodifieddate AS event_ts,
          customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          deals.dealname AS event_details,
          deals.dealstage AS event_type,
          AVG(deals.amount) AS event_value
          {  } AS customer_master
          {  } AS deals
          ON customer_master.hubspot_company_id = deals.associatedcompanyids
          {  } AS owners
          ON deals.hubspot_owner_id = CAST(owners.ownerid AS STRING)
          deals.lastmodifieddate IS NOT null
      {  }
  -- consulting days
            time_entries.spent_date AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
 AS event_details,
          CASE WHEN time_entries.billable THEN 'Billable Day' ELSE 'Non-Billable Day' END AS event_type,
          time_entries.hours * time_entries.billable_rate AS event_value
          {  } AS customer_master
          {  } AS projects
          ON customer_master.harvest_customer_id = projects.client_id
          {  } AS time_entries
          ON time_entries.project_id =
          time_entries.spent_date IS NOT null
      {  }
  -- incoming and outgoing emails
            communications.communication_timestamp AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          communications.communications_subject AS event_details,
          CASE WHEN communications.communication_type = 'INCOMING_EMAIL' THEN 'Incoming Email'
               WHEN communications.communication_type = 'EMAIL' THEN 'Outgoing Email'
               ELSE communications.communications_subject
          END AS event_type,
          1 AS event_value
          {  } AS customer_master
          {  } AS communications
          ON customer_master.hubspot_company_id = communications.hubspot_company_id
          communications.communication_timestamp IS NOT null
      {  }
  -- sales opportunity stages
            invoices.issue_date AS event_ts,
            customer_master.customer_id AS customer_id,
          customer_master.customer_name AS customer_name,
          invoices.subject AS event_details,
          'Client Invoiced' AS event_type,
          SUM(invoices.amount) AS event_value
          {  } AS customer_master
          {  } AS invoices
          ON customer_master.harvest_customer_id = invoices.client_id
          invoices.issue_date IS NOT null
      {  }
         invoices.issue_date AS event_ts,
         customer_master.customer_id AS customer_id,
         customer_master.customer_name AS customer_name,
         invoice_line_items.description AS event_details,
         'Client Credited' AS event_type,
          COALESCE(SUM(invoice_line_items.amount ), 0) AS event_value
          {  } AS customer_master
          {  } AS invoices
          ON customer_master.harvest_customer_id = invoices.client_id
          {  } AS invoice_line_items
          ON = invoice_line_items.invoice_id
      {  }
         (COALESCE(SUM(invoice_line_items.amount ), 0) < 0)
              pageviews.timestamp AS event_ts,
              customer_master.customer_id  AS customer_id,
              customer_master.customer_name AS customer_name,
              pageviews.page_subcategory AS event_details,
              'Site Visited' AS event_type,
              sum(1) as event_value
          {{ ref('customer_master') }}  AS customer_master
          {{ ref('pageviews') }} AS pageviews 
          ON customer_master.customer_name =
      {{ dbt_utils.group_by(n=5) }}
              invoices.paid_at AS event_ts,
          customer_master.customer_id AS customer_id,
              customer_master.customer_name AS customer_name,
              invoices.subject AS event_details,
              CASE WHEN invoices.paid_at <= invoices.due_date THEN 'Client Paid' ELSE 'Client Paid Late' END AS event_type,
              SUM(invoices.amount) AS event_value
              {  } AS customer_master
          LEFT JOIN {  }  AS invoices
              ON customer_master.harvest_customer_id = invoices.client_id
            invoices.paid_at IS NOT null
          {  }
      customer_name NOT IN ('Rittman Analytics', 'MJR Analytics')

Deploying this model and then querying the resulting table, filtering on just a single customer and ordering rows by timestamp, you can see the history of all interactions by that customer with our sales and delivery channels.


You can also plot all of these activity types into a combined area and bar chart, for example, to show billable days initially growing from zero to a level that we maintain over several months, and with events such as client being invoiced, incoming and outgoing communications and the client actually paying in-full at the end.


We can also do things such as restrict the event type reported on to billable days, and then use Looker’s dimension group and timespans feature to split the last set of weeks into days and week numbers to then show client utilisation over that period.


Of course we can also take those monetary values assigned to events and use them to calculate the overall revenue contribution for this customer, factoring in costs incurred around preparing sales proposals, answering support questions and investing non-billable days in getting a project over the line.


Our second dbt model takes this event data and pivots each customer’s events into a series of separate columns, one for the first event in sequence, another for the second, another for the third and so on.


Again, code for this dbt model is available in our project git repo.

WITH event_type_seq_final AS (
        user_session_id AS event_type_seq,
            SUM(is_new_session) OVER (ORDER BY customer_id ASC, event_ts ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
            SUM(is_new_session) {{ customer_window_over('customer_id', 'event_ts', 'ASC') }} AS user_session_id
                CASE WHEN event_type != LAG(event_type, 1) OVER (PARTITION BY customer_id ORDER BY event_ts ASC) OR last_event IS NULL THEN 1
                ELSE 0
                END AS is_new_session
                    CASE WHEN event_type LIKE '%Email%' THEN 'Presales'
                    WHEN event_type LIKE '%Bill%' THEN 'Delivery' ELSE event_type END AS event_type,
                    LAG(CASE WHEN event_type LIKE '%Email%' THEN 'Presales' WHEN event_type LIKE '%Bill%' THEN 'Delivery' ELSE event_type END, 1) OVER (PARTITION BY customer_id ORDER BY event_ts ASC) AS last_event
                    {  }
                ORDER BY
                    customer_id ASC,
                    event_ts ASC,
                    event_type ASC
                ) last
             ORDER BY
                customer_id ASC,
                event_ts ASC,
                event_type ASC
           ) final
        ORDER BY
            customer_id ASC,
            is_new_session DESC,
            event_ts ASC
        is_new_session = 1
    {  }
    MAX(event_type_1) AS event_type_1,
    MAX(event_type_2) AS event_type_2,
    MAX(event_type_3) AS event_type_3,
    MAX(event_type_4) AS event_type_4,
    MAX(event_type_5) AS event_type_5,
    MAX(event_type_6) AS event_type_6,
    MAX(event_type_7) AS event_type_7,
    MAX(event_type_8) AS event_type_8,
    MAX(event_type_9) AS event_type_9,
    MAX(event_type_10) AS event_type_10
        CASE WHEN event_type_seq = 1 THEN event_type END AS event_type_1,
        CASE WHEN event_type_seq = 2 THEN event_type END AS event_type_2,
        CASE WHEN event_type_seq = 3 THEN event_type END AS event_type_3,
        CASE WHEN event_type_seq = 4 THEN event_type END AS event_type_4,
        CASE WHEN event_type_seq = 5 THEN event_type END AS event_type_5,
        CASE WHEN event_type_seq = 6 THEN event_type END AS event_type_6,
        CASE WHEN event_type_seq = 7 THEN event_type END AS event_type_7,
        CASE WHEN event_type_seq = 8 THEN event_type END AS event_type_8,
        CASE WHEN event_type_seq = 9 THEN event_type END AS event_type_9,
        CASE WHEN event_type_seq = 10 THEN event_type END AS event_type_10
{  }

Pivoting each customer’s events in this way makes it possible then to visualize the first n events for a given set of customers using Looker’s Sankey custom visualization, as shown in the final screenshot below.

Untitled 6.png

More details on our own internal use of Looker, Stitch, dbt and Google BigQuery to create a modern operational analytics platform can be found in this earlier blog post, and if you’re interested in how we might help you get started with Looker, check out our Services page.

How Rittman Analytics does Analytics: Modern BI Stack Operational Analytics using Looker, Stitch, dbt and Google BigQuery

As well as creating modern BI stack analytics solutions for clients such as Florence, Let’s Do This and Colourpop using technologies from our partners at Looker, Stitch, Segment, Fivetran and Snowflake, it’s probably no surprise that we’ve used those same technologies and project experience to build-out our own internal analytics platform, covering data and operational analytics use-cases such as

  • Showing high-level KPIs that show progress towards our four main business objectives

  • Providing actionable context for these high-level KPIs and the ability to drill-into the detail behind them

  • Enabling ad-hoc querying and analysis of our complete dataset by technical and non-technical users

  • Providing a trusted, integrated analysis-ready data platform for other internal projects


I thought it might be of interest to go through how we’ve used the tools and techniques we use for client projects but in this case to deliver an analytics solution for our own internal use, sharing some of our thinking around how we defined our underlying KPI framework, the tools and design patterns we used when building out the data management part of the platform, and some of the front-end components and data visualizations we’ve put together using the Looker BI tool.

Defining your KPI Framework

When creating any sort of analytics solution the first thing you need to understand is - what are your business objectives; what do you need to measure to understand your progress towards those objectives; and how do you then define success?

Key Performance Indicators (KPIs) and measures within our analytics platform are based around what's needed to measure our progress towards four business objectives we set ourselves for FY2019:

  1. Increase Billing Revenue

  2. Increase Profitability

  3. Increase customer retention, and

  4. Increase our operational and delivery efficiency

In-practice, we “certify” data around these four primary KPIs and also a number of secondary measures, with data in the platform normally up-to-date as of the previous business day and with historical data going back to when we started the business.


Other metrics, measures and dimension attributes are provided alongside these certified KPIs and measures and are considered “best endeavours” in terms of our certifying their accuracy, and users then of course able to create their own calculations for use in reports and analyses outside of this formal curation.

Platform Architecture

At a high-level our internal analytics platform extracts data on a regular basis from the SaaS-based services we use to run our business operations, joins that data together so that each dataset uses the same definition of clients, products and other reference data, then makes that combined dataset available for analysis as a set of business KPIs and an ad-hoc query environment.


The design of our internal analytics platform was based around a modular BI technology stack, ELT ("Extract, Land and Transform") data warehouse loading pattern and modern software development approach that we use on all of our client analytics development engagements.

Data Pipeline and Source Connectivity

We’re digital-first, paperless and run our business operations using cloud-hosted software services such as:

  • Harvest for timesheets, project invoicing and client-rechargeable expenses

  • Harvest Forecast for project resource planning and revenue forecasts

  • Xero for accounting and payroll

  • Hubspot CRM for sales and contact management

  • Docusign for contract management

  • Jira for project management and issue tracking

  • BambooHR for human resources and recruitment

On the delivery side, Rittman Analytics partners with four different data pipeline-as-a-Service software vendors, each of which targets a different segment of the market and that segment’s particular use-cases:

  • Stitch, who focus on small to mid-market startup customers with data engineers who value service extensibility and their land-and-expand pricing model ($100/month for up to five sources, including Hubspot and Xero certified and Harvest community-supported connectors)

  • Fivetran, aimed more at mid-market to enterprise customers and the data analyst personas, with a more turnkey solution and increased levels of service and coverage of enterprise data sources

  • Segment, more of an "enterprise service bus" for event-level digital data streams, connecting multiple event producers with downstream event consumers and offering additional features around customer journey mapping, personas and schema validation

  • Supermetrics, aimed at growth hackers and digital marketers with a focus on simple self-service setup, comprehensive data dictionaries and a comprehensive catalog of advertising, marketing and social data sources delivered through a Google Sheets add-in or more recently, a direct connector into Google BigQuery through the GCP Marketplace.

We chose to use Stitch's service for our data pipeline for this internal project due to their out-of-the-box ability to source data from Harvest and Harvest Forecast; functionally both Fivetran and Stitch would meet our needs other than support for these two particular sources, and we've used Fivetran extensively on other internal and client projects in the past.


We also use Segment's event tracking service to track visitor activity on our website and land those raw tracking events also in our BigQuery project, and intend to make use of Segment's Personas product in due course to enable us to build out 360-degree views of our clients, site visitors and prospects based on their interactions across all of our various digital touchpoints (web, social media, inbound and outbound lead generation, chat etc)

Data Transformation, Data Integration and Orchestration

We transform and integrate raw data that Stitch syncs from each of our SaaS application sources into an integrated, query-orientated data warehouse dataset using the open-source dbt toolkit.

Once we'd decided to work with a data pipeline-as-a-service such as Stitch together with a SQL-based data management platform like Google BigQuery, the decision to transform and integrate our data via a series of SQL SELECT was the obvious next design choice; by using dbt and version-controlling our scripts in a Github repository we increased our productivity when developing these transformation, adhered to modern software design principles and avoided cut-and-paste scripting in-favour of templated, maintainable code.


Transformation steps in dbt are basically SQL SELECTs with references to metadata models replacing hard-coded table names and use of macros as reusable, often cross-database transformation components giving three main benefits over hand-written, author-specific code:

  • Increased analyst productivity through leveraging libraries of best-practice transformation components

  • More maintainable loading processes with concise logic that “doesn’t repeat yourself” (DRY)

  • Portable code that can be deployed on BigQuery today or Snowflake tomorrow if required

For example, as shown in the example code snippet below.

    FROM (
            {{ dbt_utils.datediff(start_date, end_date, 'day')}} +1 AS forecast_days,
            a._sdc_sequence ,
            MAX(a._sdc_sequence) OVER (PARTITION BY ORDER BY a._sdc_sequence 
            {{ ref('harvest_forecast_assignments') }} a
        INNER JOIN
            {{ ref('harvest_forecast_people') }} p
            ON person_id =
        _sdc_sequence = latest_sdc_sequence

In addition to the open-source dbt toolkit we also use the commercial dbtCloud service from our friends at Fishtown Analytics, mainly in order to then be able to execute our transformation graph in the cloud, like this:


dbtCloud also hosts the data dictionary automatically generated from the metadata in our dbt package, and diagrams the transformation graph and data lineage we’ve defined through the dependencies we created between each of the models in the package.


As well as standardising each data source and mapping each source's set of customer references into a single, master customer lookup table that allows us to analyze operations from sales prospect through project delivery to invoicing and payment, we also turn every customer touchpoint into events recorded against a single view of each customer, like this:


We then assign a financial revenue or cost value to each event in the customer journey, calculate metrics such as time between interactions (to measure engagement), value delivered to the client and use that information to better understand and segment our clients so as to provide the advice and service they're looking for at this particular stage in their lifecycle and relationship with our team.


Analytics and Data Visualization

We then use Looker as our primary analytics tool over the data prepared by dbt and stored in Google BigQuery. Everyone's homepage in Looker is set by default to show those four main operational KPIs with comparison to target and with the last six months’ trend history, providing the context when combined with Looker’s ability to go deep into the detail behind those headline numbers so that users can decide and action on what they’re seeing (note that I’ve altered and obfuscated our actual numbers in the next few screenshots)


We've also created a single business operations-wide Looker explore that enables users to analyze and explore all aspects of our sales, delivery and financial relationship with clients, suppliers, partners and prospects.

In the example below we've again using the event model and applying sequencing to those events bucketed by months since first billable day, to help us understand how the the demand decay curve looks for each cohort of clients we've taken on over the first year we've been operating.


Embedded Analytics using Powered by Looker

We also link fields in our Looker explore back to the online services that can provide more background detail on a metric displayed on the screen, in the example below linking back to Harvest to see details of the project displayed in the explore data visualisation.


Finally, for team members who are more likely to be working within our knowledge-sharing and internal company portal site, we use Powered by Looker embedded analyics within Notion, connecting our internal analytics metrics and insights directly into the productivity and workflow tools used by the team, bringing data and analytics to the problems they’re looking to solve rather than the other way around.


So, hopefully this look behind the scenes at how we use these modern analytics stack technologies for our own analytics needs has been useful, and feel free to leave comments or contact me at if any of this would be applicable to your business. In the meantime, the dbt and Looker git repos used to build-out our platform are available as public git repos on Github:

Feel free to fork, improve on what we’ve done, submit PRs or just see a bit more of the detail behind the examples in this post.

Event-Level Digital Analytics using Google Analytics, Fivetran, BigQuery and Looker

A few weeks ago I posted a blog on using Mixpanel, Looker and Google BigQuery to analyze listener data from the Drill to Detail website, with Mixpanel tracking individual episode play events by visitors to the site and Fivetran replicating that event data over to BigQuery for analysis using Looker.

Mixpanel is great but like most digital businesses using Google Analytics (GA) to analyze and measure activity on their website, we actually use Google Tag Manager and GA events to track visitor activity on the website. Whilst a full site license for Google Analytics 360 is way beyond our budget right now, the free version of GA together with GTM provides us with an unsampled, event-level stream of visitor events and together with goals and simple A/B testing tools a way to start optimizing the content on our site. 

Couple all of this with Fivetran and Looker and we can start to do some interesting analytics on all this data to not only count sessions, visitors and pages viewed but also start looking at visitor retention and the routes those visitors take through the various features on our website.

Events in Google Analytics are a way of recording interactions visitors to your site make with items on a page such as selecting an item on a navigation menu, clicking on a download link or seeing a special offer or product recommendation. On the website we’ve setup events, for example, to record how far down the page a visitor scrolls so we can see how many people actually scroll past the hero image on our front page and look at the content on Looker services, Oracle services and so on.

Untitled 2.001.png

Another event records clicks on the specific hero image a visitor clicks on with another recording what navigation menu item was clicked on, in each case storing the image name or menu item clicked on as the event label.

Untitled 2.002.png

Whilst you can record events being triggered by adding custom Javascript to your website’s HTML page definitions, we use GTM’s graphical point-and-click interface to set-up our events and the visitor actions that trigger them.

Untitled 2.005.png

Enabling GTM’s debug mode for our site shows us which events then trigger when visiting our site, with the screenshot below showing a typical visitor journey through our homepage and the events that are triggered as they scroll-through and interact with various items on the page.

Untitled 2.003.png

Data on events being triggered along with event label and event action values are stored in a Javascript data layer that can also receive values sent over, for example, by an eCommerce application so that clicks on a product catalog item can record the price on offer as well as what product was clicked on. 

Looking at the Data Layer view in Google Tag Manager you can see the event entries recorded for this visitor to the site page, but note also how nothing personally identifiable is recorded except for a unique client ID set at the device/browser level that can then be used to analyze new vs. returning visitors and track visitor retention and engagement over time.

Untitled 2.004.png

As we then did with the Mixpanel example and the Drill to Detail website, we then use Fivetran to replicate all of the data layer events over to Google BigQuery, storing this granular visitor behavioural data in a BigQuery table and dataset. As you can see from the screenshot below we use Fivetran extensively within MJR Analytics; replicating invoices and all our other accounting data from Xero into BigQuery along with timesheet data from Harvest, CRM data from Hubspot, commits and PRs from Github, tasks outstanding and completed from Asana and search rankings and keywords from Google Search Console all into a Google BigQuery project whose datasets we then combine and analyze together using Looker.

Untitled 2.006.png

For the event data coming across from GA we use Fivetran to replicate the relevant event tracking dimensions and metrics into BigQuery every five minutes, giving us near real-time numbers on what’s popular and what’s engaging on our company website.

Untitled 2.007.png

As the event data from GA arrives as table rows of individual events being triggered, I then sessionize those individual events into rows of visitor sessions, using BigQuery SQL to pivot the incoming data and create individual columns for the first event, second event, first page view and first product item clicked on, for example, within an individual visitor session. The actual SQL used for the view is too long really to include here but to get the idea, here’s a condensed version with repeating parts removed for clarity.

  MAX(visitor_session_length_mins) AS visitor_session_length_mins,
    MAX(e.visitor_session_event_seq_num) OVER (PARTITION BY e.global_session_id) AS session_events_count,
      global_session_id = e.global_session_id
      AND event_category = 'Page View') AS page_view_count,
      global_session_id = e.global_session_id
      AND event_category = 'Nav Bar Clicks') AS nav_bar_click_count,
    e1.event_label AS event_1_label,
    e2.event_category AS event_2_category,
    e2.event_label AS event_2_label,
    pv.event_label AS page_view_1,
    pv2.event_label AS page_view_2,
    igc.event_label AS index_gallery_first_click
    `aerial-vehicle-148023.dw_staging.ga_events` e
      ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) page_view_seq_num
      event_category = 'Page View'
      3) pv
    e.global_session_id = pv.global_session_id
    AND pv.page_view_seq_num = 1
      ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) page_view_seq_num
      event_category = 'Page View'
      3) pv2
    e.global_session_id = pv2.global_session_id
    AND pv2.page_view_seq_num = 2
    `aerial-vehicle-148023.dw_staging.ga_events` e1
    e.global_session_id = e1.global_session_id
    AND e1.visitor_event_seq_num = 1
    `aerial-vehicle-148023.dw_staging.ga_events` e2
    e.global_session_id = e2.global_session_id
    AND e2.visitor_event_seq_num = 2

Creating a view or materialized table from this BigQuery SQL then gives me one record per session with event totals, page view and event paths along with session and client IDs that are perfect then for bringing into Looker for further analysis.

Untitled 2.008.png

Then finally it’s just a case of bringing that table of data into Looker as a LookML view and using one of Looker’s custom visualization blocks to display visitor event paths as a Sankey diagram, for example.

Untitled 2.009.png

Or we can aggregate and analyze counts of events from sessions to see in graphical form which offers, download items and navigation menu items are most popular over a given period.

Untitled 2.010.png

If you’d like to understand more about how MJR Analytics can help you better understand what visitors are looking at and interacting with on your website then drop me an email at; or even better, if you’re in Copenhagen on November 22nd and have the evening free, come along to the free meetup we’re hosting together with our partners CIMA and Looker and you’ll see me demo this all live, in person and show you how it works.

Introducing MJR Analytics … and How Two Years Go So Fast When You’re Learning Something New

Today I’m excited to be launching MJR Analytics, a new consulting company focusing on modern, cloud analytics projects using technology from Looker, Qubit, Fivetran, Oracle and Snowflake and development techniques learnt from my time working as an analytics product manager at a startup in London.


Our new website (and no that’s not me sitting in the chair)

So what have I been up to in the two years since I left my old consulting company, and how has that experience and the way I’ve been working with analytics technologies over that time inspired me to start another one?

Two years ago I announced on Twitter that I’d left the company I’d co-founded back in 2007 and intended to now take on a new challenge, and then spent the rest of the week at Oracle Open World cycling over the Golden Gate Bridge and coming back on the ferry and trying to decide what that challenge might actually be.


For most of my time in the IT industry I’d been working on projects implementing products from vendors such as Oracle and I’d always been interested in how these products came to market, how software vendors came up with a strategy and roadmap for those products how the team behind those products worked with the engineers who built them.

I’d also become increasingly interested in the startup world and towards the end of time time at Rittman Mead had taken-on an informal role advising Gluent, Tanel Poder and Paul Bridger’s product startup who were building software that enabled big enterprise customers to offload their data warehousing workloads from expensive proprietary databases onto to cheap, flexible storage and processing running on Hadoop clusters.

What appealed to me about working more formally with Gluent was the opportunity it gave me to work with two smart founders and an even smarter development team developing a product built entirely on big data technology I’d until then only scratched the surface with on consulting gigs. The product marketing role I took on was all about establishing what market that product was primarily intended for, how we went about positioning the product to appeal to that market and how we then brought that product to market.

Understanding these four things are crucial if you’re going to actually get customers to buy your startup’s product:

  • who is the buyer

  • what problem does your product solve

  • what is the value solving that problem, and

  • why you’re the first product to solve it for them

otherwise you’ll spend your time building a solution to a problem that nobody actually has, and that’s the reason the majority of tech startups end-up failing. Solving a problem for a market that’s willing to pay you money to solve is called “product/market fit” and if your product has it, and you’ve built your business such that it scales linearly as more customers discover your product, then you’re going to make a lot more money than a consultancy constrained by how many hours in the week your consultants can work and the upper limit on how much you can charge for a single person’s time.

I also learnt the distinction between product marketing, product management and product development in my time at Gluent. Going back to my time as a consultant attending product roadmap sessions at conferences I never quite knew which parts of the product team those speakers came from, but in summary:

  • Product Marketing is about taking a product that’s typically already built and then deciding the product’s positioning and messaging, then launching the product and ensuring the sales team, sales engineering and customers understand how it works and what it does; as such, this is a marketing role with a bit of technical evangelism thrown in

  • ProductDevelopment is the actual building of the product you’re looking to sell, and requires an engineering skillset together with the inspiration that typically came up with the product idea in the first place along and an entrepreneurial side that made you want to build a company around it

  • Product Management is more of a customer-facing role and is about understanding what your customers want and what their problems and use-cases are, and then creating a strategy, roadmap and feature definition for a product that will meet those needs

Despite my undoubted product marketing skills based around PowerPoint and internet memes:


In product marketing, it’s never too soon to put a Santa hat on a photo of the founder

in the end it I realised that it was product management that interested me the most and, after a couple of meetings with an old friend who used to run product management at Oracle for their business analytics product line and who had recently moved to London and now lead the product team team at Qubit, a technology startup created by four ex-Googlers building marketing technology products based-around Google’s big data and cloud technology, I joined their team later in 2016 as product manager responsible for the analytics features on their platform.

I spoke about the Qubit and the partnership we established with Looker back in May last year at a presentation at Looker’s JOIN 2017 conference in San Francisco and the slide deck below from that event goes into the background to the product and the problem it solves, helping customers using Qubit’s personalization platform make more effective use of the data we collected for them.

The product and data engineering teams at Qubit did an excellent job bringing together the features for this product and in hindsight, the bits I was most proud of included:

  • The business metadata layer we created on Google BigQuery and Google Cloud Platform to translate an event-level normalized many-to-many data model designed for fast data ingestion into a denormalized, dimensional data model designed for easy use with BI and ETL tools

  • Additional integration we created for the Looker BI tool including a set of industry vertical-specific Looker models and dashboards we then made available on the Looker Block Directory and in a Github public repo


Screenshot from Personalization Analytics Block for Looker by Qubit

  • The multi-tenant data warehouse and hosted Looker instance we then put together to enable customers without their own Looker instance to make use of their data in Google BigQuery, doing so in a way that supported per-tenant extensions and customizations by the customer or their implementation partner.


Technical Architecture for Live Tap as presented at Looker JOIN 2017

What I’ll take-away from my time at Qubit though was the incredible amount that I learnt about product management, product engineering, how to build and run a successful startup and team who are still highly-motivated seven years in and the introduction it gave me to the analytics and data-led world of digital marketing, eCommerce and modern data analytics platforms.

Consulting is a popular route into product management and the experience I brought to the role in areas such as business metadata models, analytical techniques and the needs of BI and ETL developers proved invaluable over the eighteen months I worked as part of Qubit’s product and engineering teams, but moving into product management within a young, technology-led startup founded by ex-Googlers and working with some of the smartest and most innovative people I’ve ever met involved learning a whole new set of skills including:

  • Developing on a new technology platform (Google Cloud Platform) within a new industry (eCommerce and digital marketing) and understanding a whole new set of analytics use-cases and customer roles (A/B testing, stats models and event-based analytics used by analysts and strategists within eCommerce businesses) that I described in a presentation at last year’s UK Oracle User Group Tech Conference in Birmingham:

  • Working as part of a team rather than directing that team, and managing -up as well as down, a technique I had to relearn pretty quickly in my first few months in the role

  • Learning to achieve my goals through influence rather than in the top-down way I’d been used to getting things done leading customer projects, and as CTO and owner of the company that team worked for

  • Saying no to customers rather than yes as you did as a consultant, as your objective is to build a product that solves the most important customer needs but doesn’t burden you with so many features addressing niche use-cases that you end up with Homer’s car and can’t innovate the product in future releases

  • How to take a product through its lifecycle from identifying a need that makes sense for your company to meet, through prototyping, alpha and beta releases to successful first launch and then creating a strategy and roadmap to manage that product over its complete lifecycle

  • How to use a new generation of modern, cloud-native data analytics tools such as Looker together with products such as FiveTran, Google Cloud Platform, Qubit, Snowflake DB and Snowplow Analytics that were increasingly also being adopted by the FinTech, MarTech and B2C startups clustering in London and other European/North American tech hubs

I learnt so much from my colleagues at Qubit about products, engineering and building a successful and motivated team that put up with my jokes and built the most technologically-advanced marketing personalization platform on the market.

But what my time at Qubit also made clear to me was that, when it came down to it, what really motivated me to get up in the morning, learn all these new technologies and still be wildly excited to come into work in the morning twenty years later was:

  • using data and analytics to find new insights and uncover new opportunities in a customer’s data set

  • working with that individual clients, over time, to enable them to find more of those insights and opportunities themselves

  • find new innovations in analytics technologies and how we deliver projects to make this process cheaper, faster and more likely to be successful

  • and building a team, and crucially a business, to do all of this at scale and offer a full set of analytics-related consulting services built around modern analytics tools and delivery techniques

Which is why after two years away from the consulting business and two enjoyable, rewarding and enlightening years working on the other side of the data and analytics industry I’m now launching my new consulting company, MJR Analytics; and I hope to be working with many of you as clients or members of our team over the coming months and years.

Date Partitioning and Table Clustering in Google BigQuery (and Looker PDTs)

Google BigQuery is a data warehousing-orientated “table-as-a-service” product from Google Cloud Platform that, like Oracle Exadata, optimizes for full-table scans rather than selective row access and stores data organized into columns, rather than rows, to align better with filtering and aggregation workloads associated with data warehousing workloads.

BigQuery charges by amount of data stored in your tables and the data you’ve read in executing SQL queries against those tables, so BI tools such as Looker that work efficiently with BigQuery only request the columns they need when querying BigQuery datasets rather than running a SELECT(*) and throwing away what’s not needed.

To illustrate this let’s run a query that requests all columns (“SELECT (*) FROM …”) from a BigQuery table, and as you can see from the screenshot below it’s reading through all 1.47GB of table data to return that full set of columns to the user.


If the users’ query only really needed just two of those columns, requesting just those brings down the amount of data read to just under 10MB as most of of that table’s data is stored in columns of data that aren’t needed for the query.


BigQuery historically has supported table partitioning based on the date you loaded data into the table which sounds great until you realise that it’s the transaction date, not the ingest date, that most of your user’s queries filter against.

You could also use table decorators in Legacy SQL to point to the particular day partition your data was stored within but this only went back for a maximum of seven days and required your query tool to support this non-standard feature of SQL; earlier in this year though Google introduced a more flexible form of date partitioning as a beta release feature that allows you to choose the date column your table would be partitioned by, and more recently introduced a feature called table clustering that stores data within a table sorted by the columns you most commonly filter on when running queries against it.

To show how date partitioning and table clustering work, I’ll start by running a query to return just a month’s data from the five years of data held within my table; as you can see in the screenshot below, BigQuery performs a full table scan and reads through all 1.37 GB of data in the table to return just the month of data my query requested.


Standard SQL now supports DDL commands such as CREATE TABLE and CREATE TABLE … AS SELECT, along with a PARTITION BY clause that lets you specify a timestamp or date column to partition the table by. I’ll use these new features to create a copy of that same table, this time partitioned by the timestamp column I’m filtering on in my query predicate …


… and the DDL statement fails. What’s happened there then?

Turns out that BigQuery tables are limited to 2500 partitions for the moment, with any one single load operation limited to 2000 (hence the error) and with partitioning limited to just date and timestamp columns and partitions a day in length it means any one table can only hold around five years worth of data, beyond that you’ll need to create multiple date partitioned tables and UNION them together through a SQL view.

For now though I load my table with just five years of data and then re-run the query that requests a single day from that overall five years; now BigQuery has only read and processed 57 MB of data and it’d be a fraction of that if I only requested the two columns I needed, not all columns from that table.


But what about queries that filter against the other columns in this table? We can’t set up table partitioning on STRING, INTEGER or any other type of column datatype so my original query if re-run against the date partitioned table reads just as much data as it did before.


What we could do is re-create the table with its data pre-sorted by those two particular columns using another new feature called table clustering, so that queries that filter against those columns find the blocks of data they’re looking for faster and can skip completely the ones that don’t.

If like me you’re more familiar with databases such as Oracle, “table clustering” is all about storing data from tables sharing a common cluster key together in the same data blocks, so that queries against that group of tables filtering on that key return data faster.

Table clustering in BigQuery is more analogous to loading regular Oracle tables using data from a pre-sorted file and comes with the same benefits and limitations; in BigQuery’s case it takes care of the pre-sorting and table configuration for you but the same limitations still apply around how you filter the table and what happens when you load more data afterwards.

Let’s set-up a clustered table now that stores its data ordered by the two columns used in the query I ran a few moments ago.


Now when I run a query filtering on those columns against this partitioned, clustered table the amount of data read goes down compared to before, and results are returned a bit faster; if I included the partition key column in the query as well, returning just a few days’ data, it’d be faster still.


But queries that filter using any of the other columns don’t benefit from this clustering and moreover, the benefits of sorting the data initially loaded into a clustered table are lost over time as you load more (unsorted) data into it, meaning that to really benefit from clustering a table you have to rebuild it regularly using a CREATE TABLE … AS SELECT.

Table clustering in BigQuery is nice to have but similar to pre-sorting a direct path load into Oracle database tables, it’ll take a lot of rebuilding and careful querying to get the most benefit from and with the size of most BigQuery tables, I doubt that rebuilding will happen much in-practice.

BI tools such as Looker can make use of table partitioning and clustering in queries straight away as no changes are required in the query SQL you write, everything is handled in the table definition. Where you might want to set up partitioning yourself as a Looker developer is for the persistent derived tables (PDTs) that Looker can create to materialize the results of a view you define in your Looker model to derive a dimension attribute using a subquery, for example to calculate the sequence number of a users’ order for retention analysis for an eCommerce dashboard as shown in the screenshot below.


Looker has for some time come with database-specific settings for particular source database types such those used for Redshift used in the example above, and now supports date partitioning for persistent derived tables through a new partition_keys setting as announced in this recent Looker forum post.


Finally, if you’re interested in how partitioning is developing as a feature within BigQuery, and some of the edge-cases and requests for additional functionality that users of partitioning are asking for, this feature request on the BigQuery issue tracker is useful to read-through.

Extending support to more than 2500 partitions seems to be the most popular request along with allowing integer and string datatype columns to be used for the partition key, but also look out for issues around re-loading data into historic partitions and the cost and work involved in rebuilding large tables to re-cluster its data or apply partitioning for the first time.

First Impressions of Oracle Autonomous Data Warehouse Cloud

Regular readers of this blog will know that my day-to-day work involves ingesting, storing and then analyzing petabytes of data using Google BigQuery, one of a new generation of fully-managed, cloud-hosted data warehouse platforms-as-a-service. Services like BigQuery take care of all the database infrastructure management and capacity planning leaving you just to define tables, load data into them and then run queries using SQL or your favourite BI tool.

Oracle recently announced Oracle Autonomous Data Warehouse Cloud (ADWC) that competes in this market and given my background in Oracle technology I was interested to see how it stacked-up compared to BigQuery and similar services from the likes of Amazon AWS and Snowflake. To do this I took advantage of Oracle’s free $300 service credit offer and signed-up for an ADWC instance – note that at the time of writing the ADWC service is only available in Oracle Public Cloud’s North American datacenter region as opposed to the European ones so make sure you choose the former when creating your Oracle Cloud account.

Setting up an Oracle on-premises database was something once-upon-a-time I’d spent months and thousands on course fees studying for certifications to be able to do confidently, so it was nice to see how things had got a lot simpler with setting up an ADWC instance with the first step being to select ADWC from the Oracle Cloud Services menu, like this:

Then answer a few simple questions: the name to give the instance, number of CPUs and amount of storage to set aside initially, and the password for the administrator account.

Setup of an instance takes a few minutes, then its ready. Not bad.

The next big change you notice compared to Oracle’s customer-managed Database Cloud Service is the replacement of Enterprise Manager Database Express 12c and DBaaS Console with a single Service Console, accessed not through the usual SYS and SYSTEM superuser accounts but through a new superuser account called ADMIN, presumably adopting a naming standard more familiar to users of other SaaS and PaaS platforms.

The Service Console contains a couple of performance-tracking pages with the emphasis on read-only viewing of those stats vs. giving you controls to fine-tune database parameters or table structures, and a page where you can download a zip file containing a wallet directory with TNSNAMES.ORA and encrypted connection details to your ADWC instance.

I seem to remember at the time of Autonomous Data Warehouse Cloud’s launch that many in the Oracle community (myself included, most probably) said that all that the “autonomous” part of Autonomous Data Warehouse Cloud did was automate the Oracle Database maintenance tasks that more recent data warehousing platforms-as-a-service handled for you automatically. Another common critique at that time was that under the covers ADWC was more scripting and use of existing “advisors” than true autonomous artificial intelligence-style database management.

To be honest though, it doesn’t matter; compared to running a regular Oracle database you just fill in a short form and the platform runs without any further intervention from you, you don’t need to know how it works, job done in terms of competing with BigQuery and Redshift for ease-of-use and maintenance.

Connecting to ADWC is a bit different to regular Oracle Database connections in that it’s easier — no need to specify host, port and SID/service name — but this involves downloading a ZIP file wallet and the tool you wish to connect having support for that wallet. Oracle SQL Developer in its most recent 18c version has that support making it the obvious tool to define new users, create tables and upload data into them and then run simple SQL queries against your data.

Oracle ADWC is based on Oracle Exadata Database Server technology and supports PL/SQL, full INSERT/UPDATE/DELETE support in SQL as well as SELECT, parallel query (enabled by default) and hybrid columnar compression.

What you don’t get are indexes and materialized views as these generally aren’t considered necessary when running on Exadata hardware that’s optimized for full table scans of data arranged in compressed columnar format, and as ADWC charges per CPU per hour there’s no penalty in scanning lots of detail-level data unnecessarily as you get with BigQuery.

Also missing are features such as the Oracle OLAP Option, Oracle Spatial and Graph and Oracle Text that I’m guessing might get implemented at a later date, along with management features such as Oracle Tuning Pack that are no longer needed when the autonomous features of the platform do the database tuning for you.

Price-wise, although as I said a moment ago you do get charged by the amount of data you scan (and a bit for storage) with Google BigQuery, Oracle AWDC seems to be more expensive to run at least at the data volumes I’d be using as a developer initially evaluating the product.

If you commit to monthly charging then ADWC costs around $1200/cpu/month whereas Google charge $5/TB for queries with the first 1TB each month free plus a token amount for storage and streaming inserts, which meant my bill last month for BigQuery came to just $1.53 and my total bill for all Google Cloud Platform services including multiple hosted VMs and extensive usage of ther NLP and Geocoding APIs took the total up to just over $150. Clearly Google are aiming more at the startup market whereas Oracle are serving enterprise customers, but don’t make the assumption that just because AWDC is elastic it’s also super-cheap for hobbyist use, long-term.

So now its a case of loading some sample data into ADWC using SQL Developer and then running a couple of test queries to see it working in action with my data.

Tomorrow, I’ll show how Looker be connected using the newly announced support it has for Oracle ADWC and later on in the week do the same for Oracle Analytics Cloud and DV Desktop.

The Drill to Detail Podcast : 50th Episode Special and Top 10 Episodes by Download

I started the Drill to Detail Podcast series back in 2016 in order to talk about some of the innovations coming out of the big data analytics space with the product teams, industry analysts and consultants working together to deliver new-world, cloud-hosted big data analytics platforms.

A year-and-a-half later we’re at Episode 50 and to mark the occasion Stewart Bryson, our first guest on the show back at Episode 1 and a regular on the show since, comes back on and is joined by Alex Gorbachev to talk about Google Cloud Platform, Looker, FiveTran and SnowflakeDB, with the episode now available for download along with all 49 other episodes on the Drill to Detail Podcast website …

… and as we’re now at the 50th episode of the show, here’s the Top 10 episodes by download statistics as of January 2018 — thanks to everyone who’s come on the show and everyone who’s listened to the show via iTunes and other podcast providers, and congratulations to Maxime Beauchemin:

  1. Drill to Detail Ep.26 ‘Airflow, Superset & The Rise of the Data Engineer’ with Special Guest Maxime Beauchemin
  2. Drill to Detail Ep.21 ‘The Fall, and Rise, of Microsoft BI’ With Special Guest Chris Webb
  3. Drill to Detail Ep.31 ‘Dremel, Druid and Data Modeling on Google BigQuery’ With Special Guest Dan McClary
  4. Drill to Detail Ep.34 ‘Qlik, MSAS and the Strategy of Data Analytics’ with Special Guest Donald Farmer
  5. Drill to Detail Ep.48 ‘Mondrian OLAP, Apache Calcite and Database Dis-Aggregation’ With Special Guest Julian Hyde
  6. Drill to Detail Ep.25 ‘SAP BusinessObjects Analytics, AI and Digital Transformation’ With Special Guest Timo Elliott
  7. Drill to Detail Ep.27 ‘Apache Kafka, Streaming Data Integration and Schema Registry’ with Special Guest Gwen Shapira
  8. Drill to Detail Ep. 5 ‘SnowflakeDB, and is Data Modeling Dead?’ with Special Guest Kent Graziano
  9. Drill to Detail Ep.23 ‘Looker, BigQuery and Analytics on Big Data’ With Special Guest Daniel Mintz
  10. Drill to Detail Ep.6 ‘Data Capital and the Economics of Big Data’ with Special Guest Paul Sonderegger
Monzo, BigQuery, Looker and “Fintech”: The Other London Tech Startup Scene

I’ve written a number of blogs over the past few months about bringing my fitness, smart devices and other digital data sources into Google BigQuery and Looker to do interesting analysis with the combined dataset, whilst also keep my tech skills up to date and relevant for my day job as Product Manager for Analytics at Qubit, a London tech startup in the hot new MarTech (“Marketing Technology”) space.

At Qubit we also use Google BigQuery and Looker to enable our customers to run fast, interactive queries against petabytes of consumer behavioural data whilst BigQuery abstracts away the hassles in running a production data warehouse, whilst Looker abstracts away the complexities of SQL, joining datasets together and trying to visualize it all in D3 or RStudio.

One big gap in the data I’d managed to collect up until recently was anything to do with what I actually spent each day; whilst my current bank along with most others provides a means to manually download statement data in Excel or CSV format, or supports commercial services that load that data into your personal finance software for a monthly fee, I wanted to access my data in the same way that I accessed my health, timekeeping and cycling workout data … through a REST API, cURL and a cron job that loads new data into BigQuery on the hour every day.

So that’s why I smiled when I heard about Monzo, another tech startup just a mile or so from my office in London that’s disrupting another industry: consumer banking, with Monzo, Starling and other new app-only banks taking advantage of new regulations such as the EU’s new Payment Services Directive (PSD2) that open up existing bank accounts to authorised third-parties looking to provide a better user experience (UX) over your existing bank arrangements; or as in Monzo’s case, actually obtaining a banking license and offering a complete end-to-end service that focuses on solving customer problems rather than you providing a cheap source of money to lend for the bank.

Along with other technology-led startups working in the financial services space Monzo are the best example yet of a compelling new financial services company with a strong appeal to consumers more familiar today with Facebook than chequebooks.

Image Courtesy of Medici, “Awe-Inspiring London FinTech Industry Is Firing on All Cylinders”

Me, I was more interested in the fact that Monzo also use Looker and Google BigQuery to run their analytics infrastructure, and that they also offer a developer-orientated REST API that provides exactly what I’ve been looking for to enable myself and increasingly others in the developer community to download, analyze and visualize their spending data and start to build applications that when the full Monzo API comes out will revolutionize how we do our banking in the coming years.

To start analyzing my banking data through Monzo I first needed to do a historic batch download of all my historic transactions and top-ups and put that into a BigQuery table, and then setup a trigger on Monzo’s systems that sends across all the subsequent transactions as they happen to keep my data in-sync with Monzo’s record of my bank account.

To set up the initial transaction history transfer I first registered at and then generated an access token for my original pre-paid Monzo card and the current account card that recently replaced it. Note that this access token only lasts for a short time so you can’t generate it just once and use it in a script forever, and more importantly the developer API is meant just for personal use and can’t be used to access other people’s accounts, so you can’t build one of the new apps made possible by the UK’s new open banking regulations just yet (but this is coming soon through their interim AIS API now in testing with a small number of third-party developers)

To find out the account IDs for your Monzo accounts either use the API playground web app as I’ve done in the screenshot above, or use the Developer API now for the first time to get them yourself, via the cURL tool:

curl -G “" -H “Authorization: Bearer <<YOUR_ACCESS_TOKEN_HERE>>" ./accounts.json

and then retrieve your account IDs from the JSON output through a tool such as jsonpp (brew install jsonpp if you’re on a Mac, similar tools for other OSs)

"accounts": [
"id": "acc_<<MY_PREPAID_CARD_ID>>"
"created": "2017-04-06T07:07:15.175Z",
"description": "Mark Rittman",
"type": "uk_prepaid"
"id": "acc_<<MY_CURRENT_ACCOUNT_ID>>"
"created": "2017-11-28T08:35:28.323Z",
"description": "Mark Robert Rittman",
"account_number": "<<MY_AC_NUMBER>",
"sort_code": "<<MY_SORTCODE>>",
"type": "uk_retail"

Then you can retrieve all your account transactions with another cURL request, like this one for my current account:

curl -G "" -d account_id=acc_<<MY_CURRENT_ACCOUNT_ID>> -H "Authorization: Bearer <<YOUR_ACCESS_TOKEN_HERE>>" > /transactions.json

and then you can use tools such as “sed” (to strip out the transactions: []) array from the json output) and one of the many json2csv converters out there (for example this one that has npm as a prerequisite, something I’m guessing most people attempting this sort of project will have already, if not then first install npm via brew and you’re ready)

sed -i -e 's/{"transactions"://g' ./transactions.json
sed -i -e 's/]}/]/g' ./transactions.json
brew install json2csv
json2csv -i ./transactions.json -f created,amount ,description,currency,category,include_in_spending,<<ANY_OTHER_REQUIRED_FIELDS>> -o transactions.csv

There are many more fields available to extract from the JSON documented downloaded via the developer API depending on whether you’re accessing a Monzo prepaid card or current account; the developer docs on the Monzo site go through some of them but you’re best going through a handful of transactions to see what’s available and then create your own custom -f extract list.

Then as the final step I use the bq command-line tool to load the data in the CSV file output in the previous step into a BigQuery table, like this:

bq load --source_format=CSV --replace=true --skip_leading_rows=1 --autodetect dw_staging.monzo_transactions_stg ./transactions.csv

and then go into the BigQuery Web UI to check my transaction data has come in as expected.

In-practice for my analysis exercise I brought in a lot more fields including merchant name and map X/Y location, Foursquare metadata and other fields that are only available for the current account product, so check through the JSON elements available for your particular Monzo product and extract the ones that most interest you.

So what about capturing transactions after this date, how do we do that without some complicated process that downloads all the transactions again and then applies just the ones that are new to the BigQuery table? The way I bring in all my other app, smart device and other event data into BigQuery is to either post them directly to a fluentd server running on Google Compute Engine or use services such as IFTTT and Zapier that interface with my apps and then post to fluentd using outgoing webhook services, or their “Maker Channel” as IFTTT used to call their service when I created for a blog post back in 2016.

The Monzo Developer API also comes with a webhook feature that lets you register a URL, for example the one use when posting all my other event data to fluentd, with all subsequent transaction data then being posted to that server URL and then onto the various BigQuery tables that store my data for analysis.

In the case of the Monzo transactions I added an intermediate step and had the Monzo webhook post to a Zapier “zap” that made it easier to decode the nested JSON elements Monzo was sending over that held merchant and other payment processor-related information.

Zapier then sends over just those elements of the transaction I’m interested in to the fluentd server which then streams them into BigQuery using the fluentd BigQuery plugin. From there I can then write SQL to analyze this data in BigQuery’s web UI, remembering to divide the transaction amount by 100 as they’re all stored as pence in integer datatypes within the Monzo system, and to remove any top-ups, declines and and other non-spend transactions from the results by filtering on the “include_in_spending” column to remove all records with “false” as the predicate value.

Analyzing the data in a tool such as Looker makes things much more interesting as I can start to make use of the locational metadata for merchants that Monzo provide for current account transactions, showing me for example where I took out cash from ATMs on the way back home from the pub and then forgot I’d done so the morning after.

Or I can see how my spend on essentials like entertainment and eating out rises in the summer, especially when I appear to do away with non-essentials like food and other groceries.

I can see where it was I spend the most money when out on cycling trips in the summer, mostly in pubs around the New Forest and up near Bath, Frome and Warminster (and how I wish winter would end and I can get back out on my bike again).

I can even work out how my choice of restaurant and entertainment changed from lots of spend on music in August through to lots of eating out in October.

And more importantly, I can see I’m spending far too much money on coffees and breakfast on the way into work each day, but that’s the downside I guess of working just off Covent Garden in what must be one of the nicest work locations I’ve ever had the privilege to base myself out of.

The beauty of Looker is that I can define a data model in LookML that takes table structures and SQL expressions and turns them into more user-friendly measure and dimension definitions, like this:

These LookML “views” can then be joined to other views that share a common join key — in my case, the truncated “date” part of a date and time timestamp field — that then makes it possible for me to analyze spend data in the context of where I was on that day and that time; whether I was out on the bike and if so whether I’d had a good workout or not so good; what the weather was like on the day and what I’d been chatting about on Slack at the time; whether the spend was for an Uber ride and if so where to.

Analysis that shows me, now I’m finally able to join my financial and activity dots together for the first time, where all my money’s gone since the nights got darker and I’m clearly in the pub rather than out on my bike. Lucky spring is around the corner or I’ll be applying for Monzo’s new overdraft service that’s just gone into review rather than registering new Strava personal records next month.

Druid, Imply and Looker 5 bring OLAP Analysis to BigQuery’s Data Warehouse

Back in the good old days of on-premise data warehousing projects where data was structured, the software came on CDs and data engineering was called ETL development (and was considered the most boring job on a project, how times have changed) a typical technical architecture had sources on the left-hand side, a relational database hosting the data warehouse in the middle, and a multi-dimensional OLAP (or “MOLAP”) server on the right-hand side serving up data — fast — to specialist query and reporting tools.

Diagram courtesy of Readings in Database Systems, 3rd Edition, Stonebraker & Hellerstein, eds, 1996.

OLAP (Online Analytical Processing) Servers were a great complement to the relational database servers that hosted data warehouses back in those days by taking a subset of the whole dataset, structuring it into a dimensional model then storing it as a indexed arrays of leaf-level and pre-computed aggregates, served up results with split-second response times for any combination of dimension filters or level of aggregation requested … and users loved them.

The first ten years of my consulting career were spent working with OLAP Servers and if you’re interested in their history I’d recommend you check out Episode 37 of the Drill to Detail Podcast where I talked about arguably the last pure MOLAP Server still actively sold and implemented, Oracle’s Essbase Server, with Cameron Lackpour; Episode 34 of the Drill to Detail Podcast with Donald Farmer, the original Product Manager behind Microsoft Analysis Services and Episode 11 of the Drill to Detail Podcast with Graham Spicer, my original mentor and the single person most responsible for the nearly twenty-year career I’ve had since then in consulting, data warehousing and now product management, on Oracle’s original Express Server and then OLAP Option technologies.

But just like mainframes that reached perfection just at the time when PCs and mini-computers made them obsolete, OLAP Servers fell out of favour as data volumes and data types exploded whilst time available for loading them via batch processes just disappeared. On-premise data warehouses eventually transformed into elastic, cloud-hosted data warehouse platforms provided as fully-managed services such as Google BigQuery, Snowflake DB and Oracle Autonomous Data Warehouse Cloud Service and were accompanied by a new generation of BI tools like Looker, aimed at data-driven tech startups needing to analyze and understand vast amounts of consumer activity and other event-level behavioural data, as I talked about in my session at the recent Looker Join 2017 conference in San Francisco on Qubit, BigQuery and Looker for petabyte-scale analytics.

But BigQuery is, at the end of the day, a query and compute engine optimized for data warehouse-style queries and workloads albeit at a scale unimaginable ten years ago; Druid, an open-source project first announced in a white paper back in 2014 and now arguably the standard for new-world distributed data stores optimized this time for sub-second response times, may be the OLAP Server to BigQuery’s data warehouse.

To be clear, BigQuery and other distributed query engines like it are fast, particularly when filtering, sorting and aggregating single wide tables of columnar-organized data as you can see in the video below where I query and aggregate around four-and-a-half million smart device events to find out the average monthly temperature in each of the rooms in my house.


BigQuery supports joins between large tables, uses ANSI-standard SQL and more recently has benefited from a number of improvements to improve the response time for small queries as well as large ones, but compared to OLAP servers that typically pre-compute in-advance all the different aggregations and store data indexed and organized by the dimensions that users filter results by, it’s definitely a general-purpose database engine rather than a single-purpose OLAP server, and all query aggregations have to be computed on-the-fly.

Druid, originally authored by Eric Tschetter and Fangjin Yang at Metamarkets in 2011 and described in-detail in this white paper from 2014 explicitly re-implements key features of old-school OLAP servers by pre-aggregating incoming real-time streaming and batch data and storing it in a more compressed form, organizes that compressed data as time-based segments bitmap-indexed by dimensions and then presents data out as OLAP cubes to client applications.

Image courtesy of “What is Druid”, image downloaded in Oct. 2017

Druid has some significant limitations compared to more general-purpose analytic database engines such as BigQuery; it doesn’t support table joins right now (though it may do at the time you read this, as an open-source project it evolves rapidly), its primary client interface is JSON over HTTP, and most importantly for organizations that moved to BigQuery because it runs as infrastructure-as-a-service you have to take care of server upgrades, capacity scaling and all the other infrastructure management tasks that we thought we’d said goodbye to with data warehouse-as-a-service platforms.

But companies offering services and management tools to manage Druid as just another platform service are starting to emerge and courtesy of the Apache Calcite project it’s now possible to query Druid using regular SQL queries, a capability Looker recently took advantage of to offer Druid connectivity as one of the new features in their recent Looker 5 release, as you can see me demonstrating in the video below.


But just as old-school OLAP servers worked best with query tools specifically designed to work with them, new open-source BI tools such as Superset (from the same engineering team at Airbnb that also brought us Airflow, my current ETL orchestration tool of choice) connect directly to Druid clusters and come close to their commercial rivals in terms of reporting and dashboard features offered to end users; in the video below you can see me creating a percentage-changed line-graph showing how the amount of time I spend cycling each month changed over time, using the same Druid datasource as in the other videos.


Superset, Looker and other BI tools that now support Druid are of course great but the one that’s really got my interest, and prompted me to look further into Druid and how it complements BigQuery and other data warehouse cloud platform cloud services is Imply, a startup launched by one of the original co-authors of Druid who, not unlike Looker who reinvented the enterprise BI platform for the big data and startup world, are reintroducing that same world to OLAP analysis whilst making the Druid back-end much easier to manage.

Imply runs either on-premise as open-source software you can download and then install on local or cloud-hosted VMs, or run as platform-as-a-service through a contract with Imply. Druid is one of the more complex and involved analytic database types to load but Imply’s cloud service makes it simple to spin-up, manage and then ingest data into your Druid cluster either as real-time streaming sources, or via batch loads from Amazon S3 or other popular datasources.

Images courtesy of Imply Cloud Quickstart docs page

I’ve got Imply running on my own Google Compute Engine infrastructure-as-a-service platform so take care of server management and data ingestion manually, but for me the standout feature in Imply’s platform is Pivot, their open-source OLAP query tool. If any reader is old enough to remember OLAP client tools such as Oracle Express Sales Analyzer and ProClarity you’ll recognize Pivot’s use of terms such as cubes, dimensions, aggregation types and measures as shown in the screenshot of my setup below…

… but more importantly, you’ll recognise the structured query environment and lightning-fast response to my queries against that same set of four-and-a-half million IoT and other events that I extracted from my BigQuery environment and then loaded and stored in compressed column-stored segments pre-aggregated and indexed by the same dimension fields I’m now analysing it by.


Well they say nothing’s new in fashion or music if you wait long enough, and sure enough as I said in my tweet a couple of years ago…


… yes it does make me feel old, but it’s great to see such a powerful concept as multidimensional OLAP storage and dimensional models being rediscovered by the big data and startup worlds.

Qubit’s Journey to Petabyte-Scale Machine-Learning and Analytics on Google Cloud Platform … and…

My guest on this week’s Drill to Detail Podcast is Alex Olivier from Qubit, a startup based in London founded by four ex-Googlers that uses big data technology and machine learning to deliver personalized experiences and product recommendations to customers of some of the biggest names in e-commerce, travel and online gaming.

Qubit’s innovation in the space was to move beyond simple A:B testing and cookie-based personalization to create an event-level, petabyte-scale customer activity data lake running in Google Cloud that enables retailers and other organizations deliver personalize offers and site features based on a much more granular understanding of customer behavior and preferences, the same vision in fact that I kept putting forward in my webinars and presentations last year around data reservoirs and customer 360-degree analysis.

And of course, this centralized, event-level store of customer activity and purchase preferences creates a fantastic platform on which to build predictive models, real-time next-best-offer decision engines…

… and enable real-time big data analytics — with Qubit’s product in this area, Live Tap, being what I’ve been working on since last year’s Openworld advising their Product Management team and working under Paul Rodwick, who some of you might know from his time as head of Oracle’s BI Product Development team. I’ll write about Live Tap and the work I’m doing there, and my experiences creating an analytics product on top of BigQuery at a later date, as well as our use of Looker to create a semantic model over Qubit’s event-level data lake.

The podcast episode with Alex Olivier talks about Qubit’s journey from initially using Amazon AWS to land and process data using S3 buckets and MapReduce, then moving it all onsite to a cluster of thousands of HBase region servers storing data ingested and processed using Storm with latency down to four hours, to their current setup using Google BigQuery, Google PubSub and Google Cloud Dataflow processing 100,000 events per second and making it all available to customers with latency around 5 seconds — seriously impressive and a great case study around the use of cloud-hosted, elastically-provisioned big data analytics platforms.

There’s also a video of Alex presenting on Qubit’s architecture at last year’s Google NEXT event in London on Youtube, and this episode along with all the others is also available for download on the iTunes Podcast Directory where you can subscribe for free and automatically download new episodes as they become available Tuesday of every week.