Posts tagged Looker
Looker London Meetup on Thursday, February 7th 2019 - Registration Now Open!
meetup2.png

If you're a developer or end-user working with the Looker Data Platform and you're in London on Thursday 7th February 2019, we're very pleased to be launching, together with Jon Palmer from GoCardless, the first ever Looker London Dev Meetup, sponsored by the good folks at Looker and hosted at GoCardless' HQ at 65 Goswell Road, London EC1V 7EN.

The aim is to create a space for showcasing and discussing what data people in London are building in and around the Looker ecosystem, and for this first meetup we have three great customer and partner speakers sharing their insights on Looker development and project implementations:

  • "Scale securely with the Looker API" - Dylan Baker, Consultant

  • "Internal and External Multi-tenancy with Looker" - Ian Thompson, King

  • "Increase End-User Engagement with Looker 6 System Activity Model" - Mark Rittman, MJR Analytics

followed by a panel session where the speakers are joined by one of our guests from Looker:

  • “Data Analytics Panel - Hear from industry experts and Looker users from GoCardless, King, MJR Analytics and Looker.”

Registration and attendance are free but spaces are limited, so make sure you register soon if you'd like to come along. The meetup starts at 5.30pm with drinks, food and introductions from Jon; the full agenda is on the meetup page and we're aiming to wrap up around 9pm with closing comments from me and a short survey on future meetup content.

Event-Level Digital Analytics using Google Analytics, Fivetran, BigQuery and Looker

A few weeks ago I posted a blog on using Mixpanel, Looker and Google BigQuery to analyze listener data from the Drill to Detail website, with Mixpanel tracking individual episode play events by visitors to the site and Fivetran replicating that event data over to BigQuery for analysis using Looker.

Mixpanel is great but, like most digital businesses using Google Analytics (GA) to analyze and measure activity on their website, we actually use Google Tag Manager (GTM) and GA events to track visitor activity on the mjr-analytics.com website. Whilst a full site license for Google Analytics 360 is way beyond our budget right now, the free version of GA together with GTM gives us an unsampled, event-level stream of visitor events and, along with goals and simple A/B testing tools, a way to start optimizing the content on our site.

Couple all of this with Fivetran and Looker and we can start to do some interesting analytics on all this data to not only count sessions, visitors and pages viewed but also start looking at visitor retention and the routes those visitors take through the various features on our website.

Events in Google Analytics are a way of recording the interactions visitors make with items on a page, such as selecting an item on a navigation menu, clicking on a download link or seeing a special offer or product recommendation. On the mjr-analytics.com website we've set up events, for example, to record how far down the page a visitor scrolls, so we can see how many people actually scroll past the hero image on our front page and look at the content on Looker services, Oracle services and so on.

Untitled 2.001.png

Another event records which hero image a visitor clicks on, and another records which navigation menu item was clicked, in each case storing the image name or menu item as the event label.

Untitled 2.002.png

Whilst you can trigger events by adding custom JavaScript to your website's HTML pages, we use GTM's graphical point-and-click interface to set up our events and the visitor actions that trigger them.

Untitled 2.005.png

Enabling GTM's debug mode for our site shows us which events fire as visitors move through the site, with the screenshot below showing a typical visitor journey through our homepage and the events that are triggered as they scroll through and interact with various items on the page.

Untitled 2.003.png

Data on the events being triggered, along with event label and event action values, is stored in a JavaScript data layer. The data layer can also receive values sent over by, for example, an eCommerce application, so that clicks on a product catalog item can record the price on offer as well as which product was clicked on.

Looking at the Data Layer view in Google Tag Manager you can see the event entries recorded for this visit to the page; note also how nothing personally identifiable is recorded except for a unique client ID, set at the device/browser level, that can then be used to analyze new vs. returning visitors and track visitor retention and engagement over time.

Untitled 2.004.png

As we did with the Mixpanel example and the Drill to Detail website, we use Fivetran to replicate all of the data layer events over to Google BigQuery, storing this granular visitor behavioural data in a BigQuery table and dataset. As you can see from the screenshot below we use Fivetran extensively within MJR Analytics: invoices and all our other accounting data from Xero, timesheet data from Harvest, CRM data from Hubspot, commits and PRs from Github, tasks outstanding and completed from Asana, and search rankings and keywords from Google Search Console are all replicated into a Google BigQuery project whose datasets we then combine and analyze together using Looker.

Untitled 2.006.png

For the event data coming across from GA we use Fivetran to replicate the relevant event tracking dimensions and metrics into BigQuery every five minutes, giving us near real-time numbers on what’s popular and what’s engaging on our company website.

Untitled 2.007.png
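
To give a quick sense of what that feed looks like once it lands in BigQuery, here's a minimal query sketch that counts the events recorded for sessions starting in the last day; it assumes the same dw_staging.ga_events table and column names used in the sessionization SQL further down this post.

-- Hedged sketch: recent event volumes by category, to confirm the five-minute
-- Fivetran loads are keeping the table close to real-time
SELECT
  event_category,
  COUNT(*) AS events,
  COUNT(DISTINCT visitor_id) AS visitors
FROM
  `aerial-vehicle-148023.dw_staging.ga_events`
WHERE
  DATE(visitor_session_start_date_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY
  event_category
ORDER BY
  events DESC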

As the event data from GA arrives as table rows of individual events being triggered, I then sessionize those events into rows of visitor sessions, using BigQuery SQL to pivot the incoming data and create individual columns for, say, the first event, second event, first page view and first product item clicked on within each visitor session. The actual SQL used for the view is really too long to include here, but to give you the idea, here's a condensed version with repeating parts removed for clarity.

SELECT
  global_session_id,
  visitor_session_id,
  visitor_id,
  visitor_session_start_date_time,
  MAX(visitor_session_length_mins) AS visitor_session_length_mins,
  visitor_first_event_week_cohort,
  city,
  session_events_count,
  ...
  event_1_category,
  event_1_label,
  ...
  page_view_1,
  page_view_2,
  ...
FROM (
  SELECT
    e.global_session_id,
    e.visitor_session_id,
    e.visitor_id,
    e.visitor_session_start_date_time,
    e.visitor_session_length_mins,
    e.visitor_first_event_week_cohort,
    e.city,
    MAX(e.visitor_session_event_seq_num) OVER (PARTITION BY e.global_session_id) AS session_events_count,
    (
    SELECT
      COUNT(*)
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      global_session_id = e.global_session_id
      AND event_category = 'Page View') AS page_view_count,
    (
    SELECT
      COUNT(*)
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      global_session_id = e.global_session_id
      AND event_category = 'Nav Bar Clicks') AS nav_bar_click_count,
    ...,
    e1.event_label AS event_1_label,
    e2.event_category AS event_2_category,
    e2.event_label AS event_2_label,
    ...
    pv.event_label AS page_view_1,
    pv2.event_label AS page_view_2,
    ...,
    igc.event_label AS index_gallery_first_click
  FROM
    `aerial-vehicle-148023.dw_staging.ga_events` e
  LEFT OUTER JOIN (
    SELECT
      global_session_id,
      event_label,
      ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) page_view_seq_num
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      event_category = 'Page View'
    ORDER BY
      3) pv
  ON
    e.global_session_id = pv.global_session_id
    AND pv.page_view_seq_num = 1
  LEFT OUTER JOIN (
    SELECT
      global_session_id,
      event_label,
      ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) page_view_seq_num
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      event_category = 'Page View'
    ORDER BY
      3) pv2
  ON
    e.global_session_id = pv2.global_session_id
    AND pv2.page_view_seq_num = 2
  ...
  JOIN
    `aerial-vehicle-148023.dw_staging.ga_events` e1
  ON
    e.global_session_id = e1.global_session_id
    AND e1.visitor_event_seq_num = 1
  LEFT OUTER JOIN
    `aerial-vehicle-148023.dw_staging.ga_events` e2
  ON
    e.global_session_id = e2.global_session_id
    AND e2.visitor_event_seq_num = 2
  ...
  )
GROUP BY
  1,
  2,
  ...,
  46
ORDER BY
  1

Creating a view or materialized table from this BigQuery SQL then gives me one record per session, with event totals, page view and event paths, and the session and client IDs that make it perfect for bringing into Looker for further analysis.

Untitled 2.008.png

Then finally it’s just a case of bringing that table of data into Looker as a LookML view and using one of Looker’s custom visualization blocks to display visitor event paths as a Sankey diagram, for example.

Untitled 2.009.png

Or we can aggregate and analyze counts of events from sessions to see in graphical form which offers, download items and navigation menu items are most popular over a given period.

Untitled 2.010.png
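
If you want the same aggregation directly in SQL rather than through a Looker explore, a sketch along these lines works against the same staging table; the 'Nav Bar Clicks' category comes from the events described earlier, while 'Index Gallery Clicks' is an assumption based on my GTM setup and should be adjusted to match your own tags.

-- Hedged sketch: most-clicked navigation items and hero images, by event label
SELECT
  event_category,
  event_label,
  COUNT(*) AS clicks,
  COUNT(DISTINCT visitor_id) AS unique_visitors
FROM
  `aerial-vehicle-148023.dw_staging.ga_events`
WHERE
  event_category IN ('Nav Bar Clicks', 'Index Gallery Clicks')
GROUP BY
  1, 2
ORDER BY
  clicks DESC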

If you'd like to understand more about how MJR Analytics can help you work out what visitors are looking at and interacting with on your website, drop me an email at mark.rittman@mjr-analytics.com. Or even better, if you're in Copenhagen on November 22nd and have the evening free, come along to the free meetup we're hosting together with our partners CIMA and Looker, where you'll see me demo all of this live, in person, and show you how it works.

Media Analytics for the Drill to Detail Podcast using Looker, Fivetran and Mixpanel

Most readers of the MJR Analytics blog will no doubt be aware of the Drill to Detail Podcast on the big data analytics industry that we present and host, with the most recent episode covering the Looker JOIN 2018 event in San Francisco with Tristan Handy and Stewart Bryson as special guests. Things have been a bit busy over the past month with the launch of MJR Analytics and then the Looker and Oracle conferences but we try and upload a new episode every few weeks, with the next episode — fingers crossed — having a rather special guest if we can get our schedules aligned.

Listeners to the show typically either use services such as Apple's iTunes Podcast Directory to download episodes and play them on their iOS mobile devices, or they visit the show's website at https://www.drilltodetail.com where they can download or play episodes through an embedded audio player and click through the links in the show notes that accompany each episode.

Apple provides statistics at episode level to podcast owners via its new Podcast Analytics service, but you're limited to the charts and fixed ways the app allows you to analyze and explore your dataset. If you can find a way to scrape or download this data and load it into a database service such as Google BigQuery, though, you can instead use a tool such as Looker to analyze and visualize your listener data in more interesting ways that are also very applicable to customer and product analytics for any digital or eCommerce site, as I'll explain as we go through the examples.

To start, the Looker visualization below shows, for each episode of the show, how many times in total that episode was listened to by users of Apple's Podcasts Directory service on their iPhones, iPads and Mac OS devices, with the most listened-to episode in any single month being our 50th Episode Special that featured Stewart Bryson and Alex Gorbachev as our two special guests.

1_ilhCgPXXaqpil4tKSkJd2g.png

Apple also provides data on how much of each episode was completed by the listener, so you can get a sense of how long episodes should last and how engaged your audience was with individual episodes. In the Looker visualization below you can see that our listeners typically get through around half to two-thirds of each episode, but were particularly engaged with the Wes McKinney episode on Apache Arrow, Python Pandas and his blog post comparing Arrow with ORC and Parquet, which was generating a lot of discussion in the open-source and big data community at the time.

1_mzVle2MGHq9KlDgpwB7aQQ.png

So far the way I've analysed downloads and listener numbers for the podcast is not too dissimilar to how businesses used to only measure sales per month, without really thinking about how many of the customers and visitors coming onto their site each month were new or in fact returning customers looking to buy more.

These days knowing how many of your customers are new or returning is the first thing you’d want to understand as an eCommerce manager because it’s cheaper to sell to an existing customer than to go out and recruit a new one, and one of your main priorities is to try and turn those first-time buyers into loyal, repeat purchasers who then become your most valuable customers.

My focus on episodes rather than listeners was in reality more due to Apple not providing episode download and listening data at the granularity of an individual user or device ID, but some time ago I also set up Mixpanel, a JavaScript event-tracking service similar to Heap and Google Analytics' custom events, to track when visitors played episodes of the show on the site through their browser, as shown in the screenshots below.

1_9-dnYMTQqFJ1E-pfBnmNDg.png

Data is still only logged at unique device level as there’s no registration facility on the site or other means to tie downloads to particular users, but now we have access to additional details of each download such as where the user was located and what type of browser and OS they were using.

Mixpanel stores the events it tracks in a database service it hosts, but you can use a managed data replication service such as the one from our partners at Fivetran, shown in the screenshot below, to copy that data into a warehousing platform such as BigQuery, as I've done here.

1_NmcCG2ET75sX-M4htrOO7Q.png
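
Once the connector has run you can query the replicated table directly in BigQuery; the sketch below uses the column names that appear in the LookML views later in this post, and counts episode plays by country and operating system.

-- Hedged sketch: episode plays by country and OS from the Fivetran-replicated Mixpanel events
SELECT
  mp_country_code AS country,
  os,
  COUNT(*) AS plays,
  COUNT(DISTINCT distinct_id) AS unique_devices
FROM
  `aerial-vehicle-148023.mixpanel.event`
WHERE
  name = 'Podcast Episode Played'
GROUP BY
  1, 2
ORDER BY
  plays DESC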

Going back to Looker I now create two new LookML views against the single table of Mixpanel-tracked events that Fivetran replicated into BigQuery for me, the first of which is keyed on the unique distinct_id column and contains dimensions and measures for the listener's device used for accessing the site and playing back episodes of the show.

view: unique_devices {
  derived_table: {
    sql: SELECT
  distinct_id,
  device,
  mp_country_code,
  os, region, screen_height, screen_width
FROM
  `aerial-vehicle-148023.mixpanel.event`
WHERE
  name IN ('Clicked Nav Link',
    'Podcast Episode Played')
group by 1,2,3,4,5,6,7 ;;
  }
  dimension: distinct_id {
    primary_key: yes
    label: "Device ID"
  }
  dimension: device {
    sql: case when ${TABLE}.os in ('Windows','Mac OS X','Linux') then 'Desktop'
              when ${TABLE}.os = 'Chrome OS' then 'Chromebook'
              else device end;;
  }
  dimension: os {}
  dimension: region {}
  dimension: screen_height {}
  dimension: screen_width {}
  dimension: mp_country_code {
    label: "Country"
  }
  measure: count_unique {
    type: count_distinct
    sql: ${TABLE}.distinct_id ;;
  }
measure: count {
    type: count
  }
}

The second view is keyed on event_id and contains the dimensions and measures used for analyzing and counting playback events by those listeners’ devices.

# page events view definition
view: page_events {
  derived_table: {
    sql: SELECT
  REPLACE(JSON_EXTRACT(e.properties,
      '$.Episode'),'"','') AS episode,
      REPLACE(JSON_EXTRACT(e.properties,
      '$.url'),'"','') AS nav_url,
      episode_number, 
      distinct_id,
      current_url, 
      event_id, 
      initial_referrer, 
      initial_referring_domain, 
      name as event_type, 
      referrer, 
      referring_domain, 
      search_engine, 
      time
FROM
  `aerial-vehicle-148023.mixpanel.event` e
left outer join `aerial-vehicle-148023.dw_staging.episode_list` l
on REPLACE(JSON_EXTRACT(e.properties,
      '$.Episode'),'"','') = l.episode
WHERE
  e.name IN ('Clicked Nav Link',
    'Podcast Episode Played') ;;
  }
dimension_group: event {
    type: time
    timeframes: [
      date,
      week,
      month,
      hour_of_day,
      day_of_week,
      day_of_month
    ]
    sql: ${TABLE}.time ;;
  }
  dimension: event_type {}
dimension: event_id {
    primary_key: yes
    hidden: yes
  }
  dimension: distinct_id {
    hidden: yes
  }
  dimension: episode {
    sql: replace(${TABLE}.episode,'Drill to Detail','') ;;
  }
dimension: episode_number {
    sql: lpad(cast(${TABLE}.episode_number as string),2,'0') ;;
  }
dimension: nav_url {}
  dimension: referring_domain {}
  dimension: search_engine {}
  measure: event_count {
    type: count
  }
  measure: unique_device_count {
    type: count_distinct
    sql: ${TABLE}.distinct_id ;;
  }
measure: unique_devices {
    type: count_distinct
        sql: ${TABLE}.distinct_id ;;
  }
measure: episode_num {
    type: average
    sql: ${TABLE}.episode_number;;
  }
}

Then, a new explore is defined within the LookML project to join these two views and make them available for analysis in the Looker Explore page.

# explore definition
explore: page_events  {
  join: unique_devices {
    relationship: one_to_many
    sql_on: ${page_events.distinct_id} = ${unique_devices.distinct_id} ;;
  }
}

Creating a look similar to the one from a moment ago, but using episode play events recorded by Mixpanel as the measure, I can see that whilst most episodes had the majority of their downloads in the month they were published, the episode featuring Jonathan Palmer from King Games on how they used Looker to design and optimize games such as Candy Crush Saga had downloads well after the initial month it became available, telling me that content on this type of topic perhaps has a longer "shelf-life" than others we've covered.

1_-Ow4UumPL1f-PWYrEu2kQQ.png

If I use the "merge results" feature introduced with Looker 5 to combine results from two or more queries, I can combine the download numbers per episode, per month, per unique device ID from both Apple and from Mixpanel's tracking of plays on the website to give me a combined listener count for each episode. You can see from the screenshot below that the Episode 50 Special is still the most downloaded episode, and the later one with Jonathan Palmer does seem to be listened to for a lot longer than most other episodes.

1_BLGORMX1o0bXwXrJk0nvBQ.png

Except it's not. What the chart above actually shows is that Episode 50 was the most downloaded in any one month, but if you take each episode and chart total downloads overall, it's actually the Christmas and New Year Special featuring Christian Berg that had more unique device downloads.

1_tsGaA0vUlcjDVgvdGlSokw.png

Putting the above two merged-data looks into a dashboard along with two others, showing the split between iTunes and website downloads over time and those same numbers expressed as a share of all downloads over time, you can get a sense of how long each episode kept being downloaded after first publication, the split for each episode by download platform and how listener numbers for each method of download have changed over time. You can see in the next chart how downloads via iTunes correlate closely with new episodes being published, with listener numbers dropping over the summer when the show took a break; in contrast, downloads directly from the website held up even over the summer break when no new episodes were published.

1_Zhmt6XPbKfHoAhy0wpsj7A.png

Taking just the device-level data for website downloads coming from Mixpanel and creating a dashboard from the referrer and device-details attributes confirms, as I'd have expected, that most of those downloads come from listeners using Windows desktop PCs or Android phones, given iTunes' focus on MacOS and iOS devices. I can also see that most plays coming from search traffic come from Google and occasionally from Bing, and that Christian Berg's listeners were unusually spread amongst lots of different countries, not just the usual UK, USA, Germany and so on.

But what none of these dashboards tell me is how good I am at retaining listeners after they've downloaded their first ever episode. Although I suspect many listeners come for one particular episode, I'd be interested to know how many of the devices we see each month are returning rather than new, and we can of course extend the scope of website activity to include clicks on navigation links as another indicator of engagement and retention.

To do this we need to know, for every unique visitor device ID, what sequence number in the total set of episode plays for that device a particular play represents; if the sequence number for a given play event is 1 then we classify the visitor as new, whereas if the sequence number is greater than 1 we classify the visitor as returning. Sequencing these device playback events also makes it possible to analyze play histories for individual podcast listener devices: to understand how many episodes a typical listener plays, whether a particular episode triggers more repeat listens than others, how frequently people listen, and whether returning visitors are coming back to play the same episode or taking the opportunity to discover and play other episodes from the site; this is exactly the sort of analysis that online retailers do when looking to turn casual browsers and one-off purchasers into loyal, repeat buyers from their site.

To do this I create an SQL query in Looker's SQL Runner utility that returns the episodes played for each visitor distinct_id and adds a ROW_NUMBER window function to provide a sequence number for those episode plays, ordered by time:

SELECT 
 distinct_id, 
 event_id, 
 time,
 episode_number,
 row_number() over (partition by distinct_id order by time) as play_order_seq
FROM
  `aerial-vehicle-148023.mixpanel.event` e
LEFT OUTER JOIN `aerial-vehicle-148023.dw_staging.episode_list` l
ON REPLACE(JSON_EXTRACT(e.properties,
      '$.Episode'),'"','') = l.episode
WHERE
  e.name = 'Podcast Episode Played'
1_jxWHFDG70QuKg8eLgcgkOg.png

To add this query to my Looker project as a derived table I then select Get Derived Table LookML from the menu in the top right-hand corner of the page, and choose the option to add it to my project when the Get Derived Table LookML dialog is shown.

1_9DraJHEWp0IpUSzZ1i6i0w.png

Then, after setting all of the derived table columns as hidden and the event_id column as the primary key for this view, I add one new dimension that returns "new" or "returning" depending on whether the play order sequence number for the play event is 1 or any other number. I also set the view_label for this view to the same name as the main page events view, so that the play order sequence number appears alongside the other play event fields in the main explore view.

view: website_play_sequence {
  view_label: "Page Events"
  derived_table: {
    sql: SELECT
       distinct_id,
       event_id,
       time,
       row_number() over (partition by distinct_id order by time) as play_order_seq
FROM
        `aerial-vehicle-148023.mixpanel.event` 
WHERE
        name = 'Podcast Episode Played'
       ;;
  }
dimension: distinct_id {
    hidden: yes
    type: string
    sql: ${TABLE}.distinct_id ;;
  }
dimension: event_id {
    hidden: yes
    primary_key: yes
    sql: ${TABLE}.event_id ;;
  }
dimension_group: time {
    hidden: yes
    type: time
    sql: ${TABLE}.time ;;
  }
dimension: play_order_seq {
    type: number
    sql: ${TABLE}.play_order_seq ;;
  }
dimension: new_or_returning {
    type: string
    sql: case when ${TABLE}.play_order_seq = 1 then 'new' else 'returning' end ;;
  }
}

I then join this new derived table to the page events view in the existing model explore on the event_id column common to both of them, like this:

explore: page_events  {
  join: unique_devices {
    relationship: one_to_many
    sql_on: ${page_events.distinct_id} = ${unique_devices.distinct_id} ;;
  }
  join: website_play_sequence {
    relationship: one_to_many
    sql_on: ${page_events.event_id} = ${website_play_sequence.event_id} ;;
  }
}

Now I can create a look that shows episode play events over time broken down by new or returning device, and I can see that about two-thirds of listens each month come from returning devices.

1_mdxRGHXwcEc8oZkmNL1DQg.png

If we pivot on the episode play sequence number and turn the episode numbers themselves into a measure, we can then show, for each individual device, the sequence of episodes they've downloaded.

1_RfWkCwlRnmufmY6YZ9PuGA.png

So far we've looked at devices playing podcast episodes on the site as one homogeneous group, working on the assumption that listeners visiting the site for the first time now are no different from ones who found us a year or more ago, and that the site itself hasn't changed over that period of time.

But in reality we may have improved the usability of the site or made it easier for visitors to find and discover other episodes, and so I’m keen to understand whether the groups, or “cohorts” of users who joined us in more recent months are engaged for longer and churn at a slower rate than visitors who found the site before those changes were made.

I can return the month that each visitor’s device first played a podcast episode on the site using another SQL query, and whilst I’m pulling this together I might as well calculate how long each device has been accessing episodes on the site along with totals for episode play events and distinct episodes listened to for each visitor device.

SELECT distinct_id, 
       date_trunc(date(min(time)),MONTH) as first_visit_at, 
       timestamp_diff(max(time),min(time),day)+1 as listener_days,
       count(*) as total_plays,
       count(distinct JSON_EXTRACT(properties,'$.Episode')) as total_episodes
FROM   mixpanel.event
WHERE  name = 'Podcast Episode Played'
GROUP BY 1

I then use the feature in SQL Runner that generates a derived table view definition from an SQL query, fine-tune the dimension definitions, and convert the listener_days, total_plays and total_episodes columns into a set of appropriate measures.

view: visitor_cohort {
  view_label: "Unique Devices"
  derived_table: {
    sql: SELECT distinct_id,
       date_trunc(date(min(time)),MONTH) as first_visit_at,
       timestamp_diff(max(time),min(time),day)+1 as listener_days,
       count(*) as total_plays,
       count(distinct JSON_EXTRACT(properties,'$.Episode')) as total_episodes
FROM   mixpanel.event
WHERE  name = 'Podcast Episode Played'
GROUP BY 1
 ;;
  }
  
  dimension: distinct_id {
    primary_key: yes
    hidden: yes
    type: string
    sql: ${TABLE}.distinct_id ;;
  }
  
  dimension: first_visit_at {
    type: date
    label: "Listener Cohort"
    sql: ${TABLE}.first_visit_at ;;
  }
  
  measure: listener_days {
    type: average
    sql: ${TABLE}.listener_days ;;
  }
  
  measure: avg_episode_plays {
    type: average
    sql: ${TABLE}.total_plays ;;
  }
  
  measure: total_episode_plays {
    type: sum
    sql: ${TABLE}.total_plays ;;
  }
  
  measure: avg_episodes {
    type: average
    sql: ${TABLE}.total_episodes ;;
  }  
}

Then I join this second derived table into the model explore, this time joining to the unique_devices view on their common distinct_id column as the additional columns I’m now adding in apply to visitor devices, not episode play events as in the previous example.

explore: page_events  {
  join: unique_devices {
    relationship: one_to_many
    sql_on: ${page_events.distinct_id} = ${unique_devices.distinct_id} ;;
  }
  join: website_play_sequence {
    relationship: one_to_many
    sql_on: ${page_events.event_id} = ${website_play_sequence.event_id} ;;
  }
  join: visitor_cohort {
    relationship: one_to_one
    sql_on: ${unique_devices.distinct_id} = ${visitor_cohort.distinct_id} ;;
  }
}

I can now create looks such as the one below, where I chart the size of each cohort and overlay the average number of episodes each visitor device goes on to download, giving me a combined view of how many new listeners we recruit each month and whether each group over time listens to more, or fewer, episodes than the previous ones.

1_Q7Rf49igY6Mg6fzKBV9aGw.png

If I now add two more window function calculations to the SQL query used by the derived table that provides the sequence number for each episode play by a device, like this:

SELECT
       distinct_id,
       event_id,
       time,
       row_number() over (partition by distinct_id order by time) as play_order_seq,
       date_diff(date(time),date(first_value(time) over (partition by distinct_id order by time)),MONTH) as months_since_first_play_at,
       date_diff(date(time),date(lag(time,1) over (partition by distinct_id order by time)),DAY) as days_since_last_play
FROM
        `aerial-vehicle-148023.mixpanel.event`
WHERE
        name = 'Podcast Episode Played'

I can now create a heat-map for each cohort showing how their engagement holds up or falls off over the six months after their first play of a podcast episode on the site.

1_ipWdMIM607Rkmxh2OOMFFQ.png

If I switch the two dimensions around and change the visualization type to a line chart using a logarithmic scale, you can now see how engagement falls off (or "decays") for each of my listener cohorts over the six months after they first play an episode on the site.

1_GfyCbXM4O1oOZWkq37ewyA.png

Or, finally, I could take the Days Since Last Episode Play measure and use it to show how time between episode plays increases for each cohort, with big increases in time elapsed being a strong indicator of that cohort losing interest in the podcast and “churning”.

1_35JXcKGt5uEuweMBUDbvjw.png

So whilst we've been analyzing listeners to a podcast, and in detail only a small and perhaps skewed subset of overall listeners, the techniques we've used, such as cohorting visitors based on when they first download or purchase and then tracking their engagement and overall lifetime value, are ones we've applied recently when working on Looker projects for clients such as Colourpop in the US and Florence back home in the UK.

Drop us an email at info@mjr-analytics.com or contact us on +44 7866 568246 if you're looking to do something similar with your customer or product data, and in the meantime you can check out past episodes of the Drill to Detail podcast at the new home for the show, co-located with this blog on the MJR Analytics site at https://www.mjr-analytics.com/podcast.

Introducing MJR Analytics … and How Two Years Go So Fast When You’re Learning Something New

Today I’m excited to be launching MJR Analytics, a new consulting company focusing on modern, cloud analytics projects using technology from Looker, Qubit, Fivetran, Oracle and Snowflake and development techniques learnt from my time working as an analytics product manager at a startup in London.

a8d92-1depqrpwrgm8zvqcissuz9a.png

Our new website (and no that’s not me sitting in the chair)

So what have I been up to in the two years since I left my old consulting company, and how has that experience and the way I’ve been working with analytics technologies over that time inspired me to start another one?

Two years ago I announced on Twitter that I’d left the company I’d co-founded back in 2007 and intended to now take on a new challenge, and then spent the rest of the week at Oracle Open World cycling over the Golden Gate Bridge and coming back on the ferry and trying to decide what that challenge might actually be.


For most of my time in the IT industry I'd been working on projects implementing products from vendors such as Oracle, and I'd always been interested in how these products came to market, how software vendors came up with a strategy and roadmap for those products, and how the team behind those products worked with the engineers who built them.

I'd also become increasingly interested in the startup world, and towards the end of my time at Rittman Mead had taken on an informal role advising Gluent, Tanel Poder and Paul Bridger's product startup, who were building software that enabled big enterprise customers to offload their data warehousing workloads from expensive proprietary databases onto cheap, flexible storage and processing running on Hadoop clusters.

What appealed to me about working more formally with Gluent was the opportunity it gave me to work with two smart founders and an even smarter development team developing a product built entirely on big data technology I’d until then only scratched the surface with on consulting gigs. The product marketing role I took on was all about establishing what market that product was primarily intended for, how we went about positioning the product to appeal to that market and how we then brought that product to market.

Understanding these four things is crucial if you're going to actually get customers to buy your startup's product:

  • who is the buyer

  • what problem does your product solve

  • what is the value solving that problem, and

  • why your product is the first to solve it for them

otherwise you'll spend your time building a solution to a problem that nobody actually has, and that's the reason the majority of tech startups end up failing. Solving a problem that a market is willing to pay you money to solve is called "product/market fit", and if your product has it, and you've built your business such that it scales linearly as more customers discover your product, then you're going to make a lot more money than a consultancy constrained by how many hours in the week your consultants can work and the upper limit on how much you can charge for a single person's time.

I also learnt the distinction between product marketing, product management and product development in my time at Gluent. Going back to my time as a consultant attending product roadmap sessions at conferences I never quite knew which parts of the product team those speakers came from, but in summary:

  • Product Marketing is about taking a product that’s typically already built and then deciding the product’s positioning and messaging, then launching the product and ensuring the sales team, sales engineering and customers understand how it works and what it does; as such, this is a marketing role with a bit of technical evangelism thrown in

  • Product Development is the actual building of the product you're looking to sell, and requires an engineering skillset together with the inspiration that typically came up with the product idea in the first place, along with an entrepreneurial side that made you want to build a company around it

  • Product Management is more of a customer-facing role and is about understanding what your customers want and what their problems and use-cases are, and then creating a strategy, roadmap and feature definition for a product that will meet those needs

Despite my undoubted product marketing skills based around PowerPoint and internet memes:

 
7c48e-1tjqaxhjqadwgs1q_z4hanw.jpeg
 

In product marketing, it’s never too soon to put a Santa hat on a photo of the founder

in the end I realised that it was product management that interested me the most and, after a couple of meetings with an old friend who used to run product management at Oracle for their business analytics product line and who had recently moved to London and now led the product team at Qubit, a technology startup created by four ex-Googlers building marketing technology products based around Google's big data and cloud technology, I joined their team later in 2016 as product manager responsible for the analytics features on their platform.

I spoke about Qubit and the partnership we established with Looker back in May last year in a presentation at Looker's JOIN 2017 conference in San Francisco, and the slide deck below from that event goes into the background to the product and the problem it solves: helping customers using Qubit's personalization platform make more effective use of the data we collected for them.

The product and data engineering teams at Qubit did an excellent job bringing together the features for this product and in hindsight, the bits I was most proud of included:

  • The business metadata layer we created on Google BigQuery and Google Cloud Platform to translate an event-level normalized many-to-many data model designed for fast data ingestion into a denormalized, dimensional data model designed for easy use with BI and ETL tools

  • Additional integration we created for the Looker BI tool including a set of industry vertical-specific Looker models and dashboards we then made available on the Looker Block Directory and in a Github public repo

bd207-1xm71rz142mmc8pxwnlqd1g.png

Screenshot from Personalization Analytics Block for Looker by Qubit

  • The multi-tenant data warehouse and hosted Looker instance we then put together to enable customers without their own Looker instance to make use of their data in Google BigQuery, doing so in a way that supported per-tenant extensions and customizations by the customer or their implementation partner.

d3090-1ligy1fezlkqsq3dvroghca.png

Technical Architecture for Live Tap as presented at Looker JOIN 2017

What I'll take away from my time at Qubit, though, is the incredible amount I learnt about product management, product engineering and how to build and run a successful startup and team who are still highly motivated seven years in, plus the introduction it gave me to the analytics and data-led world of digital marketing, eCommerce and modern data analytics platforms.

Consulting is a popular route into product management and the experience I brought to the role in areas such as business metadata models, analytical techniques and the needs of BI and ETL developers proved invaluable over the eighteen months I worked as part of Qubit’s product and engineering teams, but moving into product management within a young, technology-led startup founded by ex-Googlers and working with some of the smartest and most innovative people I’ve ever met involved learning a whole new set of skills including:

  • Developing on a new technology platform (Google Cloud Platform) within a new industry (eCommerce and digital marketing) and understanding a whole new set of analytics use-cases and customer roles (A/B testing, stats models and event-based analytics used by analysts and strategists within eCommerce businesses) that I described in a presentation at last year’s UK Oracle User Group Tech Conference in Birmingham:

  • Working as part of a team rather than directing that team, and managing up as well as down, a technique I had to relearn pretty quickly in my first few months in the role

  • Learning to achieve my goals through influence rather than in the top-down way I’d been used to getting things done leading customer projects, and as CTO and owner of the company that team worked for

  • Saying no to customers rather than yes as you did as a consultant, as your objective is to build a product that solves the most important customer needs but doesn’t burden you with so many features addressing niche use-cases that you end up with Homer’s car and can’t innovate the product in future releases

  • How to take a product through its lifecycle from identifying a need that makes sense for your company to meet, through prototyping, alpha and beta releases to successful first launch and then creating a strategy and roadmap to manage that product over its complete lifecycle

  • How to use a new generation of modern, cloud-native data analytics tools such as Looker together with products such as Fivetran, Google Cloud Platform, Qubit, Snowflake DB and Snowplow Analytics that were increasingly also being adopted by the FinTech, MarTech and B2C startups clustering in London and other European/North American tech hubs

I learnt so much from my colleagues at Qubit about products, engineering and building a successful and motivated team that put up with my jokes and built the most technologically-advanced marketing personalization platform on the market.

But what my time at Qubit also made clear to me was that, when it came down to it, what really motivated me to get up in the morning, learn all these new technologies and still be wildly excited to come into work twenty years into my career was:

  • using data and analytics to find new insights and uncover new opportunities in a customer’s data set

  • working with those individual clients, over time, to enable them to find more of those insights and opportunities themselves

  • finding new innovations in analytics technologies and in how we deliver projects, to make this process cheaper, faster and more likely to be successful

  • and building a team, and crucially a business, to do all of this at scale and offer a full set of analytics-related consulting services built around modern analytics tools and delivery techniques

Which is why after two years away from the consulting business and two enjoyable, rewarding and enlightening years working on the other side of the data and analytics industry I’m now launching my new consulting company, MJR Analytics; and I hope to be working with many of you as clients or members of our team over the coming months and years.

Date Partitioning and Table Clustering in Google BigQuery (and Looker PDTs)

Google BigQuery is a data warehousing-orientated "table-as-a-service" product from Google Cloud Platform that, like Oracle Exadata, optimizes for full-table scans rather than selective row access, and stores data organized into columns rather than rows to align better with the filtering and aggregation typical of data warehousing workloads.

BigQuery charges by the amount of data stored in your tables and the amount of data read when executing SQL queries against those tables, so BI tools such as Looker that work efficiently with BigQuery only request the columns they need when querying BigQuery datasets, rather than running a SELECT * and throwing away what's not needed.

To illustrate this let's run a query that requests all columns ("SELECT * FROM …") from a BigQuery table; as you can see from the screenshot below it reads through all 1.47GB of table data to return that full set of columns to the user.

If the user's query really only needed two of those columns, requesting just those brings the amount of data read down to just under 10MB, as most of that table's data is stored in columns that aren't needed for the query.
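
To make the difference concrete, here's a sketch of the two queries side by side; the table and column names are purely illustrative rather than the actual ones behind the screenshots.

-- Scans every column, so the full table's data is read and billed
SELECT *
FROM `mydataset.pageviews`

-- Scans only the two columns the query actually needs, a small fraction of the table's data
SELECT view_date, page_url
FROM `mydataset.pageviews`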

BigQuery has historically supported table partitioning based on the date you loaded data into the table, which sounds great until you realise that it's the transaction date, not the ingest date, that most of your users' queries filter against.

You could also use table decorators in Legacy SQL to point to the particular day partition your data was stored within, but this only went back a maximum of seven days and required your query tool to support this non-standard feature of SQL. Earlier this year, though, Google introduced a more flexible form of date partitioning as a beta release feature that allows you to choose the date column your table will be partitioned by, and more recently introduced a feature called table clustering that stores data within a table sorted by the columns you most commonly filter on when running queries against it.


To show how date partitioning and table clustering work, I’ll start by running a query to return just a month’s data from the five years of data held within my table; as you can see in the screenshot below, BigQuery performs a full table scan and reads through all 1.37 GB of data in the table to return just the month of data my query requested.

Standard SQL now supports DDL commands such as CREATE TABLE and CREATE TABLE … AS SELECT, along with a PARTITION BY clause that lets you specify a timestamp or date column to partition the table by. I’ll use these new features to create a copy of that same table, this time partitioned by the timestamp column I’m filtering on in my query predicate …
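
The statement takes roughly the form below; this is a sketch, with the table and column names as illustrative stand-ins for the real ones in my project.

-- Hedged sketch: copy an existing table into a new one partitioned on its timestamp column
CREATE TABLE `mydataset.pageviews_partitioned`
PARTITION BY DATE(view_ts) AS
SELECT *
FROM `mydataset.pageviews`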

… and the DDL statement fails. What’s happened there then?

It turns out that BigQuery tables are limited to 2500 partitions for the moment, with any single load operation limited to 2000 (hence the error). With partitioning limited to date and timestamp columns, and partitions a day in length, any one table can only hold around five years' worth of data; beyond that you'll need to create multiple date partitioned tables and UNION them together through a SQL view.
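
In practice that means sharding anything longer than about five years into several date partitioned tables and stitching them back together with a view, something like the sketch below (names again illustrative).

-- Hedged sketch: a view that unions several date partitioned tables so queries
-- can span more history than a single table's partition limit allows
CREATE VIEW `mydataset.pageviews_all` AS
SELECT * FROM `mydataset.pageviews_2009_2013`
UNION ALL
SELECT * FROM `mydataset.pageviews_2014_2018`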

For now though I load my table with just five years of data and then re-run the query that requests a single day from that overall five years; now BigQuery has only read and processed 57 MB of data and it’d be a fraction of that if I only requested the two columns I needed, not all columns from that table.

But what about queries that filter against the other columns in this table? We can’t set up table partitioning on STRING, INTEGER or any other type of column datatype so my original query if re-run against the date partitioned table reads just as much data as it did before.

What we could do is re-create the table with its data pre-sorted by those two particular columns using another new feature called table clustering, so that queries that filter against those columns find the blocks of data they’re looking for faster and can skip completely the ones that don’t.

If like me you’re more familiar with databases such as Oracle, “table clustering” is all about storing data from tables sharing a common cluster key together in the same data blocks, so that queries against that group of tables filtering on that key return data faster.

Table clustering in BigQuery is more analogous to loading regular Oracle tables using data from a pre-sorted file and comes with the same benefits and limitations; in BigQuery’s case it takes care of the pre-sorting and table configuration for you but the same limitations still apply around how you filter the table and what happens when you load more data afterwards.


Let's set up a clustered table now that stores its data ordered by the two columns used in the query I ran a few moments ago.
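
The DDL takes the same CREATE TABLE … AS SELECT form as before, with a CLUSTER BY clause added after the partitioning; the table and column names below are again illustrative rather than the real ones from my project.

-- Hedged sketch: a partitioned table whose rows are also clustered (pre-sorted)
-- by the two columns most commonly used in query filters
CREATE TABLE `mydataset.pageviews_clustered`
PARTITION BY DATE(view_ts)
CLUSTER BY site_section, page_url AS
SELECT *
FROM `mydataset.pageviews`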

Now when I run a query filtering on those columns against this partitioned, clustered table the amount of data read goes down compared to before, and results are returned a bit faster; if I included the partition key column in the query as well, returning just a few days’ data, it’d be faster still.

But queries that filter using any of the other columns don't benefit from this clustering, and moreover the benefits of sorting the data initially loaded into a clustered table are lost over time as you load more (unsorted) data into it, meaning that to really benefit from clustering a table you have to rebuild it regularly using a CREATE TABLE … AS SELECT.
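
A periodic rebuild can be as simple as writing a re-sorted copy of the table and swapping it in place of the original; again this is a sketch with illustrative names, not a production job.

-- Hedged sketch: rebuild the clustered table by re-creating it from itself so that
-- newly-loaded, unsorted rows are re-sorted back into the cluster order
CREATE TABLE `mydataset.pageviews_clustered_rebuilt`
PARTITION BY DATE(view_ts)
CLUSTER BY site_section, page_url AS
SELECT *
FROM `mydataset.pageviews_clustered`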

Table clustering in BigQuery is nice to have but, similar to pre-sorting a direct path load into Oracle database tables, it'll take a lot of rebuilding and careful querying to get the most benefit from, and with the size of most BigQuery tables I doubt that rebuilding will happen much in practice.


BI tools such as Looker can make use of table partitioning and clustering straight away, as no changes are required in the query SQL you write; everything is handled in the table definition. Where you might want to set up partitioning yourself as a Looker developer is for the persistent derived tables (PDTs) that Looker can create to materialize the results of a view you define in your Looker model, for example a view that derives a dimension attribute using a subquery to calculate the sequence number of a user's orders for retention analysis on an eCommerce dashboard, as shown in the screenshot below.

Looker has for some time come with database-specific settings for particular source database types, such as those used for Redshift in the example above, and now supports date partitioning for persistent derived tables through a new partition_keys setting, as announced in this recent Looker forum post.

Finally, if you’re interested in how partitioning is developing as a feature within BigQuery, and some of the edge-cases and requests for additional functionality that users of partitioning are asking for, this feature request on the BigQuery issue tracker is useful to read-through.

Extending support to more than 2500 partitions seems to be the most popular request, along with allowing integer and string datatype columns to be used for the partition key, but also look out for issues around re-loading data into historic partitions and the cost and work involved in rebuilding large tables to re-cluster their data or apply partitioning for the first time.

Using Looker Data Actions to Make Monzo Spend Analysis More Interactive … and Actionable

The other week I posted a blog about analyzing the bank transaction data Monzo make available to customers via their developer API using Looker, the BI tool I use day-to-day in my role at Qubit as Analytics Product Manager.

One of the dashboards I've created shows me how spend looks across a number of merchant categories, compares their average transaction values, highlights the one I visit the most (The Ginger Dog in Brighton, thoroughly recommended) and plots each of them on a map so I can remember where they're each located.

If I click on one of the map icons on the Spend by Location visualization, Looker pops up a dialog showing me the individual column values for this particular merchant: the Hardwood Arms in Fulham, with one visit and total/average spend of £75 exactly.

As well as tagging your transactions with metadata from the Foursquare web service, Monzo also adds the Twitter handle for each merchant along with their website URL; I'll start making this dashboard a bit more interactive by adding these as link parameters to the merchant_name dimension in my LookML view definition.

dimension: merchant_name {
  group_label: "Monzo Transactions"
  label: "Merchant Name"
  type: string
  sql: ${TABLE}.merchant_name ;;

  link: {
    label: "Merchant Website"
    url: "https://{{ fluentd_monzo.merchant_metadata_website._value }}"
    icon_url: "https://www.google.com/s2/favicons?domain=google.com"
  }
  link: {
    label: "Merchant Twitter Profile"
    url: "https://twitter.com/{{ fluentd_monzo.merchant_metadata_twitter_id._value }}"
    icon_url: "https://www.google.com/s2/favicons?domain=twitter.com"
  }

Monzo also provides Google Places API and Foursquare API merchant metadata lookup codes along with each transaction record, so I add two more link parameters to call these two web services.

link: {
  label: "View in Google Maps"
  url: "https://www.google.com/maps/search/?api=1&query={{ fluentd_monzo.merchant_address_latitude._value }},{{ fluentd_monzo.merchant_address_longitude._value }}&query_place_id={{ fluentd_monzo.merchant_metadata_google_places_id._value }}"
  icon_url: "https://www.google.com/s2/favicons?domain=maps.google.com"
}
link: {
  label: "View in Foursquare"
  url: "{{ fluentd_monzo.merchant_metadata_foursquare_website._value }}"
  icon_url: "https://www.google.com/s2/favicons?domain=foursquare.com"
}

Now when I click on the name of a merchant within any data visualization using that data point I get a menu of links I can click on for more information.

Clicking on the View in Google Maps link passes across the merchant’s latitude, longitude and Google Places API key to the Google Maps web service, and Looker then opens a new browser tab and gives me all the information I’m looking for using the Google Places web service.

All of this is good to know but it’s only really of value if I can make this data actionable, for example by linking it to the Google Tasks web service I use to record personal, non-project related tasks and reminders. To do this I’m going to use Looker’s Data Actions feature together with Zapier’s integration with Google Tasks to tie it all together.

First, I create a Zapier Zap that listens for a webhook call and then uses the parameter values I provide from the Looker data action to create a new Google Task. This article on the Looker Discourse website goes into more detail on how to set up Zapier webhook integration with data actions if you'd like to set up something similar yourself.

Then I add an action parameter into my merchant LookML dimension definition that defines a form with default values, allows me to amend those values if needed and then HTTP POSTs them as the payload to the Zapier web service.

action: {
  label: "Add Google Task"
  url: "https://hooks.zapier.com/hooks/catch/XXXXXX/fco8nm/"
  icon_url: "https://www.google.com/s2/favicons?domain=tasks.google.com"
  form_param: {
    name: "Task Name"
    type: string
    required: yes
    default: "{{ merchant_name }}"
  }
  form_param: {
    name: "Due Date"
    type: string
    required: yes
  }
  form_param: {
    name: "Action"
    type: textarea
    default: "Book table at {{ merchant_name }}"
  }
  form_param: {
    name: "Urgency"
    type: select
    option: {
      name: "Low"
      label: "Low"
    }
    option: {
      name: "High"
      label: "High"
    }
  }
}

Now, another menu link item is displayed to allow me to create Google Task items when I click on merchants in a Monzo transactions dashboard, like this:

When I click on that link a dialog is then shown, giving me the option to amend the task title and details and add a due date and urgency to the task I’m about to define.

Then finally, I press Submit and then go over to the Google Tasks web interface to see my new task recorded and ready for me to action tomorrow morning.

Connecting Looker to Oracle Autonomous Data Warehouse Cloud

Earlier in the week I wrote-up my first impressions of Oracle Autonomous Data Warehouse Cloud (ADWC) on this blog, and said at the end I’d follow with another post on how to connect ADWC to Looker, the cloud-based BI tool I usually query Google BigQuery with in my role as Analytics Product Manager at Qubit.

For this initial example I uploaded a few tables of workout and other data into ADWC using SQL Developer, exporting the relevant tables out of BigQuery in CSV format and then uploading them into a user account I created for the purpose in ADWC. The table I'll concentrate on for this example is the STRAVA_RIDES table, with each row detailing an individual cycle workout as recorded through the Strava smartphone app.

The first step in connecting-up Looker to ADWC, or indeed any Oracle Database, is to run some scripts that set up Oracle as Looker expects to find it. Unlike most other BI tools I’ve used with Oracle, Looker expects to connect through a specific user login (“LOOKER”) that is then granted SELECT access to any tables and views you want to report on in other schemas. This user login also needs to have some views and synonyms created to give it access to the V$ system views within Oracle that report on active sessions, and an ability to kill long-running sessions through a PL/SQL package that calls ALTER SYSTEM … KILL SESSION.

The commands to run for regular Oracle databases are detailed on the Looker website but you need to alter them slightly to use ADWC's superuser account name ("ADMIN") instead of SYS when initially connecting and when creating the LOOKER_HASH function for symmetric aggregate handling, along with various other changes due to differences in how objects are named in ADWC vs. regular Oracle Database.

I’ve listed the commands I ran on my ADWC instance below, they should work for you but if not then check out the “Autonomous Data Warehouse Cloud for Experienced Oracle Database Users” section in Using Autonomous Data Warehouse Cloud that explains the differences between the new autonomous and regular Oracle Database server versions.

First create the Looker account and grant the relevant roles and privileges:

connect ADMIN/<<your_admin_password>>
create user LOOKER identified by <<new_looker_account_password>>;
alter user LOOKER
default tablespace DATA
temporary tablespace TEMP
account unlock;

alter user LOOKER quota unlimited on DATA;
alter user LOOKER default role RESOURCE;
grant CREATE SESSION to LOOKER;
GRANT UNLIMITED TABLESPACE TO LOOKER; 
GRANT CREATE TABLE TO LOOKER;
grant select on -- <all tables that will be used by looker>;

Now create the views that Looker uses to understand what sessions are active and the SQL that’s currently being executed to provide data for looks and dashboard tiles:

create or replace view LOOKER_SQL 
as
select SQL.SQL_ID, SQL.SQL_TEXT
from V$SQL sql, V$SESSION sess
where SESS.SQL_ADDRESS = SQL.ADDRESS
and SESS.USERNAME='LOOKER';
create or replace synonym LOOKER.LOOKER_SQL for LOOKER_SQL;
grant select ON LOOKER.LOOKER_SQL to LOOKER;
create or replace view LOOKER_SESSION as 
SELECT SID, USERNAME, TYPE, STATUS, SQL_ID, "SERIAL#", AUDSID
FROM V$SESSION
WHERE USERNAME='LOOKER';
create or replace synonym LOOKER.LOOKER_SESSION FOR LOOKER_SESSION;
GRANT SELECT ON LOOKER.LOOKER_SESSION TO LOOKER;

Next, create the Oracle PL/SQL function that Looker uses as part of symmetric aggregate handling, and a PL/SQL procedure that Looker can use to “kill” runaway database queries that are taking too long to return results back to you. Note the “/” terminators, needed to execute each PL/SQL block before the statements that follow it run.

create or replace function LOOKER_HASH(bytes raw, prec number) return raw as
begin
return(DBMS_CRYPTO.HASH(bytes, prec));
end;
/
create or replace synonym LOOKER.LOOKER_HASH for LOOKER_HASH;
grant execute on LOOKER.LOOKER_HASH to LOOKER;  
grant execute on ADMIN.LOOKER_HASH to LOOKER;
create or replace procedure LOOKER_KILL_QUERY(P_SID in VARCHAR2,
P_SERIAL# in VARCHAR2)
is
CURSOR_NAME pls_integer default dbms_sql.open_cursor;
IGNORE pls_integer;
begin
select COUNT(*) into IGNORE
from V$SESSION
where USERNAME = USER
and SID = P_SID
and SERIAL# = P_SERIAL#;
if (IGNORE = 1)
then
dbms_sql.parse(CURSOR_NAME,
'alter system kill session '''
|| P_SID || ',' || P_SERIAL# || '''',
dbms_sql.native);
IGNORE := dbms_sql.execute(CURSOR_NAME);
else
raise_application_error(-20001,
'You do not own session ''' ||
P_SID || ',' || P_SERIAL# ||
'''');
end if;
end;
/
create or replace synonym LOOKER.LOOKER_KILL_QUERY 
for ADMIN.LOOKER_KILL_QUERY;
grant execute on ADMIN.LOOKER_KILL_QUERY to LOOKER;
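
Before moving over to the Looker configuration it’s worth a quick smoke test of the new account; something along the lines of the statements below (the 2 passed to LOOKER_HASH is DBMS_CRYPTO’s MD5 hash-type constant) should return rows rather than errors if the grants and synonyms above were created correctly:

connect LOOKER/<<new_looker_account_password>>
select count(*) from LOOKER_SESSION;
select LOOKER_HASH(UTL_RAW.CAST_TO_RAW('test'), 2) from DUAL;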

Next, over to the Looker configuration. As of the time of writing you’ll need to be on the Looker 5.12.12 or higher release, with an instance hosted in the US, to get the integration working, so that “ADWC” is listed as a connection type and the ADWC wallet integration works; if you’re running Looker as a hosted instance you’ll also need to speak with support to have them copy the wallet files across to the correct location on the Looker server.

To create the connection, enter the following details:

  • Name : Name of your connection, e.g. “rittman_adwc”
  • Dialect : Oracle ADWC (only appears with Looker 5.12.12+)
  • Host:Port : from the TNSNAMES.ORA file in your ADWC wallet zip file
  • Username : LOOKER (as per the account setup in previous steps)
  • Password : password of LOOKER account
  • Temp Database : LOOKER (as per previous steps)
  • Persistent Derived Tables : checked
  • Service Name : From TNSNAMES.ORA in your ADWC wallet zip file
  • Additional Database Params : TNSNAMES.ORA SSL Server Cert DN

To show the Service Name and Additional Database Params fields you first have to save the connection, then tick the “Use TNS” checkbox to reveal them. To find your host:port, service name and SSL Server Cert DN values, first download the wallet zip file for your ADWC instance from the ADWC Service Console, unzip the archive and then locate the details you need in the TNSNAMES.ORA file; each service-level entry in that file (high, medium and low) contains the host, port, service_name and ssl_server_cert_dn values you need. In my case I chose the “medium” service-level entry for my connection settings.

Then, save and test your connection. The check for persistent derived tables will probably fail if you try this around the time of my writing, as there’s a known bug in that particular step; it’ll no doubt be fixed soon, and if the rest of the checks pass you should be good.

Finally, it’s then just a case of importing your table metadata into Looker and creating explores and a model as you’d do with any other data source, like this:

In this instance I’ve updated the Strava Rides LookML view to turn the relevant metric fields into measures, define a primary key for the view and remove or hide fields that aren’t relevant to my analysis, like this:

Now I can start to analyze my Strava workout data I previously uploaded to Oracle Autonomous Data Warehouse, starting with average cadence and speed along with total distance and Strava’s “suffer score” for each of my workouts:

and then looking to see how much correlation there is between distance and personal Strava records being broken on my five longest rides.

In the background, Looker is sending Oracle and ADWC-specific SQL to Oracle Autonomous Data Warehouse Cloud, with the SQL tab in the Explore interface showing me the actual SQL for each query as it’s sent.
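
As a rough illustration, the generated SQL for the workout summary above would look something like the query below; the schema and column names here are assumptions based on the Strava fields I uploaded (and the STRAVA schema from the grant example earlier), and the exact aliases and row-limit syntax Looker emits will differ:

select strava_rides.RIDE_DATE,
avg(strava_rides.AVERAGE_CADENCE) as average_cadence,
avg(strava_rides.AVERAGE_SPEED) as average_speed,
sum(strava_rides.DISTANCE) as total_distance,
sum(strava_rides.SUFFER_SCORE) as total_suffer_score
from STRAVA.STRAVA_RIDES strava_rides
group by strava_rides.RIDE_DATE
order by strava_rides.RIDE_DATE
fetch next 500 rows only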

Should I wish to check how much of the storage and CPU capacity available for my ADWC instance is being used, I can do this from ADWC’s Service Console.

So there you have it — Looker powered by an Oracle Autonomous Data Warehouse Cloud, and no need for an Oracle DBA to get it all running for you.

Timeline Charts, Derived Tables and Analytic Functions in Looker 5

One of the data streams that comes into my personal Google BigQuery data warehouse is a record of all the Hue light switch settings and illuminance levels, Nest thermostat settings and motion sensor readings coming in via the Samsung Smart Things hub under the stairs back home. For the Hue lightbulbs I get a stream of 1s and 0s that initially gets stored in a four-column table in BigQuery, like this:

I then create a SQL view over that table that joins it to a lookup table containing the device type and room allocated to each device in the data stream, and sets a metric type for each reading so that I can analyze all the thermostat readings together, bring together all the light switch settings, and so on.
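
The defining query for that view might look something like the sketch below; the raw and lookup table names and the lookup columns here are illustrative rather than my actual schema:

select
r.date_time as date_time,
r.device as device,
d.room as room,
d.metric_type as metric,
case
when r.raw_value = 'on' then 1.0
when r.raw_value = 'off' then 0.0
else SAFE_CAST(r.raw_value as FLOAT64)
end as value,
r.raw_value as raw_value
from `aerial-vehicle-148023.personal_metrics.smartthings_raw` r
join `aerial-vehicle-148023.personal_metrics.smartthings_devices` d
on r.device = d.device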

It’s then quite straightforward to bring this SQL view into Looker and add it to the rest of the data streams I’ve brought together there so I can see, for example, whether the kitchen lights get turned-off after everyone’s gone upstairs after breakfast or whether I should try and automate those as well as my kettle.

Looker 5 introduced another way to visualize data with defined start and end events in the form of the new Timeline Chart, a feature I tweeted about the other day when I brought data on tasks from Asana into my BigQuery data warehouse via a Fivetran integration.

It’d be interesting to use this type of chart to show how one particular light switch changed from on to off and then on again over a period of time, or take all the Hue lights for a given part of the house and see whether anybody turns the lights off ever … or whether they need to be linked to the motion sensors and automatically turned-off after a period of inactivity in each room.

To use this type of chart I first need to create an “end date” field for each of my event records, so that each timeline has a start and an end date, with the label and value fields being optional. Looking at the list of events for the light switch I looked at earlier, and ordering the events by time recorded, you can see the information I need is there:

SELECT
*
FROM
`aerial-vehicle-148023.personal_metrics.smartthings_readings`
WHERE
device = 'kitchen seating area spot 1'
AND raw_value IN ('off','on')
ORDER BY
date_time

What this calls for is the LEAD BigQuery Standard SQL analytic function that returns a value based on a subsequent row to the one we’re working with, and I can combine that with the TIMESTAMP_DIFF function to compute the time between the start and end event dates as that would be useful to know too. The SQL expressions for the two new fields would therefore be:

LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) end_date_time,
TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, SECOND) as value_duration_seconds

I could add these column expressions into the SQL view definition in BigQuery and order the rows by device, metric and timestamp, like this:

SELECT
date_time as date_time,
device as device,
metric,
value,
raw_value,
LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) as end_date_time,
TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, SECOND) as value_duration_seconds
from `aerial-vehicle-148023.personal_metrics.smartthings_readings`
where device = 'kitchen seating area spot 1'
and metric = 'switch'
and raw_value in ('on','off')
order by 2,3,1

Executing that SQL shows the logic is working, but I’d then have to maintain that view within BigQuery and that might not be the most convenient place to add new code.

Instead, I could just go into Looker and create a new view there based on a derived table SQL expression and do the work there. Right now my LookML model looks like the excerpt below, where you can see the sensor readings view joined into the rest of the explore so all my data can be analyzed together.

connection: "rittman_bigquery"
include: "*.view.lkml" # include all views in this project
include: "*.dashboard.lookml" # include all dashboards in this project
explore: date_dim {
case_sensitive: no
label: "Data Warehouse"
join: fluentd_transactions {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
relationship: many_to_many
}
join: smartthings_readings {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
relationship: one_to_many
}
join: fluentd_uber_rides {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
relationship: many_to_many
}
}

I now create a new LookML view that uses a derived table SQL query as the view definition rather than simply referencing an existing BigQuery table. Note how I’ve used the same view_label as the main LookML view containing my event data, so that the dimensions and measure I define here appear alongside all the other smart device fields in the same explore view.

view: device_event_end_and_timespan {
view_label: "6 - Smart Devices"
derived_table: {
sql: SELECT
date_time as date_time,
device as device,
metric as metric,
LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) as end_date_time,
TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, SECOND) as value_duration_seconds
from `aerial-vehicle-148023.personal_metrics.smartthings_readings`
order by 2,3,1;;
}
dimension: date_time {
type: date_time
hidden: yes
sql: ${TABLE}.date_time ;;
}
dimension: device {
type: string
hidden: yes
sql: ${TABLE}.device ;;
}
dimension: metric {
type: string
hidden: yes
sql: ${TABLE}.metric ;;
}
dimension_group: end_date_time {
group_label: "End Date"
label: "End"
type: time
timeframes: [
raw,
time,
hour,
hour3,
hour4,
hour6,
hour12,
hour_of_day,
time_of_day,
minute,
minute5,
minute10,
minute15,
minute30,
day_of_week,
day_of_month,
day_of_year,
date,
week,
week_of_year,
month,
month_name,
month_num,
quarter,
year
]
sql: ${TABLE}.end_date_time ;;
}
dimension: pk {
primary_key: yes
hidden: yes
type: string
sql: concat(cast(${TABLE}.date_time as string), ${TABLE}.device) ;;
}
measure: value_duration_seconds {
type: average
sql: ${TABLE}.value_duration_seconds ;;
}
}

Then I join this derived table view back into the explore within my LookML model, as below:

connection: "rittman_bigquery"
include: "*.view.lkml" # include all views in this project
include: "*.dashboard.lookml" # include all dashboards in this project
explore: date_dim {
case_sensitive: no
label: "Data Warehouse"
join: fluentd_transactions {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
relationship: many_to_many
}
join: smartthings_readings {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
relationship: one_to_many
}
join: device_event_end_and_timespan {
type: left_outer
sql_on: ${smartthings_readings.date_time} = ${device_event_end_and_timespan.date_time} and
${smartthings_readings.device} = ${device_event_end_and_timespan.device} and
${smartthings_readings.metric} = ${device_event_end_and_timespan.metric};;
relationship: one_to_one
}
join: fluentd_uber_rides {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
relationship: many_to_many
}
}

and now I can create timeline charts that show me which lights were on over a particular period of time, like this:

or more importantly, work out why the bathroom never seems to be free when you’ve got two teenage kids in the house.

Monzo, BigQuery, Looker and “Fintech”: The Other London Tech Startup Scene

I’ve written a number of blogs over the past few months about bringing my fitness, smart devices and other digital data sources into Google BigQuery and Looker to do interesting analysis with the combined dataset, whilst also keeping my tech skills up to date and relevant for my day job as Product Manager for Analytics at Qubit, a London tech startup in the hot new MarTech (“Marketing Technology”) space.

At Qubit we also use Google BigQuery and Looker to enable our customers to run fast, interactive queries against petabytes of consumer behavioural data, with BigQuery abstracting away the hassles of running a production data warehouse and Looker abstracting away the complexities of SQL, joining datasets together and trying to visualize it all in D3 or RStudio.

One big gap in the data I’d managed to collect up until recently was anything to do with what I actually spent each day. Whilst my current bank, along with most others, provides a means to manually download statement data in Excel or CSV format, or supports commercial services that load that data into your personal finance software for a monthly fee, I wanted to access my data in the same way that I access my health, timekeeping and cycling workout data: through a REST API, cURL and a cron job that loads new data into BigQuery on the hour, every day.

So that’s why I smiled when I heard about Monzo, another tech startup just a mile or so from my office in London that’s disrupting another industry: consumer banking. Monzo, Starling and other new app-only banks are taking advantage of new regulations such as the EU’s Payment Services Directive (PSD2) that open up existing bank accounts to authorised third parties looking to provide a better user experience (UX) than your existing bank arrangements; or, as in Monzo’s case, they’re actually obtaining a banking license and offering a complete end-to-end service that focuses on solving customer problems rather than treating you as a cheap source of money for the bank to lend.

Along with other technology-led startups working in the financial services space, Monzo are the best example yet of a compelling new financial services company with a strong appeal to consumers more familiar today with Facebook than chequebooks.

Image Courtesy of Medici, “Awe-Inspiring London FinTech Industry Is Firing on All Cylinders”

Me, I was more interested in the fact that Monzo also use Looker and Google BigQuery to run their analytics infrastructure, and that they also offer a developer-orientated REST API providing exactly what I’ve been looking for: a way for me, and increasingly others in the developer community, to download, analyze and visualize our spending data, and to start building the applications that, when the full Monzo API comes out, will revolutionize how we do our banking in the coming years.


To start analyzing my banking data through Monzo I first needed to do a one-off batch download of all my historic transactions and top-ups into a BigQuery table, and then set up a trigger on Monzo’s systems that sends across all subsequent transactions as they happen, keeping my data in sync with Monzo’s record of my bank account.

To set up the initial transaction history transfer I first registered at https://developers.monzo.com/ and then generated an access token for my original pre-paid Monzo card and the current account card that recently replaced it. Note that this access token only lasts for a short time, so you can’t generate it just once and use it in a script forever. More importantly, the developer API is meant just for personal use and can’t be used to access other people’s accounts, so you can’t build one of the new apps made possible by the UK’s new open banking regulations just yet (but this is coming soon through their interim AIS API, now in testing with a small number of third-party developers).

To find out the account IDs for your Monzo accounts, either use the API playground web app as I’ve done in the screenshot above, or make your first call to the Developer API yourself via the cURL tool:

curl -G "https://api.monzo.com/accounts" -H "Authorization: Bearer <<YOUR_ACCESS_TOKEN_HERE>>" > ./accounts.json

and then retrieve your account IDs from the JSON output using a tool such as jsonpp (brew install jsonpp if you’re on a Mac; similar tools exist for other OSs):

{
"accounts": [
{
"id": "acc_<<MY_PREPAID_CARD_ID>>",
"created": "2017-04-06T07:07:15.175Z",
"description": "Mark Rittman",
"type": "uk_prepaid"
},
{
"id": "acc_<<MY_CURRENT_ACCOUNT_ID>>",
"created": "2017-11-28T08:35:28.323Z",
"description": "Mark Robert Rittman",
"account_number": "<<MY_AC_NUMBER>>",
"sort_code": "<<MY_SORTCODE>>",
"type": "uk_retail"
}
]
}

Then you can retrieve all your account transactions with another cURL request, like this one for my current account:

curl -G "https://api.monzo.com/transactions" -d account_id=acc_<<MY_CURRENT_ACCOUNT_ID>> -H "Authorization: Bearer <<YOUR_ACCESS_TOKEN_HERE>>" > ./transactions.json

and then you can use tools such as “sed” (to strip out the {"transactions": … } wrapper from the JSON output) and one of the many json2csv converters out there (for example this one, which has npm as a prerequisite, something I’m guessing most people attempting this sort of project will already have; if not, first install npm via brew and you’re ready):

sed -i -e 's/{"transactions"://g' ./transactions.json
sed -i -e 's/]}/]/g' ./transactions.json
brew install json2csv
json2csv -i ./transactions.json -f created,amount,description,currency,category,include_in_spending,<<ANY_OTHER_REQUIRED_FIELDS>> -o transactions.csv

There are many more fields available to extract from the JSON documents downloaded via the developer API, depending on whether you’re accessing a Monzo prepaid card or current account; the developer docs on the Monzo site go through some of them, but you’re best off going through a handful of transactions to see what’s available and then creating your own custom -f extract list.

Then as the final step I use the bq command-line tool to load the data in the CSV file output in the previous step into a BigQuery table, like this:

bq load --source_format=CSV --replace=true --skip_leading_rows=1 --autodetect dw_staging.monzo_transactions_stg ./transactions.csv

and then go into the BigQuery Web UI to check my transaction data has come in as expected.
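
A quick query like the one below, using the field names from the json2csv extract list above, confirms the row count and date range look sensible:

select count(*) as txns,
min(created) as first_txn,
max(created) as last_txn
from dw_staging.monzo_transactions_stg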

In practice, for my analysis exercise I brought in a lot more fields including merchant name and map X/Y location, Foursquare metadata and other fields that are only available for the current account product, so check through the JSON elements available for your particular Monzo product and extract the ones that most interest you.


So what about capturing transactions after this date: how do we do that without some complicated process that downloads all the transactions again and then applies just the new ones to the BigQuery table? The way I bring all my other app, smart device and other event data into BigQuery is either to post events directly to a fluentd server running on Google Compute Engine, or to use services such as IFTTT and Zapier that interface with my apps and then post to fluentd using outgoing webhooks (or their “Maker Channel”, as IFTTT called the service when I wrote about it in a blog post back in 2016).

The Monzo Developer API also comes with a webhook feature that lets you register a URL, for example the one I use when posting all my other event data to fluentd, with all subsequent transaction data then being posted to that server URL and on to the various BigQuery tables that store my data for analysis.

In the case of the Monzo transactions I added an intermediate step and had the Monzo webhook post to a Zapier “zap”, which made it easier to decode the nested JSON elements Monzo was sending over that held merchant and other payment processor-related information.

Zapier then sends over just those elements of the transaction I’m interested in to the fluentd server, which then streams them into BigQuery using the fluentd BigQuery plugin. From there I can write SQL to analyze this data in BigQuery’s web UI, remembering to divide the transaction amounts by 100 as they’re stored as pence in integer datatypes within the Monzo system, and to remove any top-ups, declines and other non-spend transactions from the results by filtering out records where the “include_in_spending” column is set to “false”.
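
For example, a simple spend-by-category query along the lines of the one below (assuming autodetect loaded include_in_spending as a boolean and amount as integer pence, negative for debits) gives a first view of where the money’s going:

select category,
round(sum(abs(amount)) / 100, 2) as total_spend_gbp
from dw_staging.monzo_transactions_stg
where include_in_spending = true
and amount < 0
group by category
order by total_spend_gbp desc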


Analyzing the data in a tool such as Looker makes things much more interesting as I can start to make use of the locational metadata for merchants that Monzo provide for current account transactions, showing me for example where I took out cash from ATMs on the way back home from the pub and then forgot I’d done so the morning after.

Or I can see how my spend on essentials like entertainment and eating out rises in the summer, especially when I appear to do away with non-essentials like food and other groceries.

I can see where it was I spent the most money when out on cycling trips in the summer, mostly in pubs around the New Forest and up near Bath, Frome and Warminster (and how I wish winter would end so I can get back out on my bike again).

I can even work out how my choice of restaurant and entertainment changed from lots of spend on music in August through to lots of eating out in October.

And more importantly, I can see I’m spending far too much money on coffees and breakfast on the way into work each day, but that’s the downside I guess of working just off Covent Garden in what must be one of the nicest work locations I’ve ever had the privilege to base myself out of.

The beauty of Looker is that I can define a data model in LookML that takes table structures and SQL expressions and turns them into more user-friendly measure and dimension definitions, like this:

These LookML “views” can then be joined to other views that share a common join key — in my case, the truncated “date” part of a date and time timestamp field — that then makes it possible for me to analyze spend data in the context of where I was on that day and that time; whether I was out on the bike and if so whether I’d had a good workout or not so good; what the weather was like on the day and what I’d been chatting about on Slack at the time; whether the spend was for an Uber ride and if so where to.

Analysis that shows me, now I’m finally able to join my financial and activity dots together for the first time, where all my money’s gone since the nights got darker and I’m clearly in the pub rather than out on my bike. Luckily spring is around the corner, or I’ll be applying for Monzo’s new overdraft service that’s just gone into review rather than registering new Strava personal records next month.