Posts tagged Analytics
Event-Level Digital Analytics using Google Analytics, Fivetran, BigQuery and Looker

A few weeks ago I posted a blog on using Mixpanel, Looker and Google BigQuery to analyze listener data from the Drill to Detail website, with Mixpanel tracking individual episode play events by visitors to the site and Fivetran replicating that event data over to BigQuery for analysis using Looker.

Mixpanel is great, but like most digital businesses that use Google Analytics (GA) to analyze and measure activity on their website, we actually use Google Tag Manager (GTM) and GA events to track visitor activity on the mjr-analytics.com website. Whilst a full site license for Google Analytics 360 is way beyond our budget right now, the free version of GA together with GTM gives us an unsampled, event-level stream of visitor events and, together with goals and simple A/B testing tools, a way to start optimizing the content on our site.

Couple all of this with Fivetran and Looker and we can start to do some interesting analytics on all this data to not only count sessions, visitors and pages viewed but also start looking at visitor retention and the routes those visitors take through the various features on our website.

Events in Google Analytics are a way of recording interactions that visitors to your site make with items on a page, such as selecting an item on a navigation menu, clicking on a download link or seeing a special offer or product recommendation. On the mjr-analytics.com website we’ve set up events, for example, to record how far down the page a visitor scrolls, so we can see how many people actually scroll past the hero image on our front page and look at the content on Looker services, Oracle services and so on.


Another event records which hero image a visitor clicks on, and another records which navigation menu item was clicked, in each case storing the image name or menu item as the event label.


Whilst you can record events by adding custom Javascript to your website’s HTML page definitions, we use GTM’s graphical point-and-click interface to set up our events and the visitor actions that trigger them.


Enabling GTM’s debug mode for our site shows us which events fire as a visitor moves through the site, with the screenshot below showing a typical visitor journey through our homepage and the events that are triggered as they scroll through and interact with various items on the page.


Data on triggered events, along with their event label and event action values, is stored in a Javascript data layer that can also receive values sent over, for example, by an eCommerce application, so that clicks on a product catalog item can record the price on offer as well as what product was clicked on.

Looking at the Data Layer view in Google Tag Manager you can see the event entries recorded for this visit to the site, but note also how nothing personally identifiable is recorded except for a unique client ID set at the device/browser level that can then be used to analyze new vs. returning visitors and track visitor retention and engagement over time.
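To show how that client ID becomes useful once the events land in BigQuery (more on that below), here’s a minimal sketch that tags each session as new or returning; it reuses the staging table and column names from the sessionization SQL later in this post, so treat it as illustrative rather than the exact query we run.

SELECT
  e.global_session_id,
  e.visitor_id,
  -- a session is "new" if it starts at the visitor's first-ever session start time
  CASE
    WHEN e.visitor_session_start_date_time = f.first_session_start THEN 'New Visitor'
    ELSE 'Returning Visitor'
  END AS visitor_type
FROM
  `aerial-vehicle-148023.dw_staging.ga_events` e
JOIN (
  SELECT
    visitor_id,
    MIN(visitor_session_start_date_time) AS first_session_start
  FROM
    `aerial-vehicle-148023.dw_staging.ga_events`
  GROUP BY
    1) f
ON
  e.visitor_id = f.visitor_id
GROUP BY
  1, 2, 3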


As with the Mixpanel example and the Drill to Detail website, we use Fivetran to replicate all of the data layer events over to Google BigQuery, storing this granular visitor behavioural data in a BigQuery table and dataset. As you can see from the screenshot below we use Fivetran extensively within MJR Analytics; it replicates invoices and all our other accounting data from Xero into BigQuery along with timesheet data from Harvest, CRM data from Hubspot, commits and PRs from Github, tasks outstanding and completed from Asana and search rankings and keywords from Google Search Console, all into a Google BigQuery project whose datasets we then combine and analyze together using Looker.


For the event data coming across from GA we use Fivetran to replicate the relevant event tracking dimensions and metrics into BigQuery every five minutes, giving us near real-time numbers on what’s popular and what’s engaging on our company website.
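To give an idea of what that enables, a query as simple as the one below against the staging table that the sessionization SQL later in this post is built on shows which events and labels are getting the most interaction; it’s a rough sketch rather than a dashboard query we actually run.

SELECT
  event_category,
  event_label,
  COUNT(*) AS event_count
FROM
  `aerial-vehicle-148023.dw_staging.ga_events`
GROUP BY
  1, 2
ORDER BY
  event_count DESC
LIMIT 20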


As the event data from GA arrives as table rows of individual events being triggered, I then sessionize those individual events into rows of visitor sessions, using BigQuery SQL to pivot the incoming data and create individual columns for the first event, second event, first page view and first product item clicked on, for example, within an individual visitor session. The actual SQL used for the view is really too long to include here, but to give you the idea, here’s a condensed version with repeating parts removed for clarity.

SELECT
  global_session_id,
  visitor_session_id,
  visitor_id,
  visitor_session_start_date_time,
  MAX(visitor_session_length_mins) AS visitor_session_length_mins,
  visitor_first_event_week_cohort,
  city,
  session_events_count,
  ...
  event_1_category,
  event_1_label,
  ...
  page_view_1,
  page_view_2,
  ...
FROM (
  SELECT
    e.global_session_id,
    e.visitor_session_id,
    e.visitor_id,
    e.visitor_session_start_date_time,
    e.visitor_session_length_mins,
    e.visitor_first_event_week_cohort,
    e.city,
    MAX(e.visitor_session_event_seq_num) OVER (PARTITION BY e.global_session_id) AS session_events_count,
    (
    SELECT
      COUNT(*)
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      global_session_id = e.global_session_id
      AND event_category = 'Page View') AS page_view_count,
    (
    SELECT
      COUNT(*)
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      global_session_id = e.global_session_id
      AND event_category = 'Nav Bar Clicks') AS nav_bar_click_count,
    ...,
    e1.event_category AS event_1_category,
    e1.event_label AS event_1_label,
    e2.event_category AS event_2_category,
    e2.event_label AS event_2_label,
    ...
    pv.event_label AS page_view_1,
    pv2.event_label AS page_view_2,
    ...,
    igc.event_label AS index_gallery_first_click
  FROM
    `aerial-vehicle-148023.dw_staging.ga_events` e
  LEFT OUTER JOIN (
    SELECT
      global_session_id,
      event_label,
      ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) page_view_seq_num
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      event_category = 'Page View'
    ORDER BY
      3) pv
  ON
    e.global_session_id = pv.global_session_id
    AND pv.page_view_seq_num = 1
  LEFT OUTER JOIN (
    SELECT
      global_session_id,
      event_label,
      ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) page_view_seq_num
    FROM
      `aerial-vehicle-148023.dw_staging.ga_events`
    WHERE
      event_category = 'Page View'
    ORDER BY
      3) pv2
  ON
    e.global_session_id = pv2.global_session_id
    AND pv2.page_view_seq_num = 2
  ...
  JOIN
    `aerial-vehicle-148023.dw_staging.ga_events` e1
  ON
    e.global_session_id = e1.global_session_id
    AND e1.visitor_event_seq_num = 1
  LEFT OUTER JOIN
    `aerial-vehicle-148023.dw_staging.ga_events` e2
  ON
    e.global_session_id = e2.global_session_id
    AND e2.visitor_event_seq_num = 2
  ...
GROUP BY
  1,
  2,
  ...,
  46
ORDER BY
  1

Creating a view or materialized table from this BigQuery SQL then gives me one record per session with event totals, page view and event paths, along with the session and client IDs that are then perfect for bringing into Looker for further analysis.
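Materializing that output is just a wrapper around the query; the much-simplified sketch below shows the idea, with the ga_sessions table name being illustrative and the real definition using the full sessionization query above rather than the cut-down SELECT shown here.

-- simplified stand-in for the full sessionization query above
CREATE OR REPLACE TABLE `aerial-vehicle-148023.dw_staging.ga_sessions` AS
SELECT
  global_session_id,
  visitor_id,
  MIN(visitor_session_start_date_time) AS visitor_session_start_date_time,
  COUNT(*) AS session_events_count
FROM
  `aerial-vehicle-148023.dw_staging.ga_events`
GROUP BY
  1, 2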


Then finally it’s just a case of bringing that table of data into Looker as a LookML view and using one of Looker’s custom visualization blocks to display visitor event paths as a Sankey diagram, for example.


Or we can aggregate and analyze counts of events from sessions to see in graphical form which offers, download items and navigation menu items are most popular over a given period.


If you’d like to understand more about how MJR Analytics can help you better understand what visitors are looking at and interacting with on your website then drop me an email at mark.rittman@mjr-analytics.com; or even better, if you’re in Copenhagen on November 22nd and have the evening free, come along to the free meetup we’re hosting together with our partners CIMA and Looker, where I’ll demo all of this live, in person, and show you how it works.

Introducing MJR Analytics … and How Two Years Go So Fast When You’re Learning Something New

Today I’m excited to be launching MJR Analytics, a new consulting company focusing on modern, cloud analytics projects using technology from Looker, Qubit, Fivetran, Oracle and Snowflake and development techniques learnt from my time working as an analytics product manager at a startup in London.


Our new website (and no that’s not me sitting in the chair)

So what have I been up to in the two years since I left my old consulting company, and how has that experience and the way I’ve been working with analytics technologies over that time inspired me to start another one?

Two years ago I announced on Twitter that I’d left the company I’d co-founded back in 2007 and intended to now take on a new challenge, and then spent the rest of the week at Oracle Open World cycling over the Golden Gate Bridge and coming back on the ferry and trying to decide what that challenge might actually be.

 
 

For most of my time in the IT industry I’d been working on projects implementing products from vendors such as Oracle, and I’d always been interested in how these products came to market, how software vendors came up with a strategy and roadmap for those products, and how the team behind those products worked with the engineers who built them.

I’d also become increasingly interested in the startup world and towards the end of my time at Rittman Mead had taken on an informal role advising Gluent, Tanel Poder and Paul Bridger’s product startup, who were building software that enabled big enterprise customers to offload their data warehousing workloads from expensive proprietary databases onto cheap, flexible storage and processing running on Hadoop clusters.

What appealed to me about working more formally with Gluent was the opportunity it gave me to work with two smart founders and an even smarter development team developing a product built entirely on big data technology I’d until then only scratched the surface with on consulting gigs. The product marketing role I took on was all about establishing what market that product was primarily intended for, how we went about positioning the product to appeal to that market and how we then brought that product to market.

Understanding these four things is crucial if you’re going to actually get customers to buy your startup’s product:

  • who is the buyer

  • what problem does your product solve

  • what is the value solving that problem, and

  • why you’re the first product to solve it for them

otherwise you’ll spend your time building a solution to a problem that nobody actually has, and that’s the reason the majority of tech startups end up failing. Solving a problem for a market that’s willing to pay you to solve it is called “product/market fit”, and if your product has it, and you’ve built your business such that it scales linearly as more customers discover your product, then you’re going to make a lot more money than a consultancy constrained by how many hours in the week your consultants can work and the upper limit on how much you can charge for a single person’s time.

I also learnt the distinction between product marketing, product management and product development in my time at Gluent. Going back to my time as a consultant attending product roadmap sessions at conferences I never quite knew which parts of the product team those speakers came from, but in summary:

  • Product Marketing is about taking a product that’s typically already built and then deciding the product’s positioning and messaging, then launching the product and ensuring the sales team, sales engineering and customers understand how it works and what it does; as such, this is a marketing role with a bit of technical evangelism thrown in

  • Product Development is the actual building of the product you’re looking to sell, and requires an engineering skillset together with the inspiration that typically came up with the product idea in the first place, along with an entrepreneurial side that made you want to build a company around it

  • Product Management is more of a customer-facing role and is about understanding what your customers want and what their problems and use-cases are, and then creating a strategy, roadmap and feature definition for a product that will meet those needs

Despite my undoubted product marketing skills based around PowerPoint and internet memes:

 
 

In product marketing, it’s never too soon to put a Santa hat on a photo of the founder

in the end I realised that it was product management that interested me the most and, after a couple of meetings with an old friend who used to run product management at Oracle for their business analytics product line and who had recently moved to London and now led the product team at Qubit, a technology startup created by four ex-Googlers building marketing technology products based around Google’s big data and cloud technology, I joined their team later in 2016 as product manager responsible for the analytics features on their platform.

I spoke about Qubit and the partnership we established with Looker back in May last year in a presentation at Looker’s JOIN 2017 conference in San Francisco, and the slide deck below from that event goes into the background to the product and the problem it solves: helping customers using Qubit’s personalization platform make more effective use of the data we collected for them.

The product and data engineering teams at Qubit did an excellent job bringing together the features for this product and in hindsight, the bits I was most proud of included:

  • The business metadata layer we created on Google BigQuery and Google Cloud Platform to translate an event-level normalized many-to-many data model designed for fast data ingestion into a denormalized, dimensional data model designed for easy use with BI and ETL tools

  • Additional integration we created for the Looker BI tool including a set of industry vertical-specific Looker models and dashboards we then made available on the Looker Block Directory and in a Github public repo


Screenshot from Personalization Analytics Block for Looker by Qubit

  • The multi-tenant data warehouse and hosted Looker instance we then put together to enable customers without their own Looker instance to make use of their data in Google BigQuery, doing so in a way that supported per-tenant extensions and customizations by the customer or their implementation partner.


Technical Architecture for Live Tap as presented at Looker JOIN 2017

What I’ll take away from my time at Qubit, though, was the incredible amount that I learnt about product management, product engineering and how to build and run a successful startup and team who are still highly motivated seven years in, along with the introduction it gave me to the analytics and data-led world of digital marketing, eCommerce and modern data analytics platforms.

Consulting is a popular route into product management and the experience I brought to the role in areas such as business metadata models, analytical techniques and the needs of BI and ETL developers proved invaluable over the eighteen months I worked as part of Qubit’s product and engineering teams, but moving into product management within a young, technology-led startup founded by ex-Googlers and working with some of the smartest and most innovative people I’ve ever met involved learning a whole new set of skills including:

  • Developing on a new technology platform (Google Cloud Platform) within a new industry (eCommerce and digital marketing) and understanding a whole new set of analytics use-cases and customer roles (A/B testing, stats models and event-based analytics used by analysts and strategists within eCommerce businesses) that I described in a presentation at last year’s UK Oracle User Group Tech Conference in Birmingham:

  • Working as part of a team rather than directing that team, and managing up as well as down, a technique I had to relearn pretty quickly in my first few months in the role

  • Learning to achieve my goals through influence rather than in the top-down way I’d been used to getting things done leading customer projects, and as CTO and owner of the company that team worked for

  • Saying no to customers rather than yes as you did as a consultant, as your objective is to build a product that solves the most important customer needs but doesn’t burden you with so many features addressing niche use-cases that you end up with Homer’s car and can’t innovate the product in future releases

 
 
  • How to take a product through its lifecycle from identifying a need that makes sense for your company to meet, through prototyping, alpha and beta releases to successful first launch and then creating a strategy and roadmap to manage that product over its complete lifecycle

  • How to use a new generation of modern, cloud-native data analytics tools such as Looker together with products such as Fivetran, Google Cloud Platform, Qubit, Snowflake DB and Snowplow Analytics that were increasingly also being adopted by the FinTech, MarTech and B2C startups clustering in London and other European/North American tech hubs

I learnt so much from my colleagues at Qubit about products, engineering and building a successful and motivated team that put up with my jokes and built the most technologically-advanced marketing personalization platform on the market.

But what my time at Qubit also made clear to me was that, when it came down to it, what really motivated me to get up in the morning, learn all these new technologies and still be wildly excited to come into work twenty years later was:

  • using data and analytics to find new insights and uncover new opportunities in a customer’s data set

  • working with individual clients, over time, to enable them to find more of those insights and opportunities themselves

  • finding new innovations in analytics technologies and how we deliver projects to make this process cheaper, faster and more likely to be successful

  • and building a team, and crucially a business, to do all of this at scale and offer a full set of analytics-related consulting services built around modern analytics tools and delivery techniques

Which is why after two years away from the consulting business and two enjoyable, rewarding and enlightening years working on the other side of the data and analytics industry I’m now launching my new consulting company, MJR Analytics; and I hope to be working with many of you as clients or members of our team over the coming months and years.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.3

In this series of three blogs on Oracle Analytics Cloud Data Lake Edition I’ve setup an object store data lake in Oracle Cloud using Oracle Big Data Cloud and Oracle Storage Cloud, and ingested streams of real-time event data from IoT and social media sources into Oracle Cloud’s object storage service using Oracle Event Hub Cloud Service.

The event-stream data I staged into Storage Cloud was then copied into parquet files on HDFS and then presented out to BI and ETL tools through Big Data Cloud’s Thrift Server interface, so that now I’m ready, after a short diversion into defining the data engineer role that would typically work with this new product edition, to start exploring some of Oracle Analytics Cloud Data Lake Edition’s new data flow and predictive model preparation features.

The diagram below shows where OAC Data Lake Edition fits into my project architecture, performing the tasks of transforming and enriching the incoming dataset and then presenting my at-scale data out to end-users for analysis using OAC Data Lake Edition’s Data Visualization features.

Looking at the homepage within OAC Data Lake Edition I can see my two Hive tables listed within the dataset catalog, alongside other datasets I’d uploaded directly into OAC. This visual catalog of available datasets is also the new homepage interface that OBIEE12c 12.2.1.4.0 now adopts, with both cloud and on-premises versions of Oracle’s BI tools now relegating the old “Answers” homepage to something you have to dig around and specifically look for, in favour of this more self-service Data Visualization starting page.

I’ll have to write an article on Answers and how powerful its interface is, and the full dimensional model it exposes from the Oracle BI Repository, in a blog post sometime in the future as it’s almost in danger of getting forgotten about.

Moving on though, the first transformation I need to do on all the incoming datasets is to take the timestamp column in each table and convert it to a format that OAC recognises as a valid TIMESTAMP datatype format, then convert those columns to TIMESTAMPs so that DV can automatically enable time-series analysis by day, month, quarter, hour and so on. I do that using a feature that’s also present in OAC Standard Edition, the lightweight data preparation interface that’s presented to users when they first add a new data source into OAC’s dataset catalog, shown in the screenshots below.
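If it helps to see what that conversion amounts to in SQL terms, here’s a rough HiveQL equivalent of what the data preparation step is doing; the tweets table name, the event_ts column and the assumption that the raw value is an ISO-8601-style string are all illustrative rather than taken from my actual schema, as in practice I did this through the OAC data prep interface.

SELECT
  -- replace the 'T' separator so Hive can parse the string, then cast it to a timestamp
  CAST(regexp_replace(event_ts, 'T', ' ') AS timestamp) AS event_timestamp,
  t.*
FROM
  tweets t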

Where OAC Data Lake Edition gets really interesting right now, both in terms of differences vs. the on-premises versions of OBIEE I used to use and in terms of its “data engineering” potential, is with a feature called Data Flows.


Most self-service BI tools now have a basic data loading and data preparation capability, with Tableau Data Prep being one of the latest examples. Designed to handle more complex data prep use-cases than basic datatype changes and field-splitting, they give end-users the ability to do this type of work themselves rather than trying to do it in Excel or handing the work off to the IT department and having to wait days or weeks to get that data back.

Data Flows are a feature that’s been introduced since the original on-premises version of OBIEE12c that I last used when working out in consulting, and provide you with what’s effectively a lightweight, multi-step ETL tool that executes transformations using the BI Server’s Model Extension feature, introduced back when OBIEE12c first came out as the mechanism to enable on-the-fly data mashups between server-side and user-uploaded datasets.

Looking at the transformation operators available in OAC Data Lake Edition v4 there’s quite a few that apply to data lake and data engineering-type workloads including running Python statistical analysis scripts and predictive model training and model build; there’s also an operator for creating an Essbase Cube, with Essbase in this instance positioned as a fast ad-hoc analysis back-end for use with the data visualization part of OAC.

For now though there’s two transformation tasks I want to do with my Hive datasets; first, enrich the incoming social media data by analyzing the sentiment in each tweet and then writing the data plus this sentiment tagging back to the Oracle Big Data Cloud environment, so that I can then turn those sentiment tags into a score and create a data visualization showing who sends me the most tweets and how crazy they are overall.

The second data enrichment I wanted was on some Strava cycling workout data I’d uploaded directly into OAC using the CSV file upload facility; using the model train and build Data Flow operators I defined a model to predict how many “kudos”, the Strava equivalent to Facebook “likes”, I’d get for a given cycle workout with a number of different variables available to the model in order to make the prediction — for example, distance and elevation gain, map location, effort expended and so on.

Then, after running the model build step and looking at the predicted values and the actual ones for the remainder of the dataset not used for model training, you can see the predicted kudos values are fairly in-line with the ones I actually recorded for those rides.


Another feature that’s now in Oracle Analytics Cloud is automated data diagnostics, or Explain. Explain uses machine-learning libraries and that same model extension/XSA BI Server framework to help users quickly understand the value distribution and statistically correlated driving factors for a particular dataset, and learn which segments or cohorts have the highest predictive significance. Enabled by a bunch of extensions to BI Server logical SQL I used the feature first on the sentiment scoring I’d performed earlier on, and then on the steps data I’d brought into Oracle Big Data Cloud from my Fitbit device, after converting the numeric step counts into a text attribute by bucketing its values into low, medium and extreme bucket values.
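For reference, the bucketing itself is simple to express in SQL; the sketch below shows the general idea against a hypothetical Hive table of daily Fitbit readings, where the fitbit_daily_steps table, its columns and the band thresholds are all assumptions for illustration, since I actually did the conversion within an OAC data flow.

SELECT
  reading_date,
  steps,
  -- thresholds are illustrative, not the ones I actually used
  CASE
    WHEN steps < 5000  THEN 'Low'
    WHEN steps < 15000 THEN 'Medium'
    ELSE 'Extreme'
  END AS step_count_band
FROM
  fitbit_daily_steps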

Explain is pretty powerful stuff, with automated understanding and context-gaining about new datasets being one of the most user-enabling features I’ve seen arrive recently in BI tools, the best example of this being BeyondCore, now part of Salesforce Einstein. OAC lets the user pick the most useful of the Explain facts and driver insights and publish them to a Data Visualization dashboard like the one below, showing the most predictive and significant variables in my dataset that influence the steps I take each day.

Which leads neatly to the final “data at-scale” feature in OAC, the Data Visualization feature that in my case is querying the ingested, transformed and now enriched datasets I’ve got running on my Oracle Big Data Cloud instance alongside Oracle Event Hub Cloud and Oracle Analytics Cloud Data Lake Edition.


Thank you once again to the Oracle ACE Director program for providing access to Oracle Analytics Cloud Data Lake Edition, Oracle Big Data Cloud and Oracle Event Hub Cloud services over the past few weeks. If you’re looking to try these new services out there’s free trials available for most of Oracle’s Cloud products and many of the new features are also available in Oracle Data Visualization Desktop 12c and Oracle Business Intelligence 12c, both of which can be downloaded for training and evaluation under the OTN license scheme.


Wrapping-up this three part series on Oracle Analytics Cloud Data Lake Edition and Oracle Big Data Cloud I’d like to go back to the two (serious) questions I asked myself at the end of the previous post:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product, Oracle Big Data Discovery, officially still available and not quite dead yet but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask, and he’s actually gone to Cloudera?

So has Oracle Analytics Cloud Data Lake Edition got much to do with actual “data lakes”? Well … it integrates with Oracle Big Data Cloud and apparently comes with an option to run those data flow transformations in Big Data Cloud’s Apache Spark environment, though to be fully transparent I didn’t see that as an option when doing my evaluation so can’t comment on how well or not that works.

Like Oracle Big Data Discovery before it, OAC Data Lake Edition makes you structure your incoming event stream data with Hive table metadata before you can work with it, but that’s actually fairly standard practice with most data visualization tools that work with Hadoop and data lake environments.

Having Essbase in this product package, alongside the data lake functionality, did make me scratch my head a bit and wonder, “why?”; data lakes and Essbase are about as opposite as you can get in terms of target users and use-cases, and I think this Data Lake Edition is as much about creating a product package and price point that sits mid-way between the OAC Standard and Enterprise Editions as it is about data lakes.

But there is some logic to having Essbase in this edition; it provides a set of easy-to-use loading and preparation tools for Essbase making it easier for customers new to that product to start using it, and Essbase with its latest hybrid ASO/BSO storage format is surprisingly scalable and blindingly-fast to query, a great potential back-end for enabling data analysis “at-scale” using Oracle Analytics Cloud’s data visualization features.

I also get the feeling that this initial v4 version of OAC Data Lake Edition is more of an initial first-cut release to get something out to customers, establish the product package and validate the roadmap and market assumptions. Oracle Analytics Cloud v5 isn’t too far off and I’d expect incremental improvements and new features in areas such as natural language processing and machine learning built-into the developer experience; I wouldn’t be surprised to see Oracle Big Data Preparation Cloud making its way into the product given its obvious fit and overlap with Data Lake Edition’s data prep features.

But where I really see an interesting future for OAC Data Lake Edition is when it starts to integrate product features and the development team from Oracle’s recent acquisition of Sparkline Data.

I came across SNAP, Sparkline‘s platform for building OLAP-style dimensional models over data lakes and cloud object storage layers, about a year ago when researching analytics platforms at Qubit and quite frankly, it’s as revolutionary in terms of today’s data lake analytics market as OBIEE (or the nQuire Server) was back in 1995 with its virtual data warehouse over application and data warehouse sources.

Take these two slides from the Sparkline website and imagine them as the future of Oracle Analytics Cloud analyzing event-streams and big data cloud-hosted datasets…

and you can see why I’m keen to see where Oracle Analytics Cloud Data Lake Edition goes over the next couple of years. I’m speaking on OAC Data Lake Edition at ODTUG KScope’18 in Orlando in just a couple of weeks’ time, so come along if you’re there; it should be an interesting talk.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.2

A couple of weeks ago I posted a blog on Oracle Analytics Cloud Data Lake Edition and the Oracle Cloud big data products you’d typically use to create a cloud-hosted, object storage data lake. In this and the final post in the series I’m going to look at some of the data transformation and predictive model building features that come with the Data Lake Edition of OAC and extend the base capabilities in this area that OAC Standard Edition comes with, and the data engineering role that uses the data lake this Oracle Analytics Cloud product packaging option is designed for.

As a reminder, OAC Standard Edition is more-or-less Oracle Data Visualization running as a single-tenant cloud-hosted VM (cluster) that’s either customer managed, or Oracle-managed if you go for the new Autonomous Analytics Cloud that interestingly is charged at a lower hourly rate than the customer-managed version; presumably it costs Oracle less to run OAC when it’s doing all the administration rather than letting customers control the settings, which makes sense.

It does make you wonder when all this VM nonsense will go away though as well as the distinction between OAC and the DVCS/BICS managed service cloud analytics products, with all of these services just becoming packaging options for a multitenant service partitioned by customer instance as the likes of Looker and Google Cloud Platform do today.

But for now, Oracle Analytics Cloud Data Lake Edition is described on the product homepage as having the following high-level capabilities compared to OAC Standard Edition:

  • Accelerated analysis via Essbase
  • Support for creating scalable, shareable, high-performance data flows.

whilst the product documentation states that it includes all the OAC Standard Edition features, plus:

  • Explore and replicate data
  • Create scalable data flows including ingestion, transformation, and persistence
  • Execute data flows by using Apache Spark or Oracle Database Cloud

with two edition-specific scenarios listed:

  • Explore Data Lakes, described as “You can create data flows to visualize data in your data lakes”, and
  • “Replicate Data in Data Lakes”, explained as “You can replicate data from Oracle Service Cloud (RightNow) into Oracle Big Data Cloud Service, Oracle Database Cloud, and Oracle Database. For example, you can replicate data from customer service records in Oracle Service Cloud into your data lake on Oracle Big Data Cloud Service and then perform further data preparation and analysis with data flows and data visualization projects.”

I’m guessing most readers won’t have heard of Oracle Service Cloud, but for Oracle Cloud customers it’s quite a useful initial integration, as Service Cloud is Oracle’s cross-channel digital customer call centre application that brings together customer service interactions across web, social media and offline service channels. More interestingly though, and you may have missed it back at the start, Data Lake Edition is the edition that ships Essbase, Oracle’s multi-dimensional OLAP server, in what almost seems like an afterthought right now but will become, I suspect, a more significant defining feature of this product package as time goes on.

For the purposes of this series of blog posts though, what I’m really interested in are the data engineering features in OAC Data Lake Edition — but before I start looking into what they might be, I guess I’d better explain what I mean by a data engineer.


I first came across the term data engineer in a blog written by Curt Monash — still the #1 person I’d love to have come on the Drill to Detail Podcast, if you happen to read this Curt — where he wrote:

“I learned some newish terms on my recent trip. They’re meant to solve the problem that “data scientists” used to be folks with remarkably broad skill sets, few of whom actually existed in ideal form. So instead now it is increasingly said that:

- “Data engineers” can code, run clusters, and so on, in support of what’s always been called “data science”. Their knowledge of the math of machine learning/predictive modeling and so on may, however, be limited.

- “Data scientists” can write and run scripts on single nodes; anything more on the engineering side might strain them. But they have no-apologies skills in the areas of modeling/machine learning.”

I think it’s probably true to say that the term really started to enter people’s consciousness, at least the sorts of people I hang around with, when Maxime Beauchemin from Airbnb wrote his seminal blog “The Rise of the Data Engineer” where he introduced this new discipline as:

“Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization.

Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.

In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.

In smaller companies — where no data infrastructure team has yet been formalized — the data engineering role may also cover the workload around setting up and operating the organization’s data infrastructure. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like.”

Maxime’s blog struck a chord with me as that’s pretty much how I’d describe what I do, or at least it’s the work I most enjoy doing: building big data systems to support BI and data warehousing workloads, where my knowledge of how Hadoop, cloud and distributed data platforms work complements the knowledge that other colleagues have around data science, machine learning and statistical analysis, though importantly, I know enough about those areas to be able to deploy the models they build (or as we say over in the UK at least, I know enough about them to be dangerous).

I interviewed Maxime about his blog and the topic of data engineering on the Drill to Detail Podcast last May and if you’ve not done so already I’d recommend you give the episode a listen; for me though, having just spent the last eighteen months working in the London tech startup scene, my definition of a data engineer would be someone who’s just out of university and has an ironic beard, thinks he’s made ETL tools obsolete but is slowly, painfully, recreating one and is now just on the brink of discovering code templating, deployment contexts, change control and data lineage…

… but earns twice what you earn and gets all the girls. Because, data engineers are even more essential than data scientists if you’re serious about building a data lake, and data lakes are the new data warehouses.


Which brings us back to Oracle Analytics Cloud Data Lake Edition and these three blog posts. The first part of this three-part series talked about Oracle Big Data Cloud and products such as Oracle Event Hub Cloud — Dedicated (who names these products at Oracle? Seriously?), and in this second post I seem to have got rather diverted by trying to define what Oracle Analytics Cloud Data Lake Edition actually is, and then even more so by defining the supporting role of a data engineer; in the final post tomorrow I’ll take a look at some of the data transformation, model definition and deployment and data engineering features in OAC Data Lake Edition and then, finally, conclude by asking myself two questions — OK, three:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product, Oracle Big Data Discovery, officially still available and not quite dead yet but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask, and he’s actually gone to Cloudera?
  3. Will the ironic beard and quiff be enough, or do I need to go full ginger if I’m going to really be taken seriously as a data engineer?

Final part in the series on Oracle Analytics Cloud Data Lake Edition comes later this week.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.1

Some time ago I posted a blog on what analytics and big data development looked like on Google Cloud Platform using Google BigQuery as my data store and Looker as the BI tool, with data sourced from social media, wearable and IoT data sources routed through a Fluentd server running on Google Compute Engine. Overall, the project architecture looked like the diagram below…

… and I’ve got more-or-less the same setup running right now, with an additional GCE VM running Confluent Open Source to feed a subset of the event streams into a Druid cluster that I’m using to test out Looker, Superset and Imply for sub-second ad-hoc query analysis use-cases. More on that soon.

If you’re a regular reader of this blog you might recall a week or so ago I posted a blog on the new releases of Oracle Business Intelligence Enterprise Edition (OBIEE), Oracle Analytics Cloud (OAC) and Oracle Data Visualization Desktop (DVD) and mentioned a new packaging option for their cloud analytics product, Oracle Analytics Cloud Data Lake Edition. I’ve got a particular interest in what this product might be as I used to use the product it replaces, Oracle Big Data Discovery (BDD), fairly extensively in Oracle big data analytics and data lake projects a few years ago.

And Oracle Big Data Discovery was — technically at least — a great product. It combined the search and analytics features of Endeca Information Discovery with the scale and data transformation abilities enabled by Apache Spark and Hadoop, but suffered perhaps by being a bit ahead of the market and by not having any obvious integration with the rest of Oracle’s analytics and data management tools. By contrast Oracle Analytics Cloud Data Lake Edition is one of three packaging options for Oracle Analytics Cloud and includes all of the functionality of OAC Standard Edition (Oracle Data Visualization together with basic data preparation tools) as well as itself being a subset of the wider set of analysis, dashboarding and enterprise data modeling features in OAC Enterprise Edition.

An equivalent product architecture for ingesting, transforming and analyzing my IoT, wearables and social media data in Oracle Cloud would look something like the diagram below, with the following Oracle Cloud Platform-as-a-Service (PaaS) products used for ingest, storage and analysis:

  • Oracle Event Hub Cloud Service: Apache Kafka running either customer or Oracle-managed with full access to Kafka’s REST Proxy and Kafka Connect
  • Oracle Big Data Cloud: Oracle’s new elastically-scalable Hadoop platform running Apache Spark, Ambari and other Hortonworks Data Platform components
  • Oracle Analytics Cloud Data Lake Edition: Oracle Data Visualization combined with more extensive lightweight ETL (“data flow”) components, text analytics and machine learning model training and build capabilities

In this example I’m using Apache Hive and Parquet storage as my column-orientated data platform but of course I’ve now also got Oracle Autonomous Data Warehouse Cloud as an option; I’ll stick with Hive on Oracle Big Data Cloud for now though as this gives me the option to use Apache Spark to transform and wrangle my data and for building machine learning models using SparkML and, via pySpark, Python Pandas. In what’s the first of two posts in this short series I’ll be looking at how the data pipeline is set up, and then in the second post I’ll look at Oracle Analytics Cloud Data Lake Edition in detail focusing on the data transformation, data engineering and data science features it adds beyond OAC Standard Edition.


The development environment I put together for this scenario used the following Oracle Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) components:

  • Oracle Compute Classic and Storage Classic Services
  • Oracle Database Cloud Service, with the 11g database option
  • Oracle Event Hub Cloud Service Dedicated, with Kafka Connect and REST Proxy nodes
  • Oracle Big Data Cloud, single node with Hive, Spark 2.1, Tez, HDFS, Zookeeper, Zeppelin, Pig and Ambari
  • Oracle Analytics Cloud Data Lake Edition with Self Service Data Preparation, Visualisation and Smart Discovery (aka Oracle DV)

The screenshot below from the Compute Classic Service Console shows the various PaaS VMs running on the compute IaaS layer, with other tabs in this console showing network, storage and other infrastructure service usage.

The order in which you install the services is important if you want to associate the various products together correctly, and if like me you’re using a trial account you’ll need to plan carefully to ensure you keep within the various quota limits that Oracle Cloud imposes on trial identity domains; these stopped me, for example, allocating the usual two OCPUs to the main Oracle Event Hub Kafka server if I wanted to run the rest of the stack at the same time.

Associating two services together, for example Oracle Event Hub and Oracle Big Data Cloud, connects the two services automatically in the identity domain network and makes using them together much simpler. However, unless Event Hub Cloud is provisioned and available when you come to install Big Data Cloud you can’t go back and associate them afterwards; more annoyingly, if you decide you want to use Event Hub Cloud and associate it with Big Data Cloud, but since then you’ve provisioned Oracle Analytics Cloud Data Lake Edition and associated Big Data Cloud with that, you have to unwind the whole provisioning process and start again with Event Hub Cloud if you want to connect them all together properly. And forget deleting that Database Cloud Service you associated with Oracle Big Data Cloud and then forgot about, as you can’t delete services that other services are associated with.

Provisioning each of the services involves giving the service instance a name, assigning storage buckets and OCPU counts to the various cluster nodes you request, and at key points selecting previously provisioned and now running services for association with the one you’re now provisioning. The screenshots below show the three-stage provisioning service for Event Hub Cloud Service — Dedicated:

Provisioning a new Event Cloud Service — Dedicated cluster

and the order in which I provisioned my data lake services, and the important options I chose to make it all work together and within quota, were as follows:

  1. First ensure you have access to Oracle Cloud, and then the various IaaS services: Oracle Cloud Infrastructure Compute Classic, Oracle Cloud Infrastructure Object Storage Classic and Oracle Identity Cloud Service
  2. Provision Oracle Database Cloud Service to store the various Fusion Middleware RCU schemas; in my case I chose Oracle Database 11gR2 as the database type as it avoids the complexity around CDBs and PDBs you get with the 12c database release
  3. Then provision Oracle Event Hub Cloud — Dedicated with one OCPU for the main Kafka VM, one for the REST Proxy VM and another OCPU for the Kafka Connect VM. For a real deployment you’d want at least two OCPUs for the Kafka service VM, but using just the one kept me within my overall OCPU quota limit when installing the rest of the stack
  4. Next step is to provision Big Data Cloud with a single node with the minimum 2 OCPUs, the Full deployment profile and version 2.1 of Apache Spark, as that’s the version OAC Data Lake Edition insists on in the next step. When prompted, choose the option to associate Event Hub Cloud and Database Cloud with Big Data Cloud as you won’t get the option to do this again after the initial service provision; once provisioned, open up the TCP port for Ambari (8080) to the public internet so that OAC in the next step can associate with it — provisioning for OAC failed for me every time until I looked through the provisioning logs and spotted this as the issue
  5. Finally, provision Oracle Analytics Cloud and choose Data Lake Edition as the package option, again in my case assigning a single OCPU and selecting Data Lake Edition as the software version

At that point if you then bring up the Cloud Services dashboard and review the services together for the first time, it’ll look something like this:

Oracle Data Lake stack within Cloud Services Dashboard

Now it’s time to ingest some data and land it into Oracle Big Data Cloud.


The streaming IoT, wearables and social media comms data that I’ll be ingesting into Big Data Cloud will be coming in from the public internet over TCP, and I’ll also want to connect to Event Hub Cloud from my desktop using tools such as Kafka Tool, so an additional configuration step I’ll do before setting up anything else is to open up Event Hub Cloud’s Kafka broker endpoint to the public internet using the Access Rules menu item in the Event Hub Cloud console.

Now I can see the Kafka service and the default topics that Event Hub Service created for me in Kafka Tool.

I can then either use Kafka Tool to create a new topic to start receiving the first of my data streams, the IoT device event data coming out of Samsung SmartThings, or create the topic by defining a new Event Hub Cloud service from within the Event Hub Cloud Service — Dedicated console (confusing, but that’s how Kafka topics are named within Event Hub Cloud).

Then it’s just a case of directing the stream of IoT event data to the public Kafka broker endpoint exposed by Event Hub Cloud Service — Dedicated and then, after a short while, checking the metrics for the new Kafka topic that I set up to receive this incoming streaming data.

Getting the data off the Kafka topic and into a Hive table on the Big Data Cloud instance involved the following steps, using Oracle Cloud Infrastructure Object Storage Classic as the intermediate staging layer together with Event Hub Kafka Connect’s OCS Sink Connector:

  1. Configure the Event Hub Kafka Connect OCS Sink Connector to push topic events to Oracle Cloud Infrastructure Object Storage Classic (OCS)
  2. Using the Zeppelin notebook provided by the Big Data Cloud Console, create a cron job that copies those events across to HDFS storage
  3. Create Hive external tables with location clauses that point to the directories I’ve copied the event files into from OCS (a sketch of one such table definition is below)
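To give an idea of that last step, here’s roughly what one of those external table definitions looks like; the table name, columns, storage format and HDFS path are illustrative rather than copied from my actual environment.

-- illustrative Hive DDL; column names, format and path are assumptions
CREATE EXTERNAL TABLE IF NOT EXISTS smartthings_events (
  event_ts       STRING,
  device_id      STRING,
  attribute      STRING,
  reading_value  STRING
)
STORED AS PARQUET
LOCATION '/user/oracle/landing/smartthings_events';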

Then, when I go and log into OAC Data Lake Edition and connect to the Hive Thrift Server on the Big Data Cloud instance I can see the Hive tables I’ve just created, and the data that’s now streaming through from my two initial sources via the Kafka service and Kafka Connect running on Event Hub Cloud Service — Dedicated.

In the second half of this two-post series I’ll go deeper into OAC Data Lake Edition and see how its additional transformation and analysis capabilities stack up against OAC Standard Edition, and also see how it compares to the Oracle Big Data Discovery tool it’s looking to eventually replace.

Using Looker Data Actions to Make Monzo Spend Analysis More Interactive … and Actionable

The other week I posted a blog about analyzing the bank transaction data Monzo make available to customers via their developer API using Looker, the BI tool I use day-to-day in my role at Qubit as Analytics Product Manager.

One of the dashboards I’ve created shows me how spend looks across a number of merchant categories, compares their average transaction values, highlights the one I visit the most (The Ginger Dog in Brighton, thoroughly recommended) and plots each of them on a map so I can remember where they’re each located.

If I click on one of the map icons on the Spend by Location visualization, Looker then pops up a dialog showing me the individual column values for this particular merchant — the Hardwood Arms in Fulham, one visit and total/average spend amounts of £75 exactly.

As well as tagging your transactions with metadata from the Foursquare web service Monzo also adds the Twitter handle for each merchant along with their website URL; I’ll start to make this dashboard a bit more interactive then by adding these as link parameters to my merchant_name dimension definition in my LookML view definition.

dimension: merchant_name {
  group_label: "Monzo Transactions"
  label: "Merchant Name"
  type: string
  sql: ${TABLE}.merchant_name ;;

  link: {
    label: "Merchant Website"
    url: "https://{{ fluentd_monzo.merchant_metadata_website._value }}"
    icon_url: "https://www.google.com/s2/favicons?domain=google.com"
  }
  link: {
    label: "Merchant Twitter Profile"
    url: "https://twitter.com/{{ fluentd_monzo.merchant_metadata_twitter_id._value }}"
    icon_url: "https://www.google.com/s2/favicons?domain=twitter.com"
  }
}

Monzo also provides Google Places API and Foursquare API merchant metadata lookup codes along with each transaction record, so I add two more link parameters to call these two web services.

link: {
  label: "View in Google Maps"
  url: "https://www.google.com/maps/search/?api=1&query={{ fluentd_monzo.merchant_address_latitude._value }},{{ fluentd_monzo.merchant_address_longitude._value }}&query_place_id={{ fluentd_monzo.merchant_metadata_google_places_id._value }}"
  icon_url: "https://www.google.com/s2/favicons?domain=maps.google.com"
}
link: {
  label: "View in Foursquare"
  url: "{{ fluentd_monzo.merchant_metadata_foursquare_website._value }}"
  icon_url: "https://www.google.com/s2/favicons?domain=foursquare.com"
}

Now when I click on the name of a merchant within any data visualization using that data point I get a menu of links I can click on for more information.

Clicking on the View in Google Maps link passes across the merchant’s latitude, longitude and Google Places API key to the Google Maps web service, and Looker then opens a new browser tab and gives me all the information I’m looking for using the Google Places web service.

All of this is good to know but it’s only really of value if I can make this data actionable, for example by linking it to the Google Tasks web service I use to record personal, non-project related tasks and reminders. To do this I’m going to use Looker’s Data Actions feature together with Zapier’s integration with Google Tasks to tie it all together.

First, I create a Zapier Zap that listens for a webhook call and then uses the parameter values I provide from the Looker data action to create a new Google Task. This article on the Looker Discourse website goes into more detail on how to set up Zapier webhooks integration with data actions if you’d like to set up something similar yourself.

Then I add an action parameter into my merchant LookML dimension definition that defines a form with default values, allows me to amend those values if needed and then HTTP POSTs them as the payload to the Zapier web service.

action: {
  label: "Add Google Task"
  url: "https://hooks.zapier.com/hooks/catch/XXXXXX/fco8nm/"
  icon_url: "https://www.google.com/s2/favicons?domain=tasks.google.com"
  form_param: {
    name: "Task Name"
    type: string
    required: yes
    default: "{{ merchant_name }}"
  }
  form_param: {
    name: "Due Date"
    type: string
    required: yes
  }
  form_param: {
    name: "Action"
    type: textarea
    default: "Book table at {{ merchant_name }}"
  }
  form_param: {
    name: "Urgency"
    type: select
    option: {
      name: "Low"
      label: "Low"
    }
    option: {
      name: "High"
      label: "High"
    }
  }
}

Now, another menu link item is displayed to allow me to create Google Task items when I click on merchants in a Monzo transactions dashboard, like this:

When I click on that link a dialog is then shown, giving me the option to amend the task title and details and add a due date and urgency to the task I’m about to define.

Then finally, I press Submit and then go over to the Google Tasks web interface to see my new task recorded and ready for me to action tomorrow morning.

Updates to Oracle Analytics Cloud, Oracle BIEE 12c and Oracle DV Desktop

Late last year I covered the v3 release of Oracle Analytics Cloud (OAC) on this blog; since then the product has had a major UI refresh with OAC v4, so I thought it’d be worth covering this new release along with updates to OBIEE12c and DV Desktop in a follow-up to last year’s post.

OAC’s original look and feel was based on OBIEE12c v1, which itself contained the first iteration of Oracle Data Visualisation (DV), and whilst the new modern flat UI that release introduced was an improvement on the 11g release before it, Oracle have since iterated rapidly on Oracle DV Desktop and DV Cloud Service, and the UI from those products is now the primary user interface for Oracle Analytics Cloud v4.

If you’ve not used Oracle Analytics since OAC v3 and the initial OBIEE12c versions you’ll be surprised how much the user experience has changed since then; when OBIEE12c first came out most of us still considered Answers and Dashboards to be the primary reporting UI, with the BI Repository being the central, carefully governed and dimensionally-modelled source of all reporting data.

Since then the world has moved on and analytics is now all about empowering the end-user through self-service, easy-to-use tools that enable you to do most tasks without having to involve the IT department or do lots of up-front data modeling and data onboarding. Oracle DV Desktop started meeting this need by introducing basic data preparation features into the product so users could upload spreadsheets and other data and do some basic formatting and tidying before analyzing it on their desktop, and those data upload and preparation features are now available for OAC users in the v4 release. To upload a spreadsheet file into OAC you now just drop it onto the web page and you’re then given a preview and the option to upload its contents to the analytics instance.

After that you can trim, split, derive and change the datatypes of data you upload so that it works in the most optimal way with the analysis features of the product, for example by turning dates in string formats not recognized by Oracle DV into proper date datatypes that can then be used in time-series analysis.

For data transformation needs that go beyond basic formatting and datatype changes you can now build ETL-style routines that filter, aggregate, transform and add external datasets all using simple point-and-click user tools, and if you’ve licensed the new OAC Data Lake Edition these transformations then extend to include model training, forecasting and sentiment analysis. In the example below I’m using OAC Data Lake Edition to forecast weight forward from a set of smart scale readings that’s then executed at the BI Server level using Python maths and statistics libraries.

I’ll actually be talking about OAC Data Lake Edition in my presentation at ODTUG KScope’18 at Walt Disney World Florida this June, where I’ll cover some of these extended data flow features along with Oracle Big Data Cloud and Oracle Event Hub Cloud Service in my session “BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition”.


Cloud of course is great, but not every Oracle Analytics customer has made the move yet; or, like myself, you might have a cloud instance you can spin-up as-and-when you need it but then use regular on-premises OBIEE12c in a VM for your day-to-day learning and experimentation.

With Oracle’s cloud-first strategy and focus on Oracle DV as the main innovation path for analytics features, my OBIEE12c instance running in a Google Compute Engine VM was starting to lag behind OAC v4 and DV Desktop in terms of look-and-feel and all of these new features. So I was pleased to note the other day that the 12.2.1.4.0 (aka v4) release of OBIEE12c is now available for download on OTN, and that it includes many of the features currently available in OAC v4 and DV Desktop (but not the advanced data flow features in OAC Data Lake Edition, for example).

Oracle Data Visualization Desktop has also been updated recently to connect to Oracle Autonomous Data Warehouse Cloud and I’ll be covering that new capability in a blog post in the near future; but more importantly for me at least….

… it also now supports Mac OS X 10.13 High Sierra, which is great news for us Mac users who had to stop using DV Desktop on our laptops late last year when that OS X update came out before DV Desktop supported it.

Timeline Charts, Derived Tables and Analytic Functions in Looker 5

One of the data streams that comes into my personal Google BigQuery data warehouse is a record of all the Hue light switch settings and illuminance levels, Nest thermostat settings and motion sensor readings coming in via the Samsung Smart Things hub under the stairs back home. For the Hue lightbulbs I get a stream of 1s and 0s that initially get stored in a four-column table in BigQuery, like this:

I then create a SQL view over that table that joins to a lookup table containing the device type and room allocated to each of the devices in the data stream, and sets a metric type for each reading so that I can analyze all the thermostat readings together, bring together all the light switch settings and so on.
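
Stripped to its essentials, that view is just a join plus a cast. The sketch below is purely illustrative: the raw table and lookup table names, and the room, device_type and metric columns, are assumptions rather than my exact project objects; only the smartthings_readings name is taken from the queries later in this post.

CREATE OR REPLACE VIEW `aerial-vehicle-148023.personal_metrics.smartthings_readings` AS
SELECT
  r.date_time,
  r.device,
  d.device_type,                               -- from the lookup table
  d.room,
  d.metric,                                    -- e.g. 'switch', 'temperature', 'motion'
  r.raw_value,
  SAFE_CAST(r.raw_value AS FLOAT64) AS value   -- numeric readings cast cleanly; 'on'/'off' handled downstream
FROM `aerial-vehicle-148023.personal_metrics.smartthings_raw` r
JOIN `aerial-vehicle-148023.personal_metrics.smartthings_devices` d
  ON r.device = d.device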

It’s then quite straightforward to bring this SQL view into Looker and add it to the rest of the data streams I’ve brought together there, so I can see, for example, whether the kitchen lights get turned off after everyone’s gone upstairs after breakfast or whether I should try and automate those as well as my kettle.

Looker 5 introduced another way to visualize data with defined start and end events in the form of the new Timeline Chart, a feature I tweeted about the other day when I brought data on tasks from Asana into my BigQuery data warehouse via a Fivetran integration.

It’d be interesting to use this type of chart to show how one particular light switch changed from on to off and then on again over a period of time, or take all the Hue lights for a given part of the house and see whether anybody turns the lights off ever … or whether they need to be linked to the motion sensors and automatically turned-off after a period of inactivity in each room.

To use this type of chart I first need to create an “end date” field for each of my event records, so that each timeline has a start and end date, with the other fields for label and value being optional. Looking at the list of events for the light switch I looked at earlier, and ordering the events by time recorded, you can see the information I need is there:

SELECT
*
FROM
`aerial-vehicle-148023.personal_metrics.smartthings_readings`
WHERE
device = 'kitchen seating area spot 1'
AND raw_value IN ('off','on')
ORDER BY
date_time

What this calls for is BigQuery Standard SQL’s LEAD analytic function, which returns a value from a subsequent row to the one we’re working with, and I can combine that with the TIMESTAMP_DIFF function to compute the time between the start and end event dates, as that would be useful to know too. The SQL expressions for the two new fields would therefore be:

LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) AS end_date_time,
TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, SECOND) AS value_duration_seconds

I could add these column expressions into the SQL view definition in BigQuery and order the rows by device, metric and timestamp, like this:

SELECT
  date_time AS date_time,
  device AS device,
  metric,
  value,
  raw_value,
  LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) AS end_date_time,
  TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, SECOND) AS value_duration_seconds
FROM
  `aerial-vehicle-148023.personal_metrics.smartthings_readings`
WHERE
  device = 'kitchen seating area spot 1'
  AND metric = 'switch'
  AND raw_value IN ('on','off')
ORDER BY
  2, 3, 1

Executing that SQL shows the logic is working, but I’d then have to maintain that view within BigQuery and that might not be the most convenient place to add new code.

Instead, I could just go into Looker and create a new view there based on a derived table SQL expression and do the work there. Right now my LookML model looks like the excerpt below, where you can see the sensor readings view joined into the rest of the explore so all my data can be analyzed together.

connection: "rittman_bigquery"
include: "*.view.lkml" # include all views in this project
include: "*.dashboard.lookml" # include all dashboards in this project
explore: date_dim {
  case_sensitive: no
  label: "Data Warehouse"
  join: fluentd_transactions {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
    relationship: many_to_many
  }
  join: smartthings_readings {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
    relationship: one_to_many
  }
  join: fluentd_uber_rides {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
    relationship: many_to_many
  }
}

I now create a new LookML view that uses a derived table SQL query as the view definition rather than simply referencing an existing BigQuery table. Note how I’ve used the same view_label as the main LookML view containing my event data, so that the dimensions and measure I define here appear alongside all the other smart device fields in the same explore view.

view: device_event_end_and_timespan {
  view_label: "6 - Smart Devices"
  derived_table: {
    sql: SELECT
           date_time AS date_time,
           device AS device,
           metric AS metric,
           LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) AS end_date_time,
           TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, SECOND) AS value_duration_seconds
         FROM `aerial-vehicle-148023.personal_metrics.smartthings_readings`
         ORDER BY 2, 3, 1 ;;
  }
  dimension: date_time {
    type: date_time
    hidden: yes
    sql: ${TABLE}.date_time ;;
  }
  dimension: device {
    type: string
    hidden: yes
    sql: ${TABLE}.device ;;
  }
  dimension: metric {
    type: string
    hidden: yes
    sql: ${TABLE}.metric ;;
  }
  dimension_group: end_date_time {
    group_label: "End Date"
    label: "End"
    type: time
    timeframes: [
      raw,
      time,
      hour,
      hour3,
      hour4,
      hour6,
      hour12,
      hour_of_day,
      time_of_day,
      minute,
      minute5,
      minute10,
      minute15,
      minute30,
      day_of_week,
      day_of_month,
      day_of_year,
      date,
      week,
      week_of_year,
      month,
      month_name,
      month_num,
      quarter,
      year
    ]
    sql: ${TABLE}.end_date_time ;;
  }
  dimension: pk {
    primary_key: yes
    hidden: yes
    type: string
    sql: concat(cast(${TABLE}.date_time as string), ${TABLE}.device) ;;
  }
  measure: value_duration_seconds {
    type: average
    sql: ${TABLE}.value_duration_seconds ;;
  }
}

Then I join this derived table view back into the explore within my LookML model, as below:

connection: "rittman_bigquery"
include: "*.view.lkml" # include all views in this project
include: "*.dashboard.lookml" # include all dashboards in this project
explore: date_dim {
  case_sensitive: no
  label: "Data Warehouse"
  join: fluentd_transactions {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
    relationship: many_to_many
  }
  join: smartthings_readings {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
    relationship: one_to_many
  }
  join: device_event_end_and_timespan {
    type: left_outer
    sql_on: ${smartthings_readings.date_time} = ${device_event_end_and_timespan.date_time} and
            ${smartthings_readings.device} = ${device_event_end_and_timespan.device} and
            ${smartthings_readings.metric} = ${device_event_end_and_timespan.metric} ;;
    relationship: one_to_one
  }
  join: fluentd_uber_rides {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
    relationship: many_to_many
  }
}

and now I can create timeline charts that show me which lights were on over a particular period of time, like this:

or more importantly, work out why the bathroom never seems to be free when you’ve got two teenage kids in the house.

Monzo, BigQuery, Looker and “Fintech”: The Other London Tech Startup Scene

I’ve written a number of blogs over the past few months about bringing my fitness, smart device and other digital data sources into Google BigQuery and Looker to do interesting analysis with the combined dataset, whilst also keeping my tech skills up to date and relevant for my day job as Product Manager for Analytics at Qubit, a London tech startup in the hot new MarTech (“Marketing Technology”) space.

At Qubit we also use Google BigQuery and Looker to enable our customers to run fast, interactive queries against petabytes of consumer behavioural data, with BigQuery abstracting away the hassles of running a production data warehouse and Looker abstracting away the complexities of SQL, joining datasets together and trying to visualize it all in D3 or RStudio.

One big gap in the data I’d managed to collect up until recently was anything to do with what I actually spent each day. Whilst my current bank, along with most others, provides a means to manually download statement data in Excel or CSV format, or supports commercial services that load that data into your personal finance software for a monthly fee, I wanted to access my data in the same way that I access my health, timekeeping and cycling workout data … through a REST API, cURL and a cron job that loads new data into BigQuery on the hour, every day.

So that’s why I smiled when I heard about Monzo, another tech startup just a mile or so from my office in London that’s disrupting another industry: consumer banking. Monzo, Starling and other new app-only banks are taking advantage of new regulations such as the EU’s Payment Services Directive (PSD2) that open up existing bank accounts to authorised third-parties looking to provide a better user experience (UX) over your existing bank arrangements; or, as in Monzo’s case, actually obtaining a banking license and offering a complete end-to-end service that focuses on solving customer problems rather than on you providing a cheap source of money for the bank to lend.

Along with other technology-led startups working in the financial services space Monzo are the best example yet of a compelling new financial services company with a strong appeal to consumers more familiar today with Facebook than chequebooks.

Image Courtesy of Medici, “Awe-Inspiring London FinTech Industry Is Firing on All Cylinders”

Me, I was more interested in the fact that Monzo also use Looker and Google BigQuery to run their analytics infrastructure, and that they also offer a developer-orientated REST API: exactly what I’ve been looking for to enable myself, and increasingly others in the developer community, to download, analyze and visualize our spending data and start to build applications that, when the full Monzo API comes out, will revolutionize how we do our banking in the coming years.


To start analyzing my banking data through Monzo I first needed to do a batch download of all my historic transactions and top-ups and put that into a BigQuery table, and then set up a trigger on Monzo’s systems that sends across all subsequent transactions as they happen, to keep my data in-sync with Monzo’s record of my bank account.

To set up the initial transaction history transfer I first registered at https://developers.monzo.com/ and then generated an access token for my original pre-paid Monzo card and the current account card that recently replaced it. Note that this access token only lasts for a short time, so you can’t generate it just once and use it in a script forever; more importantly, the developer API is meant just for personal use and can’t be used to access other people’s accounts, so you can’t build one of the new apps made possible by the UK’s new open banking regulations just yet (but this is coming soon through their interim AIS API, now in testing with a small number of third-party developers).

To find out the account IDs for your Monzo accounts either use the API playground web app as I’ve done in the screenshot above, or make your first call to the Developer API yourself via the cURL tool:

curl -G "https://api.monzo.com/accounts" -H "Authorization: Bearer <<YOUR_ACCESS_TOKEN_HERE>>" > ./accounts.json

and then retrieve your account IDs from the JSON output using a tool such as jsonpp (brew install jsonpp if you’re on a Mac; similar tools are available for other OSs):

{
  "accounts": [
    {
      "id": "acc_<<MY_PREPAID_CARD_ID>>",
      "created": "2017-04-06T07:07:15.175Z",
      "description": "Mark Rittman",
      "type": "uk_prepaid"
    },
    {
      "id": "acc_<<MY_CURRENT_ACCOUNT_ID>>",
      "created": "2017-11-28T08:35:28.323Z",
      "description": "Mark Robert Rittman",
      "account_number": "<<MY_AC_NUMBER>>",
      "sort_code": "<<MY_SORTCODE>>",
      "type": "uk_retail"
    }
  ]
}

Then you can retrieve all your account transactions with another cURL request, like this one for my current account:

curl -G "https://api.monzo.com/transactions" -d account_id=acc_<<MY_CURRENT_ACCOUNT_ID>> -H "Authorization: Bearer <<YOUR_ACCESS_TOKEN_HERE>>" > ./transactions.json

and then you can use tools such as “sed” (to strip out the wrapping “transactions”: [] array element from the JSON output) and one of the many json2csv converters out there (for example this one that has npm as a prerequisite, something I’m guessing most people attempting this sort of project will have already; if not, first install npm via brew and you’re ready):

sed -i -e 's/{"transactions"://g' ./transactions.json
sed -i -e 's/]}/]/g' ./transactions.json
brew install json2csv
json2csv -i ./transactions.json -f created,amount,description,currency,category,include_in_spending,<<ANY_OTHER_REQUIRED_FIELDS>> -o transactions.csv

There are many more fields available to extract from the JSON documents downloaded via the developer API, depending on whether you’re accessing a Monzo prepaid card or current account; the developer docs on the Monzo site go through some of them, but you’re best going through a handful of transactions to see what’s available and then creating your own custom -f extract list.

Then as the final step I use the bq command-line tool to load the data in the CSV file output in the previous step into a BigQuery table, like this:

bq load --source_format=CSV --replace=true --skip_leading_rows=1 --autodetect dw_staging.monzo_transactions_stg ./transactions.csv

and then go into the BigQuery Web UI to check my transaction data has come in as expected.

In practice, for my analysis exercise I brought in a lot more fields including merchant name and map X/Y location, Foursquare metadata and other fields that are only available for the current account product, so check through the JSON elements available for your particular Monzo product and extract the ones that most interest you.


So what about capturing transactions after this date; how do we do that without some complicated process that downloads all the transactions again and then applies just the ones that are new to the BigQuery table? The way I bring all my other app, smart device and other event data into BigQuery is to either post it directly to a fluentd server running on Google Compute Engine, or use services such as IFTTT and Zapier that interface with my apps and then post to fluentd using outgoing webhook services (or their “Maker Channel”, as IFTTT used to call that service when I wrote about it in a blog post back in 2016).

The Monzo Developer API also comes with a webhook feature that lets you register a URL, for example the one I use when posting all my other event data to fluentd, with all subsequent transaction data then being posted to that server URL and on to the various BigQuery tables that store my data for analysis.

In the case of the Monzo transactions I added an intermediate step and had the Monzo webhook post to a Zapier “zap” that made it easier to decode the nested JSON elements Monzo was sending over that held merchant and other payment processor-related information.

Zapier then sends over just those elements of the transaction I’m interested in to the fluentd server, which then streams them into BigQuery using the fluentd BigQuery plugin. From there I can write SQL to analyze this data in BigQuery’s web UI, remembering to divide the transaction amount by 100 as amounts are all stored as pence in integer datatypes within the Monzo system, and to remove any top-ups, declines and other non-spend transactions from the results by filtering out records where the “include_in_spending” column is false.
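
As a sketch of that kind of query, a monthly spend-by-category breakdown might look like the SQL below; the table name is the staging table loaded earlier via bq, and whether include_in_spending arrives as a boolean or a string, and whether spend really shows up as negative amounts, is worth verifying against your own loaded data before relying on it.

SELECT
  FORMAT_TIMESTAMP('%Y-%m', created) AS month,
  category,
  ROUND(ABS(SUM(amount)) / 100, 2) AS spend_gbp   -- amounts are stored as pence
FROM `dw_staging.monzo_transactions_stg`
WHERE include_in_spending = true                  -- drops top-ups, declines and other non-spend records
  AND amount < 0                                  -- assumption: debits are recorded as negative amounts
GROUP BY month, category
ORDER BY month, spend_gbp DESC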


Analyzing the data in a tool such as Looker makes things much more interesting as I can start to make use of the locational metadata for merchants that Monzo provide for current account transactions, showing me for example where I took out cash from ATMs on the way back home from the pub and then forgot I’d done so the morning after.

Or I can see how my spend on essentials like entertainment and eating out rises in the summer, especially when I appear to do away with non-essentials like food and other groceries.

I can see where it was I spent the most money when out on cycling trips in the summer, mostly in pubs around the New Forest and up near Bath, Frome and Warminster (and how I wish winter would end so I could get back out on my bike again).

I can even work out how my choice of restaurant and entertainment changed from lots of spend on music in August through to lots of eating out in October.

And more importantly, I can see I’m spending far too much money on coffees and breakfast on the way into work each day, but that’s the downside I guess of working just off Covent Garden in what must be one of the nicest work locations I’ve ever had the privilege to base myself out of.

The beauty of Looker is that I can define a data model in LookML that takes table structures and SQL expressions and turns them into more user-friendly measure and dimension definitions, like this:

These LookML “views” can then be joined to other views that share a common join key — in my case, the truncated “date” part of a date and time timestamp field — that then makes it possible for me to analyze spend data in the context of where I was on that day and that time; whether I was out on the bike and if so whether I’d had a good workout or not so good; what the weather was like on the day and what I’d been chatting about on Slack at the time; whether the spend was for an Uber ride and if so where to.
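
Under the covers that date-level join boils down to SQL along the lines of the sketch below, although in practice Looker generates this sort of query from the LookML joins rather than me writing it by hand; the Strava table name and its start_time and distance_km columns are hypothetical stand-ins for whatever your own ride data looks like.

WITH daily_spend AS (
  SELECT DATE(created) AS activity_date,
         ROUND(ABS(SUM(amount)) / 100, 2) AS spend_gbp
  FROM `dw_staging.monzo_transactions_stg`
  WHERE include_in_spending = true
    AND amount < 0
  GROUP BY activity_date
),
daily_rides AS (
  SELECT DATE(start_time) AS activity_date,       -- hypothetical ride/workout table
         SUM(distance_km) AS km_ridden
  FROM `personal_metrics.strava_rides`
  GROUP BY activity_date
)
SELECT s.activity_date, s.spend_gbp, r.km_ridden
FROM daily_spend s
LEFT JOIN daily_rides r USING (activity_date)
ORDER BY s.activity_date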

Analysis that shows me, now I’m finally able to join my financial and activity dots together for the first time, where all my money’s gone since the nights got darker and I’m clearly in the pub rather than out on my bike. Luckily spring is around the corner, or I’ll be applying for Monzo’s new overdraft service that’s just gone into review rather than registering new Strava personal records next month.

Oracle Analytics Cloud … and the (Welcome) Return of the Enterprise BI Platform

Oracle Analytics Cloud, the “full” implementation of Oracle’s business analytics platform tools in Oracle Public Cloud, was released back in May 2017, and I covered its basic product proposition in a Drill to Detail Podcast episode with Oracle’s Vasu Murthy a few weeks before then.

Up until now I’d not really had a chance to properly review Oracle Analytics Cloud, as the products I work with right now are mainly based around Google’s Cloud Platform and the Looker BI tool, but Oracle kindly made available some cloud promotion credits through their Oracle ACE Director community program and so I gave it, along with a few other supporting cloud services, a spin over the Christmas break.

In the end it was only really a high-level evaluation, but with my previous experience with Oracle’s BI Cloud Service, Data Visualization Desktop and Cloud Services and the on-premises versions of their BI Enterprise Edition product line I was able to get up and running fairly quickly. More interesting for me was trying to work out who the target market is for Oracle Analytics Cloud and what particular use-cases it tries to address; why Oracle released Oracle Analytics Cloud when they already have Oracle BI Cloud Service, and how they differentiate the two products and the two markets they presumably serve; and to try and speculate on why Oracle made the choices it did when coming-up with Oracle Analytics Cloud given the competition it faces in the market.


For anyone not so familiar with Oracle’s business analytics products: you’ve basically got a classic end-to-end enterprise BI platform, Oracle Business Intelligence 12c, that covers everything from reporting tools through business metadata modeling to application and database server integration, and that was typically sold to the IT function within corporate and government customers. It was recently joined by Oracle Data Visualization, which turned all of this on its head and sold to the sorts of buyers who were now buying Tableau on their company credit cards and bypassing the IT department so that they could actually get something done.

Oracle’s first forays into moving all this into the cloud were focused again on the needs of these non-technical business buyers, starting with Oracle Business Intelligence Cloud Service (“BICS”), a departmental cut-down version of Oracle Business Intelligence that I covered in an article for Oracle Magazine when the product first came out. Over time BICS was extended to include the data visualization capabilities being introduced with the Data Visualization Option for the new Oracle Business Intelligence 12c release, and then just those visualization features were packaged up into another end-user focused cloud service, Oracle Data Visualization Cloud Service, that became available around two years ago.

One thing you couldn’t do with any of those cloud tools, though, was migrate your on-premises Oracle BI deployments into Oracle’s Public Cloud, as BICS made you use a cut-down version of the Oracle Database along with a much-simplified data modeling tool to store and organize your data for reporting. Around the same time as Data Visualization Cloud Service came out, however, it started to become possible to “lift-and-shift” your full enterprise business metadata models into BICS and run them against a full Oracle Database instance running in the cloud, or against on-premises databases using a special secure connector between Oracle’s cloud data centres and the one running your Oracle database servers.

“Lifting and Shifting” an on-premises RPD into Oracle BI Cloud Service

What some customers still wanted though was more than this; BICS, DVCS and all the other BI products Oracle offered were fully-managed services, not unlike Looker and Google’s BigQuery distributed database service that I use day-to-day in my current product management role at Qubit. What these customers wanted was the full on-premises Oracle Business Intelligence product running in Oracle Public Cloud, which they could connect into securely, with the ability to manage the configuration settings to suit their particular needs and to choose when to apply patches and when to run backups to fit their standard operating model.

What they also wanted was full, unfettered access to Oracle’s BI Server metadata layer, so they could not only upload their on-premises business metadata models but then extend and update them, and at the same time incorporate elements of functionality from Oracle’s Essbase Server that had also recently made its own transition into Oracle Public Cloud. Crucially, these customers still wanted to pay for all of these software costs monthly out of OpEx, but also wanted the license metric to move back to per-processor rather than the named-user basis BICS and DVCS used, so they could roll-out cloud analytics to everyone in the organization rather than just a few users and also spin-up test, development and pre-production environments if they’d licensed enough processor capacity to run them all. This, then, was the set of requirements Oracle Analytics Cloud was put together to meet, and so I took a look at what v1 of this new product looked like over the time between Christmas and New Year.


Installing Oracle Analytics Cloud is more like installing Tableau Server, or one of the other pre-rolled BI server software VMs you find on services such as AWS Marketplace, than getting a new account on BICS or one of the other Oracle fully-managed services running on Oracle Public Cloud. You first have to set up some object storage space using Oracle Cloud Storage to hold the various configuration and log files OAC will require, then set up a database instance using Oracle Database Cloud Service to hold the RCU schema that OAC will require.

Creating the Oracle Analytics Cloud service

Oh, and you almost definitely need some prior knowledge of how to set up and configure Oracle software, and of the previous on-premises version of Oracle Business Intelligence, before any of this makes sense; Oracle Analytics Cloud is definitely aimed at the IT department rather than casual business users, and moreover at those IT departments already invested in Oracle’s analytics, database and cloud technologies, for whom this will all seem very familiar and in fact very reassuring — this is Oracle’s full on-premises functionality ported up into the cloud, not some cut-down managed-for-you version that hides all the controls and is impossible to manage via your own DevOps scripting.

Truth be told, I found the installation process just baffling when I first ran through it. Being more used these days to services such as Looker, Google BigQuery, Cloud DataPrep and Google Data Studio, being presented with software where you actually need to understand the installation process and what it’s doing, and where that process requires mandatory knowledge of other current and historic products related to the one being installed, seemed crazy in comparison to the startup-friendly SaaS products I now work with.

But not every organization is a startup, and not every user of an installer utility is a non-technical business user; complex enterprise software is complex because enterprises and large government customers have gone beyond simple customer success departments, growth hackers and biz ops teams, and instead need to support operations across multiple countries, integrate with legacy systems you’ve never heard of and hopefully never will, and do everything according to strict sets of rules such as Sarbanes-Oxley and HIPAA that land you in jail if you don’t comply with them. These types of organization have very strict rules and procedures around how software is hosted, accessed and managed, and vendors like Oracle know these customers’ needs well; in most cases those customers would prefer a complex but controllable installation process over one that hides all the details but potentially leaves them exposed to regulatory issues thereafter, meaning they can’t ever sign-off the provisioning process as complete and fully-compliant.

And one of the first lessons I learnt in product management is that whilst engineers and product managers prefer packages that give end-users all available functionality for one simple price, enterprise salespeople much prefer options.

PMs like packages, enterprise sales prefer options

With packages sold to enterprises with a single list price advertised — say, $1m all-in for the right to run everything on a single high-powered server — what happens in reality is that one or two customers pay that price, but the vast majority get some sort of discount so that everybody pays something and no sales are lost just because the customer couldn’t afford the list price.

What works far better when selling enterprise deals is when everything beyond a basic core set of functionality becomes an option and customers can then align what they want with what they can afford without the vendor giving everything away each time and massively-discounting deals to make them affordable.

Hence Oracle Analytics Cloud having both Standard and Enterprise Editions, the Standard Edition coming in two variants based around just data visualization or just Essbase, and an upcoming Data Lake Edition that will include other net-new functionality, with presumably more functionality being added over time as additional options that will need further license spend by the customer before becoming available to their end-users.

Just remember, when you’re sitting in the engineering department mocking the salespeople, that extracting the right price from the right customers whilst leaving everyone feeling they’ve got a good deal is what b2b selling is all about, which is why those salespeople are about the only people who are paid more than the engineers in most startups.


So what does Oracle Analytics Cloud look like once you’ve uploaded or pointed it towards some data, and started to visualize and analyze it using the various tools available to you in the platform? Well, it’s effectively Oracle Business Intelligence 12c with the Data Visualization Option, at least in the configuration I’d chosen: “Self-Service Data Visualization, Preparation and Smart Discovery” along with “Enterprise Data Models”, which translates to Visual Analyzer plus Answers, Dashboards and a full RPD in old on-premise terms. There’s the new home page inherited from Visual Analyzer and a new console that allows you to perform admin functions, define and work with data sources and download the BI Administration tool if the online modeler is too simplistic for your project.

Oracle Analytics Cloud Console and Home Page

I was able to upload and then analyze some of my cycling stats from Strava, working out that Saturdays in September were my best time for logging the miles and working off some calories in the last twelve months.

Visual Analyzer within Oracle Analytics Cloud

I was also pleased to see that the classic Answers interface is still available, along with the original home page, recent and popular content and task menus.

Classic Answers interface from OBIEE12c within Oracle Analytics Cloud

And this gets to what the central appeal of Oracle Analytics Cloud really is; it’s full Oracle Business Intelligence 12c, Oracle Data Visualization and Oracle Essbase with all the configuration options, IT appeal and product capabilities that were built-up over the years based on the requirements of large-scale, sophisticated and demanding enterprise and government customers. It’s not Superset or Looker, it makes no sense to anyone who’s not already invested in the Oracle analytics and database ecosystems and it can meet just about any analytics requirement, and then some … and now it all runs natively in the cloud.

Sadly my trial account and promotional credits ran out after a couple of weeks but if Oracle are ever kind enough to extend that access sometime in the New Year I’d be glad to roll-up my sleeves, dust-down my old book on how it all works and give this new cloud incarnation of the product that defined my career for the best part of ten years a further look.