Posts tagged Analytics
Continuous Integration and Automated Build Testing with dbtCloud

If you read my recent blog post on how Rittman Analytics built our operational analytics platform running on Looker, Stitch, Google BigQuery and dbt, you’ll know that we built out the underlying data infrastructure using a modular Extract, Load and Transform (ELT) design pattern, like this:

ra_analytics_architecture (1).png

We version-control our dbt development environment using git, and do all new development beyond simple bug fixes in git feature branches, giving us a development process for dbt that looked like this:


1. We’d start by cloning the master branch of our dbt git repo to the developer’s workstation, which would also have dbt installed locally along with the Google Cloud SDK so that the developer could connect to our development BigQuery dataset.


2. Create a new, local git branch for the new feature using the git CLI or a tool such as Github Desktop.


3. Develop the new feature locally using the developer’s install of dbt, committing any changes to the feature branch in that developer’s local git repo after checking that all dbt tests have run successfully.

locals-imac:ra_dw markrittman$ dbt run --models harvest_time_entries harvest_invoices --target dev
Running with dbt=0.13.1
Found 50 models, 8 tests, 0 archives, 0 analyses, 109 macros, 0 operations, 0 seed files, 35 sources

20:53:11 | Concurrency: 1 threads (target='dev')
20:53:11 | 
20:53:11 | 1 of 2 START table model ra_data_warehouse_dbt_dev.harvest_invoices.. [RUN]
20:53:13 | 1 of 2 OK created table model ra_data_warehouse_dbt_dev.harvest_invoices [OK in 1.80s]
20:53:13 | 2 of 2 START table model ra_data_warehouse_dbt_dev.harvest_time_entries [RUN]
20:53:15 | 2 of 2 OK created table model ra_data_warehouse_dbt_dev.harvest_time_entries [OK in 1.70s]
20:53:15 | 
20:53:15 | Finished running 2 table models in 5.26s.

Completed successfully

locals-imac:ra_dw markrittman$ dbt test --models harvest_time_entries harvest_invoices --target dev
Running with dbt=0.13.1
Found 50 models, 8 tests, 0 archives, 0 analyses, 109 macros, 0 operations, 0 seed files, 35 sources

20:53:37 | Concurrency: 1 threads (target='dev')
20:53:37 | 
20:53:37 | 1 of 4 START test not_null_harvest_invoices_id....................... [RUN]
20:53:39 | 1 of 4 PASS not_null_harvest_invoices_id............................. [PASS in 1.50s]
20:53:39 | 2 of 4 START test not_null_harvest_time_entries_id................... [RUN]
20:53:40 | 2 of 4 PASS not_null_harvest_time_entries_id......................... [PASS in 0.89s]
20:53:40 | 3 of 4 START test unique_harvest_invoices_id......................... [RUN]
20:53:41 | 3 of 4 PASS unique_harvest_invoices_id............................... [PASS in 1.08s]
20:53:41 | 4 of 4 START test unique_harvest_time_entries_id..................... [RUN]
20:53:42 | 4 of 4 PASS unique_harvest_time_entries_id........................... [PASS in 0.83s]
20:53:42 | 
20:53:42 | Finished running 4 tests in 5.27s.

Completed successfully

4. At this stage, all dbt transformations were deployed to our development BigQuery dataset.

5. When the feature was ready for deployment to our production BigQuery dataset, the developer would push the changes in their local branch to the remote git repo, creating that branch if it didn’t already exist.


6. Then they’d create a pull request using Github Desktop and the Github web UI, summarising the changes and new features added by the development branch.


7. I’d then review the PR, work out whether the changes were safe to merge into the master branch, and accept the pull request. Overnight, our dbtCloud service would then clone the master branch of that development git repo and attempt to deploy the new set of transformations to our production BigQuery dataset.

And sometimes, if for whatever reason that feature branch hadn’t been properly build-tested, that scheduled overnight deployment would then fail.


So having noticed the new Build on Pull Request feature that comes with paid versions of dbtCloud, we’ve upgraded from the free version to the $100/month basic paid version and added an automated “continuous integration” build test to our feature branch development process so that it now looks like this:


To set this automated build test feature up, we first linked our Github account to dbtCloud and then created a new dbtCloud job that triggers when a user submits a pull request.


8. Now, when I come to review the pull request for this feature branch, there’s an automated test added by dbtCloud that checks to see whether this new version of my dbt package deploys without errors.


9. Looking at the details behind this automated build test, I can see that dbtCloud has created a dataset and run through a full package deployment and test cycle, avoiding any risk of breaking our production dbt environment should that test deployment fail.


10. Checking back at the status of the test build in Github, I can see that this test completed with no issues, and I’m therefore safe to merge the changes in this feature branch into our development master branch, ready for dbtCloud to pick up these changes overnight and deploy them as scheduled into production.


Docs on this recent dbtCloud feature are online here, and you’ll need at least the basic $100/month paid version of dbtCloud for this feature to become available in the web UI. It’s also dependent on Github as the git hosting service, so not available as an option if you’re using Gitlab, BitBucket or CodeCommit.

How Rittman Analytics does Analytics: Modern BI Stack Operational Analytics using Looker, Stitch, dbt and Google BigQuery

As well as creating modern BI stack analytics solutions for clients such as Florence, Let’s Do This and Colourpop using technologies from our partners at Looker, Stitch, Segment, Fivetran and Snowflake, it’s probably no surprise that we’ve used those same technologies and project experience to build out our own internal analytics platform, covering data and operational analytics use-cases such as:

  • Showing high-level KPIs that track progress towards our four main business objectives

  • Providing actionable context for these high-level KPIs and the ability to drill-into the detail behind them

  • Enabling ad-hoc querying and analysis of our complete dataset by technical and non-technical users

  • Providing a trusted, integrated analysis-ready data platform for other internal projects


I thought it might be of interest to go through how we’ve used the tools and techniques we use for client projects but in this case to deliver an analytics solution for our own internal use, sharing some of our thinking around how we defined our underlying KPI framework, the tools and design patterns we used when building out the data management part of the platform, and some of the front-end components and data visualizations we’ve put together using the Looker BI tool.

Defining your KPI Framework

When creating any sort of analytics solution, the first thing you need to understand is: what are your business objectives, what do you need to measure to understand your progress towards those objectives, and how do you then define success?

Key Performance Indicators (KPIs) and measures within our analytics platform are based around what's needed to measure our progress towards four business objectives we set ourselves for FY2019:

  1. Increase Billing Revenue

  2. Increase Profitability

  3. Increase customer retention, and

  4. Increase our operational and delivery efficiency

In practice, we “certify” data around these four primary KPIs and also a number of secondary measures, with data in the platform normally up-to-date as of the previous business day and with historical data going back to when we started the business.


Other metrics, measures and dimension attributes are provided alongside these certified KPIs and measures and are considered “best endeavours” in terms of our certifying their accuracy; users are then of course able to create their own calculations for use in reports and analyses outside of this formal curation.

Platform Architecture

At a high-level our internal analytics platform extracts data on a regular basis from the SaaS-based services we use to run our business operations, joins that data together so that each dataset uses the same definition of clients, products and other reference data, then makes that combined dataset available for analysis as a set of business KPIs and an ad-hoc query environment.


The design of our internal analytics platform was based around a modular BI technology stack, ELT ("Extract, Land and Transform") data warehouse loading pattern and modern software development approach that we use on all of our client analytics development engagements.

Data Pipeline and Source Connectivity

We’re digital-first, paperless and run our business operations using cloud-hosted software services such as:

  • Harvest for timesheets, project invoicing and client-rechargeable expenses

  • Harvest Forecast for project resource planning and revenue forecasts

  • Xero for accounting and payroll

  • Hubspot CRM for sales and contact management

  • Docusign for contract management

  • Jira for project management and issue tracking

  • BambooHR for human resources and recruitment

On the delivery side, Rittman Analytics partners with four different data pipeline-as-a-Service software vendors, each of which targets a different segment of the market and that segment’s particular use-cases:

  • Stitch, who focus on small to mid-market startup customers with data engineers who value service extensibility and their land-and-expand pricing model ($100/month for up to five sources, including Hubspot and Xero certified and Harvest community-supported connectors)

  • Fivetran, aimed more at mid-market to enterprise customers and the data analyst personas, with a more turnkey solution and increased levels of service and coverage of enterprise data sources

  • Segment, more of an "enterprise service bus" for event-level digital data streams, connecting multiple event producers with downstream event consumers and offering additional features around customer journey mapping, personas and schema validation

  • Supermetrics, aimed at growth hackers and digital marketers with a focus on simple self-service setup, comprehensive data dictionaries and a broad catalog of advertising, marketing and social data sources delivered through a Google Sheets add-in or, more recently, a direct connector into Google BigQuery through the GCP Marketplace.

We chose to use Stitch's service for our data pipeline for this internal project due to their out-of-the-box ability to source data from Harvest and Harvest Forecast; functionally both Fivetran and Stitch would meet our needs other than support for these two particular sources, and we've used Fivetran extensively on other internal and client projects in the past.


We also use Segment's event tracking service to track visitor activity on our website and land those raw tracking events in our BigQuery project as well, and we intend to make use of Segment's Personas product in due course to enable us to build out 360-degree views of our clients, site visitors and prospects based on their interactions across all of our various digital touchpoints (web, social media, inbound and outbound lead generation, chat etc.).

Data Transformation, Data Integration and Orchestration

We transform and integrate raw data that Stitch syncs from each of our SaaS application sources into an integrated, query-orientated data warehouse dataset using the open-source dbt toolkit.

Once we'd decided to work with a data pipeline-as-a-service such as Stitch together with a SQL-based data management platform like Google BigQuery, the decision to transform and integrate our data via a series of SQL SELECT statements was the obvious next design choice; by using dbt and version-controlling our scripts in a Github repository we increased our productivity when developing these transformations, adhered to modern software design principles and avoided cut-and-paste scripting in favour of templated, maintainable code.


Transformation steps in dbt are basically SQL SELECTs, with references to metadata models replacing hard-coded table names and macros serving as reusable, often cross-database transformation components, giving three main benefits over hand-written, author-specific code:

  • Increased analyst productivity through leveraging libraries of best-practice transformation components

  • More maintainable loading processes with concise logic that “doesn’t repeat yourself” (DRY)

  • Portable code that can be deployed on BigQuery today or Snowflake tomorrow if required

For example, the code snippet below references two upstream models and uses a dbt_utils macro, deduplicating Stitch-replicated rows via each record’s _sdc_sequence value.

    SELECT
        *
    FROM (
        SELECT
            {{ dbt_utils.datediff('start_date', 'end_date', 'day') }} + 1 AS forecast_days,
            a._sdc_sequence,
            -- keep only the most recent Stitch-replicated copy of each row
            -- (key columns a.id and p.id reconstructed from context)
            MAX(a._sdc_sequence) OVER (PARTITION BY a.id) AS latest_sdc_sequence
        FROM
            {{ ref('harvest_forecast_assignments') }} a
        INNER JOIN
            {{ ref('harvest_forecast_people') }} p
            ON a.person_id = p.id
    )
    WHERE
        _sdc_sequence = latest_sdc_sequence

In addition to the open-source dbt toolkit we also use the commercial dbtCloud service from our friends at Fishtown Analytics, mainly so that we can execute our transformation graph in the cloud, like this:


dbtCloud also hosts the data dictionary automatically generated from the metadata in our dbt package, and diagrams the transformation graph and data lineage we’ve defined through the dependencies we created between each of the models in the package.


As well as standardising each data source and mapping each source's set of customer references into a single, master customer lookup table that allows us to analyze operations from sales prospect through project delivery to invoicing and payment, we also turn every customer touchpoint into events recorded against a single view of each customer, like this:


We then assign a financial revenue or cost value to each event in the customer journey, calculate metrics such as time between interactions (to measure engagement) and value delivered to the client, and use that information to better understand and segment our clients so as to provide the advice and service they’re looking for at their particular stage in their lifecycle and relationship with our team.
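As a sketch of the time-between-interactions calculation described above, the few lines of Python below compute the gap in days between successive events for each customer; the event fields and sample values are illustrative, not our actual warehouse schema:

```python
from datetime import datetime

# Illustrative touchpoint events for one hypothetical customer
events = [
    {"customer_id": "acme", "event_ts": datetime(2019, 1, 5),  "event_type": "proposal_sent"},
    {"customer_id": "acme", "event_ts": datetime(2019, 1, 12), "event_type": "project_started"},
    {"customer_id": "acme", "event_ts": datetime(2019, 2, 1),  "event_type": "invoice_paid"},
]

def days_between_interactions(events):
    """Return, per customer, the days elapsed between successive events."""
    by_customer = {}
    for e in sorted(events, key=lambda e: (e["customer_id"], e["event_ts"])):
        by_customer.setdefault(e["customer_id"], []).append(e["event_ts"])
    return {
        cust: [(later - earlier).days for earlier, later in zip(ts, ts[1:])]
        for cust, ts in by_customer.items()
    }

print(days_between_interactions(events))  # {'acme': [7, 20]}
```

Lengthening gaps between interactions would then flag clients whose engagement is dropping off.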


Analytics and Data Visualization

We then use Looker as our primary analytics tool over the data prepared by dbt and stored in Google BigQuery. Everyone’s homepage in Looker is set by default to show those four main operational KPIs, with comparison to target and the last six months’ trend history; combined with Looker’s ability to go deep into the detail behind those headline numbers, this gives users the context they need to decide and act on what they’re seeing (note that I’ve altered and obfuscated our actual numbers in the next few screenshots).


We've also created a single business operations-wide Looker explore that enables users to analyze and explore all aspects of our sales, delivery and financial relationship with clients, suppliers, partners and prospects.

In the example below we’re again using the event model and applying sequencing to those events, bucketed by months since first billable day, to help us understand how the demand decay curve looks for each cohort of clients we’ve taken on over the first year we’ve been operating.


Embedded Analytics using Powered by Looker

We also link fields in our Looker explore back to the online services that can provide more background detail on a metric displayed on the screen, in the example below linking back to Harvest to see details of the project displayed in the explore data visualisation.


Finally, for team members who are more likely to be working within our knowledge-sharing and internal company portal site, we use Powered by Looker embedded analytics within Notion, connecting our internal analytics metrics and insights directly into the productivity and workflow tools used by the team, bringing data and analytics to the problems they’re looking to solve rather than the other way around.


So, hopefully this look behind the scenes at how we use these modern analytics stack technologies for our own analytics needs has been useful, and feel free to leave comments or contact me if any of this would be applicable to your business. In the meantime, the dbt and Looker repos used to build out our platform are available as public git repos on Github:

Feel free to fork, improve on what we’ve done, submit PRs or just see a bit more of the detail behind the examples in this post.

Event-Level Digital Analytics using Google Analytics, Fivetran, BigQuery and Looker

A few weeks ago I posted a blog on using Mixpanel, Looker and Google BigQuery to analyze listener data from the Drill to Detail website, with Mixpanel tracking individual episode play events by visitors to the site and Fivetran replicating that event data over to BigQuery for analysis using Looker.

Mixpanel is great, but like most digital businesses using Google Analytics (GA) to analyze and measure activity on their website, we actually use Google Tag Manager (GTM) and GA events to track visitor activity on the website. Whilst a full site license for Google Analytics 360 is way beyond our budget right now, the free version of GA together with GTM provides us with an unsampled, event-level stream of visitor events and, together with goals and simple A/B testing tools, a way to start optimizing the content on our site.

Couple all of this with Fivetran and Looker and we can start to do some interesting analytics on all this data to not only count sessions, visitors and pages viewed but also start looking at visitor retention and the routes those visitors take through the various features on our website.

Events in Google Analytics are a way of recording interactions visitors to your site make with items on a page, such as selecting an item on a navigation menu, clicking on a download link or seeing a special offer or product recommendation. On the website we’ve set up events, for example, to record how far down the page a visitor scrolls, so we can see how many people actually scroll past the hero image on our front page and look at the content on Looker services, Oracle services and so on.
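To illustrate why these scroll-depth events are useful, here’s a minimal Python sketch of the kind of funnel they support; the event shape and threshold values are assumptions for illustration, not the actual GA payload:

```python
# Illustrative scroll-depth events: one row per threshold a visitor reached
scroll_events = [
    {"client_id": "c1", "scroll_pct": 25},
    {"client_id": "c1", "scroll_pct": 50},
    {"client_id": "c2", "scroll_pct": 25},
    {"client_id": "c3", "scroll_pct": 25},
    {"client_id": "c3", "scroll_pct": 50},
    {"client_id": "c3", "scroll_pct": 75},
]

def scroll_funnel(events, thresholds=(25, 50, 75, 100)):
    """Share of visitors who scrolled at least as far as each threshold."""
    visitors = {e["client_id"] for e in events}
    def reached(v, t):
        return any(e["client_id"] == v and e["scroll_pct"] >= t for e in events)
    return {
        t: sum(reached(v, t) for v in visitors) / len(visitors)
        for t in thresholds
    }

print(scroll_funnel(scroll_events))  # every visitor passes 25%, none reach 100%
```

A steep drop-off between two thresholds tells you where on the page visitors stop scrolling.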

Untitled 2.001.png

Another event records which specific hero image a visitor clicks on, and another records which navigation menu item was clicked, in each case storing the image name or menu item as the event label.

Untitled 2.002.png

Whilst you can record events being triggered by adding custom Javascript to your website’s HTML page definitions, we use GTM’s graphical point-and-click interface to set up our events and the visitor actions that trigger them.

Untitled 2.005.png

Enabling GTM’s debug mode for our site shows us which events then trigger when visiting our site, with the screenshot below showing a typical visitor journey through our homepage and the events that are triggered as they scroll-through and interact with various items on the page.

Untitled 2.003.png

Data on events being triggered along with event label and event action values are stored in a Javascript data layer that can also receive values sent over, for example, by an eCommerce application so that clicks on a product catalog item can record the price on offer as well as what product was clicked on. 

Looking at the Data Layer view in Google Tag Manager you can see the event entries recorded for this visitor to the site page, but note also how nothing personally identifiable is recorded except for a unique client ID set at the device/browser level that can then be used to analyze new vs. returning visitors and track visitor retention and engagement over time.
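The way a device-level client ID supports that new-vs-returning analysis can be sketched in a few lines of Python; the session fields here are illustrative:

```python
# Illustrative sessions keyed by the GA client ID (set at device/browser level)
sessions = [
    {"session_id": "s1", "client_id": "c1", "session_start": 1},
    {"session_id": "s2", "client_id": "c2", "session_start": 2},
    {"session_id": "s3", "client_id": "c1", "session_start": 3},
]

def classify_sessions(sessions):
    """Label each session 'new' or 'returning', depending on whether
    its client ID has been seen in an earlier session."""
    seen = set()
    labels = []
    for s in sorted(sessions, key=lambda s: s["session_start"]):
        label = "returning" if s["client_id"] in seen else "new"
        labels.append((s["session_id"], label))
        seen.add(s["client_id"])
    return labels

print(classify_sessions(sessions))
# [('s1', 'new'), ('s2', 'new'), ('s3', 'returning')]
```

Because the ID is per device/browser rather than per person, this measures returning devices, not identified individuals, which is exactly why nothing personally identifiable needs to be stored.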

Untitled 2.004.png

As with the Mixpanel example and the Drill to Detail website, we use Fivetran to replicate all of the data layer events over to Google BigQuery, storing this granular visitor behavioural data in a BigQuery table and dataset. As you can see from the screenshot below, we use Fivetran extensively within MJR Analytics: replicating invoices and all our other accounting data from Xero into BigQuery, along with timesheet data from Harvest, CRM data from Hubspot, commits and PRs from Github, tasks outstanding and completed from Asana, and search rankings and keywords from Google Search Console, all into a Google BigQuery project whose datasets we then combine and analyze together using Looker.

Untitled 2.006.png

For the event data coming across from GA we use Fivetran to replicate the relevant event tracking dimensions and metrics into BigQuery every five minutes, giving us near real-time numbers on what’s popular and what’s engaging on our company website.

Untitled 2.007.png

As the event data from GA arrives as table rows of individual events being triggered, I then sessionize those individual events into rows of visitor sessions, using BigQuery SQL to pivot the incoming data and create individual columns for the first event, second event, first page view and first product item clicked on, for example, within an individual visitor session. The actual SQL used for the view is really too long to include here, but to give you the idea, here’s a condensed version with repeating parts removed for clarity.

  SELECT
    e.global_session_id,
    MAX(e.visitor_session_length_mins) AS visitor_session_length_mins,
    MAX(e.visitor_session_event_seq_num) OVER (PARTITION BY e.global_session_id) AS session_events_count,
    (SELECT COUNT(*)
       FROM `aerial-vehicle-148023.dw_staging.ga_events`
      WHERE global_session_id = e.global_session_id
        AND event_category = 'Page View') AS page_view_count,
    (SELECT COUNT(*)
       FROM `aerial-vehicle-148023.dw_staging.ga_events`
      WHERE global_session_id = e.global_session_id
        AND event_category = 'Nav Bar Clicks') AS nav_bar_click_count,
    e1.event_label AS event_1_label,
    e2.event_category AS event_2_category,
    e2.event_label AS event_2_label,
    pv.event_label AS page_view_1,
    pv2.event_label AS page_view_2,
    igc.event_label AS index_gallery_first_click
  FROM
    `aerial-vehicle-148023.dw_staging.ga_events` e
  LEFT JOIN
    (SELECT global_session_id, event_label,
            ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) AS page_view_seq_num
       FROM `aerial-vehicle-148023.dw_staging.ga_events`
      WHERE event_category = 'Page View') pv
    ON e.global_session_id = pv.global_session_id
    AND pv.page_view_seq_num = 1
  LEFT JOIN
    (SELECT global_session_id, event_label,
            ROW_NUMBER() OVER (PARTITION BY global_session_id ORDER BY visitor_session_event_seq_num) AS page_view_seq_num
       FROM `aerial-vehicle-148023.dw_staging.ga_events`
      WHERE event_category = 'Page View') pv2
    ON e.global_session_id = pv2.global_session_id
    AND pv2.page_view_seq_num = 2
  LEFT JOIN
    `aerial-vehicle-148023.dw_staging.ga_events` e1
    ON e.global_session_id = e1.global_session_id
    AND e1.visitor_event_seq_num = 1
  LEFT JOIN
    `aerial-vehicle-148023.dw_staging.ga_events` e2
    ON e.global_session_id = e2.global_session_id
    AND e2.visitor_event_seq_num = 2
  -- ... remaining joins (igc and others), GROUP BY and filters removed for brevity
Creating a view or materialized table from this BigQuery SQL then gives me one record per session, with event totals, page view and event paths, and session and client IDs, which is then perfect for bringing into Looker for further analysis.
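The pivot that this view performs can be sketched in Python; assuming a simplified event shape, it turns ordered event rows into one row per session:

```python
from collections import defaultdict

# Illustrative GA event rows, one per triggered event
ga_events = [
    {"global_session_id": "s1", "seq": 1, "event_category": "Page View",      "event_label": "/index"},
    {"global_session_id": "s1", "seq": 2, "event_category": "Nav Bar Clicks", "event_label": "Services"},
    {"global_session_id": "s1", "seq": 3, "event_category": "Page View",      "event_label": "/services"},
]

def sessionize(events):
    """Pivot ordered events into one row per session, with event
    counts and first/second page-view columns."""
    sessions = defaultdict(list)
    for e in sorted(events, key=lambda e: (e["global_session_id"], e["seq"])):
        sessions[e["global_session_id"]].append(e)
    rows = []
    for sid, evs in sessions.items():
        page_views = [e["event_label"] for e in evs if e["event_category"] == "Page View"]
        rows.append({
            "global_session_id": sid,
            "session_events_count": len(evs),
            "page_view_count": len(page_views),
            "event_1_label": evs[0]["event_label"],
            "page_view_1": page_views[0] if page_views else None,
            "page_view_2": page_views[1] if len(page_views) > 1 else None,
        })
    return rows

print(sessionize(ga_events)[0]["page_view_2"])  # /services
```

Each dictionary in the result corresponds to one row of the sessionized BigQuery view.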

Untitled 2.008.png

Then finally it’s just a case of bringing that table of data into Looker as a LookML view and using one of Looker’s custom visualization blocks to display visitor event paths as a Sankey diagram, for example.

Untitled 2.009.png

Or we can aggregate and analyze counts of events from sessions to see in graphical form which offers, download items and navigation menu items are most popular over a given period.

Untitled 2.010.png

If you’d like to understand more about how MJR Analytics can help you better understand what visitors are looking at and interacting with on your website then drop me an email; or even better, if you’re in Copenhagen on November 22nd and have the evening free, come along to the free meetup we’re hosting together with our partners CIMA and Looker and you’ll see me demo this all live, in person, and show you how it works.

Introducing MJR Analytics … and How Two Years Go So Fast When You’re Learning Something New

Today I’m excited to be launching MJR Analytics, a new consulting company focusing on modern, cloud analytics projects using technology from Looker, Qubit, Fivetran, Oracle and Snowflake and development techniques learnt from my time working as an analytics product manager at a startup in London.


Our new website (and no that’s not me sitting in the chair)

So what have I been up to in the two years since I left my old consulting company, and how has that experience and the way I’ve been working with analytics technologies over that time inspired me to start another one?

Two years ago I announced on Twitter that I’d left the company I’d co-founded back in 2007 and intended to now take on a new challenge, and then spent the rest of the week at Oracle Open World cycling over the Golden Gate Bridge and coming back on the ferry and trying to decide what that challenge might actually be.


For most of my time in the IT industry I’d been working on projects implementing products from vendors such as Oracle, and I’d always been interested in how these products came to market, how software vendors came up with a strategy and roadmap for those products, and how the teams behind those products worked with the engineers who built them.

I’d also become increasingly interested in the startup world, and towards the end of my time at Rittman Mead had taken on an informal role advising Gluent, Tanel Poder and Paul Bridger’s product startup, who were building software that enabled big enterprise customers to offload their data warehousing workloads from expensive proprietary databases onto cheap, flexible storage and processing running on Hadoop clusters.

What appealed to me about working more formally with Gluent was the opportunity it gave me to work with two smart founders and an even smarter development team developing a product built entirely on big data technology I’d until then only scratched the surface with on consulting gigs. The product marketing role I took on was all about establishing what market that product was primarily intended for, how we went about positioning the product to appeal to that market and how we then brought that product to market.

Understanding these four things is crucial if you’re going to actually get customers to buy your startup’s product:

  • who is the buyer

  • what problem does your product solve

  • what is the value of solving that problem, and

  • why yours is the first product to solve it for them

otherwise you’ll spend your time building a solution to a problem that nobody actually has, and that’s the reason the majority of tech startups end up failing. Solving a problem that a market is willing to pay you money to solve is called “product/market fit”, and if your product has it, and you’ve built your business such that it scales linearly as more customers discover your product, then you’re going to make a lot more money than a consultancy constrained by how many hours in the week your consultants can work and the upper limit on how much you can charge for a single person’s time.

I also learnt the distinction between product marketing, product management and product development in my time at Gluent. Going back to my time as a consultant attending product roadmap sessions at conferences I never quite knew which parts of the product team those speakers came from, but in summary:

  • Product Marketing is about taking a product that’s typically already built and then deciding the product’s positioning and messaging, then launching the product and ensuring the sales team, sales engineering and customers understand how it works and what it does; as such, this is a marketing role with a bit of technical evangelism thrown in

  • Product Development is the actual building of the product you’re looking to sell, and requires an engineering skillset together with the inspiration that typically came up with the product idea in the first place and an entrepreneurial side that made you want to build a company around it

  • Product Management is more of a customer-facing role and is about understanding what your customers want and what their problems and use-cases are, and then creating a strategy, roadmap and feature definition for a product that will meet those needs

Despite my undoubted product marketing skills based around PowerPoint and internet memes:


In product marketing, it’s never too soon to put a Santa hat on a photo of the founder

in the end I realised that it was product management that interested me the most and, after a couple of meetings with an old friend who used to run product management for Oracle’s business analytics product line and who had recently moved to London to lead the product team at Qubit, a technology startup created by four ex-Googlers building marketing technology products based around Google’s big data and cloud technology, I joined their team later in 2016 as product manager responsible for the analytics features on their platform.

I spoke about Qubit and the partnership we established with Looker back in May last year at Looker’s JOIN 2017 conference in San Francisco, and the slide deck below from that event goes into the background to the product and the problem it solves: helping customers using Qubit’s personalization platform make more effective use of the data we collected for them.

The product and data engineering teams at Qubit did an excellent job bringing together the features for this product and in hindsight, the bits I was most proud of included:

  • The business metadata layer we created on Google BigQuery and Google Cloud Platform to translate an event-level normalized many-to-many data model designed for fast data ingestion into a denormalized, dimensional data model designed for easy use with BI and ETL tools

  • Additional integration we created for the Looker BI tool including a set of industry vertical-specific Looker models and dashboards we then made available on the Looker Block Directory and in a Github public repo


Screenshot from Personalization Analytics Block for Looker by Qubit

  • The multi-tenant data warehouse and hosted Looker instance we then put together to enable customers without their own Looker instance to make use of their data in Google BigQuery, doing so in a way that supported per-tenant extensions and customizations by the customer or their implementation partner.


Technical Architecture for Live Tap as presented at Looker JOIN 2017

What I’ll take away from my time at Qubit, though, was the incredible amount that I learnt about product management, product engineering and how to build and run a successful startup and team who are still highly-motivated seven years in, and the introduction it gave me to the analytics and data-led world of digital marketing, eCommerce and modern data analytics platforms.

Consulting is a popular route into product management and the experience I brought to the role in areas such as business metadata models, analytical techniques and the needs of BI and ETL developers proved invaluable over the eighteen months I worked as part of Qubit’s product and engineering teams, but moving into product management within a young, technology-led startup founded by ex-Googlers and working with some of the smartest and most innovative people I’ve ever met involved learning a whole new set of skills including:

  • Developing on a new technology platform (Google Cloud Platform) within a new industry (eCommerce and digital marketing) and understanding a whole new set of analytics use-cases and customer roles (A/B testing, stats models and event-based analytics used by analysts and strategists within eCommerce businesses) that I described in a presentation at last year’s UK Oracle User Group Tech Conference in Birmingham:

  • Working as part of a team rather than directing that team, and managing up as well as down, a technique I had to relearn pretty quickly in my first few months in the role

  • Learning to achieve my goals through influence rather than in the top-down way I’d been used to getting things done leading customer projects, and as CTO and owner of the company that team worked for

  • Saying no to customers rather than yes, as you would as a consultant, because your objective is to build a product that solves the most important customer needs without taking on so many features addressing niche use-cases that you end up with Homer’s car and can’t innovate the product in future releases

  • How to take a product through its lifecycle from identifying a need that makes sense for your company to meet, through prototyping, alpha and beta releases to successful first launch and then creating a strategy and roadmap to manage that product over its complete lifecycle

  • How to use a new generation of modern, cloud-native data analytics tools such as Looker together with products such as FiveTran, Google Cloud Platform, Qubit, Snowflake DB and Snowplow Analytics that were increasingly also being adopted by the FinTech, MarTech and B2C startups clustering in London and other European/North American tech hubs

I learnt so much from my colleagues at Qubit about products, engineering and building a successful and motivated team that put up with my jokes and built the most technologically-advanced marketing personalization platform on the market.

But what my time at Qubit also made clear to me was that, when it came down to it, what really motivated me to get up in the morning, learn all these new technologies and still be wildly excited to come into work twenty years later was:

  • using data and analytics to find new insights and uncover new opportunities in a customer’s data set

  • working with individual clients, over time, to enable them to find more of those insights and opportunities themselves

  • finding new innovations in analytics technologies and in how we deliver projects, to make this process cheaper, faster and more likely to be successful

  • and building a team, and crucially a business, to do all of this at scale and offer a full set of analytics-related consulting services built around modern analytics tools and delivery techniques

Which is why, after two years away from the consulting business and two enjoyable, rewarding and enlightening years working on the other side of the data and analytics industry, I’m now launching my new consulting company, MJR Analytics; I hope to be working with many of you as clients or members of our team over the coming months and years.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.3

In this series of three blogs on Oracle Analytics Cloud Data Lake Edition I’ve set up an object store data lake in Oracle Cloud using Oracle Big Data Cloud and Oracle Storage Cloud, and ingested streams of real-time event data from IoT and social media sources into Oracle Cloud’s object storage service using Oracle Event Hub Cloud Service.

The event-stream data I staged into Storage Cloud was then copied into parquet files on HDFS and then presented out to BI and ETL tools through Big Data Cloud’s Thrift Server interface, so that now I’m ready, after a short diversion into defining the data engineer role that would typically work with this new product edition, to start exploring some of Oracle Analytics Cloud Data Lake Edition’s new data flow and predictive model preparation features.

The diagram below shows where OAC Data Lake Edition fits into my project architecture, performing the tasks of transforming and enriching the incoming dataset and then presenting my at-scale data out to end-users for analysis using OAC Data Lake Edition’s Data Visualization features.

Looking at the homepage within OAC Data Lake Edition I can see my two Hive tables listed within the dataset catalog, alongside other datasets I’d uploaded directly into OAC. This visual catalog of available datasets is also the new homepage interface that OBIEE12c now adopts, with both cloud and on-premises versions of Oracle’s BI tools now relegating the old “Answers” homepage to something you have to dig around and specifically look for in favour of this more self-service Data Visualization starting page.

I’ll have to write a blog post sometime on Answers, how powerful its interface is, and the full dimensional model it exposes from the Oracle BI Repository, as it’s almost in danger of being forgotten about.

Moving on though, the first transformation I need to do on all the incoming datasets is to take the timestamp column in each table and convert it to a format that OAC recognises as a valid TIMESTAMP datatype format, then convert those columns to TIMESTAMPs so that DV can automatically enable time-series analysis by day, month, quarter, hour and so on. I do that using a feature that’s also present in OAC Standard Edition, the lightweight data preparation interface that’s presented to users when they first add a new data source into OAC’s dataset catalog, shown in the screenshots below.
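The conversion itself is simple once you know the string format OAC and Hive will accept as a TIMESTAMP; assuming the incoming column holds epoch-seconds values (an assumption on my part, your source format may differ), you can check what the target “YYYY-MM-DD HH:MM:SS” representation looks like with GNU date:

```shell
# Sketch: render an epoch-seconds value in the "YYYY-MM-DD HH:MM:SS" string
# format that Hive and OAC recognise as a valid TIMESTAMP.
# The sample epoch value and epoch-seconds granularity are assumptions.
epoch=1530000000
date -u -d "@${epoch}" +"%Y-%m-%d %H:%M:%S"   # prints 2018-06-26 08:00:00
```

The same conversion could alternatively be pushed down to the Hive layer with `from_unixtime()` if you’d rather fix the format at source instead of in OAC’s data prep step.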

Where OAC Data Lake Edition gets really interesting right now, both in terms of differences vs. the on-premises versions of OBIEE I used to use and in terms of its “data engineering” potential, is with a feature called Data Flows.

Most self-service BI tools now have a basic data loading and data preparation capability, with Tableau Data Prep being one of the latest examples. Designed to handle more complex data prep use-cases than basic datatype changes and field-splitting, they give end-users the ability to do this type of work themselves rather than trying to do it in Excel or handing the work off to the IT department and having to wait days or weeks to get that data back.

Data Flows are a feature introduced since the original on-premises version of OBIEE12c that I last used when working out in consulting, and they provide you with what’s effectively a lightweight, multi-step ETL tool that executes transformations using the BI Server’s Model Extension feature, introduced back when OBIEE12c first came out as the mechanism for enabling on-the-fly data mashups between server-side and user-uploaded datasets.

Looking at the transformation operators available in OAC Data Lake Edition v4 there are quite a few that apply to data lake and data engineering-type workloads, including running Python statistical analysis scripts and predictive model training and model build; there’s also an operator for creating an Essbase Cube, with Essbase in this instance positioned as a fast ad-hoc analysis back-end for use with the data visualization part of OAC.

For now though there are two transformation tasks I want to do with my Hive datasets; the first is to enrich the incoming social media data by analyzing the sentiment in each tweet and then write the data, plus this sentiment tagging, back to the Oracle Big Data Cloud environment, so that I can then turn those sentiment tags into a score and create a data visualization showing who sends me the most tweets and how crazy they are overall.

The second data enrichment I wanted was on some Strava cycling workout data I’d uploaded directly into OAC using the CSV file upload facility; using the model train and build Data Flow operators I defined a model to predict how many “kudos”, the Strava equivalent to Facebook “likes”, I’d get for a given cycle workout with a number of different variables available to the model in order to make the prediction — for example, distance and elevation gain, map location, effort expended and so on.

Then, after running the model build step and looking at the predicted values and the actual ones for the remainder of the dataset not used for model training, you can see the predicted kudos values are fairly in-line with the ones I actually recorded for those rides.

Another feature that’s now in Oracle Analytics Cloud is automated data diagnostics, or Explain. Explain uses machine-learning libraries and that same model extension/XSA BI Server framework to help users quickly understand the value distribution and statistically correlated driving factors for a particular dataset, and learn which segments or cohorts have the highest predictive significance. Enabled by a bunch of extensions to BI Server logical SQL, I used the feature first on the sentiment scoring I’d performed earlier on, and then on the steps data I’d brought into Oracle Big Data Cloud from my Fitbit device, after converting the numeric step counts into a text attribute by bucketing its values into low, medium and extreme bucket values.
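The bucketing itself was done in OAC’s data preparation interface, but the underlying logic is just a pair of thresholds; here’s a minimal sketch of it in awk (the threshold values and sample step counts are my assumptions for illustration, not the exact ones I used):

```shell
# Sketch: bucket numeric daily step counts into low/medium/extreme text labels
# so that Explain can treat them as an attribute rather than a measure.
# Thresholds and sample values are assumptions.
printf '3200\n8500\n15100\n' | awk '{
  if ($1 < 5000)       bucket = "low";
  else if ($1 < 12000) bucket = "medium";
  else                 bucket = "extreme";
  print $1, bucket
}'
```

The same conditional could equally be expressed as a CASE expression in a Hive view or as an expression column in an OAC data flow.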

This is pretty powerful stuff, with automated understanding and context-gaining about new datasets being one of the most user-enabling features I’ve seen arrive recently in BI tools with the best example of this being BeyondCore, now part of Salesforce Einstein. OAC lets the user pick the most useful of the Explain facts and driver insights and publish them to a Data Visualization dashboard like the one below, showing the most predictive and significant variables in my dataset that influence the steps I take each day.

Which leads neatly to the final “data at-scale” feature in OAC, the Data Visualization feature that in my case is querying the ingested, transformed and now enriched datasets I’ve got running on my Oracle Big Data Cloud instance alongside Oracle Event Hub Cloud and Oracle Analytics Cloud Data Lake Edition.

Thank you once again to the Oracle ACE Director program for providing access to Oracle Analytics Cloud Data Lake Edition, Oracle Big Data Cloud and Oracle Event Hub Cloud services over the past few weeks. If you’re looking to try these new services out there’s free trials available for most of Oracle’s Cloud products and many of the new features are also available in Oracle Data Visualization Desktop 12c and Oracle Business Intelligence 12c, both of which can be downloaded for training and evaluation under the OTN license scheme.

Wrapping-up this three part series on Oracle Analytics Cloud Data Lake Edition and Oracle Big Data Cloud I’d like to go back to the two (serious) questions I asked myself at the end of the previous post:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product, Oracle Big Data Discovery: officially still available and not quite dead yet, but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask, and he’s actually gone to Cloudera?

So has Oracle Analytics Cloud Data Lake Edition got much to do with actual “data lakes”? Well … it integrates with Oracle Big Data Cloud and apparently comes with an option to run those data flow transformations in Big Data Cloud’s Apache Spark environment, though to be fully transparent I didn’t see that as an option when doing my evaluation so can’t comment on how well or not that works.

Like Oracle Big Data Discovery before it, OAC Data Lake Edition makes you structure your incoming event stream data with Hive table metadata before you can work with it, but that’s actually fairly standard practice with most data visualization tools that work with Hadoop and data lake environments.

Having Essbase in this product package, alongside the data lake functionality, did make me scratch my head a bit and wonder, “why?” — data lakes and Essbase are about as opposite as you can get in terms of target users and use-cases, and I think this Data Lake Edition is as much about creating a product package and price point mid-way between the OAC Standard and Enterprise Editions as anything else.

But there is some logic to having Essbase in this edition; it provides a set of easy-to-use loading and preparation tools for Essbase making it easier for customers new to that product to start using it, and Essbase with its latest hybrid ASO/BSO storage format is surprisingly scalable and blindingly-fast to query, a great potential back-end for enabling data analysis “at-scale” using Oracle Analytics Cloud’s data visualization features.

I also get the feeling that this initial v4 version of OAC Data Lake Edition is more of an initial first-cut release to get something out to customers, establish the product package and validate the roadmap and market assumptions. Oracle Analytics Cloud v5 isn’t too far off and I’d expect incremental improvements and new features in areas such as natural language processing and machine learning built into the developer experience; I wouldn’t be surprised to see Oracle Big Data Preparation Cloud making its way into the product given its obvious fit and overlap with Data Lake Edition’s data prep features.

But where I really see an interesting future for OAC Data Lake Edition is when it starts to integrate product features and the development team from Oracle’s recent acquisition of Sparkline Data.

I came across SNAP, Sparkline’s platform for building OLAP-style dimensional models over data lakes and cloud object storage layers, about a year ago when researching analytics platforms at Qubit and quite frankly, it’s as revolutionary in terms of today’s data lake analytics market as OBIEE (or the nQuire Server) was back in the late 1990s with its virtual data warehouse over application and data warehouse sources.

Take these two slides from the Sparkline website and imagine them as the future of Oracle Analytics Cloud analyzing event-streams and big data cloud-hosted datasets…

and you can see why I’m keen to see where Oracle Analytics Cloud Data Lake Edition goes over the next couple of years. I’m speaking on OAC Data Lake Edition at ODTUG KScope’18 in Orlando in just a couple of weeks’ time, so come along if you’re there; it should be an interesting talk.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.2

A couple of weeks ago I posted a blog on Oracle Analytics Cloud Data Lake Edition and the Oracle Cloud big data products you’d typically use to create a cloud-hosted, object storage data lake. In this and the final post in the series I’m going to look at some of the data transformation and predictive model-building features that come with the Data Lake Edition of OAC and extend the base capabilities of OAC Standard Edition in this area, along with the data engineering role that this Oracle Analytics Cloud product packaging option is designed for.

As a reminder, OAC Standard Edition is more-or-less Oracle Data Visualization running as a single-tenant cloud-hosted VM (cluster) that’s either customer managed, or Oracle-managed if you go for the new Autonomous Analytics Cloud that interestingly is charged at a lower hourly rate than the customer-managed version; presumably it costs Oracle less to run OAC when it’s doing all the administration rather than letting customers control the settings, which makes sense.

It does make you wonder when all this VM nonsense will go away, though, along with the distinction between OAC and the DVCS/BICS managed-service cloud analytics products, with all of these services just becoming packaging options for a multi-tenant service partitioned by customer instance, as the likes of Looker and Google Cloud Platform do today.

But for now, Oracle Analytics Cloud Data Lake Edition is described on the product homepage as having the following high-level capabilities compared to OAC Standard Edition:

  • Accelerated analysis via Essbase
  • Support for creating scalable, shareable, high-performance data flows

whilst the product documentation states that it includes all the OAC Standard Edition features, plus:

  • Explore and replicate data
  • Create scalable data flows including ingestion, transformation, and persistence
  • Execute data flows by using Apache Spark or Oracle Database Cloud

with two edition-specific scenarios listed:

  • Explore Data Lakes, described as “You can create data flows to visualize data in your data lakes”, and
  • “Replicate Data in Data Lakes”, explained as “You can replicate data from Oracle Service Cloud (RightNow) into Oracle Big Data Cloud Service, Oracle Database Cloud, and Oracle Database. For example, you can replicate data from customer service records in Oracle Service Cloud into your data lake on Oracle Big Data Cloud Service and then perform further data preparation and analysis with data flows and data visualization projects.”

I’m guessing most readers won’t have heard of Oracle Service Cloud, but for Oracle Cloud customers it’s quite a useful initial integration, as Service Cloud is Oracle’s cross-channel digital customer call centre application that brings together customer service interactions across web, social media and offline service channels. More interestingly though (and you may have missed it back at the start), Data Lake Edition is the edition that ships Essbase, Oracle’s multi-dimensional OLAP server, in what almost seems like an afterthought right now but will become, I suspect, a more significant defining feature of this product package as time goes on.

For the purposes of this series of blog posts though, what I’m really interested in are the data engineering features in OAC Data Lake Edition — but before I start looking into what they might be, I guess I’d better explain what I mean by a data engineer.

I first came across the term data engineer in a blog written by Curt Monash — still the #1 person I’d love to have come on the Drill to Detail Podcast, if you happen to read this Curt — where he wrote:

“I learned some newish terms on my recent trip. They’re meant to solve the problem that “data scientists” used to be folks with remarkably broad skill sets, few of whom actually existed in ideal form. So instead now it is increasingly said that:

- “Data engineers” can code, run clusters, and so on, in support of what’s always been called “data science”. Their knowledge of the math of machine learning/predictive modeling and so on may, however, be limited.

- “Data scientists” can write and run scripts on single nodes; anything more on the engineering side might strain them. But they have no-apologies skills in the areas of modeling/machine learning.”

I think it’s probably true to say that the term really started to enter people’s consciousness, at least the sorts of people I hang around with, when Maxime Beauchemin from Airbnb wrote his seminal blog “The Rise of the Data Engineer” where he introduced this new discipline as:

“Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization.

Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.

In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.

In smaller companies — where no data infrastructure team has yet been formalized — the data engineering role may also cover the workload around setting up and operating the organization’s data infrastructure. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like.”

Maxime’s blog struck a chord with me as that’s pretty much how I’d describe what I do, or at least it’s the work I most enjoy doing: building big data systems to support BI and data warehousing workloads, where my knowledge of how Hadoop, cloud and distributed data platforms work complements the knowledge that other colleagues have around data science, machine learning and statistical analysis; importantly, though, I know enough about those areas to be able to deploy the models they build (or, as we say over in the UK at least, I know enough about them to be dangerous).

I interviewed Maxime about his blog and the topic of data engineering on the Drill to Detail Podcast last May and if you’ve not done so already I’d recommend you give the episode a listen; for me though, having just spent the last eighteen months working in the London tech startup scene, my definition of a data engineer would be someone who’s just out of university and has an ironic beard, thinks he’s made ETL tools obsolete but is slowly, painfully, recreating one and is now just on the brink of discovering code templating, deployment contexts, change control and data lineage…

… but earns twice what you earn and gets all the girls. Because, data engineers are even more essential than data scientists if you’re serious about building a data lake, and data lakes are the new data warehouses.

Which brings us back to Oracle Analytics Cloud Data Lake Edition and these three blog posts. The first part of this three part series talked about Oracle Big Data Cloud and products such as Oracle Event Hub Cloud — Dedicated (who names these products at Oracle? Seriously?) and in this second post I seem to have got rather diverted by trying to define what Oracle Analytics Cloud Data Lake Edition actually is, and then even more so by defining the supporting role of a data engineer; in the final post tomorrow I’ll take a look at some of the data transformation, model definition and deployment and data engineering features in OAC Data Lake Edition and then, finally, conclude by asking myself two questions — OK, three:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product, Oracle Big Data Discovery: officially still available and not quite dead yet, but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask, and he’s actually gone to Cloudera?
  3. Will the ironic beard and quiff be enough, or do I need to go full ginger if I’m going to really be taken seriously as a data engineer?

Final part in the series on Oracle Analytics Cloud Data Lake Edition comes later this week.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.1

Some time ago I posted a blog on what analytics and big data development looked like on Google Cloud Platform using Google BigQuery as my data store and Looker as the BI tool, with data sourced from social media, wearable and IoT data sources routed through a Fluentd server running on Google Compute Engine. Overall, the project architecture looked like the diagram below…

… and I’ve got more-or-less the same setup running right now, with an additional GCE VM running Confluent Open Source to feed a subset of the event streams into a Druid cluster that I’m using to test out Looker, Superset and Imply for sub-second ad-hoc query analysis use-cases. More on that soon.

If you’re a regular reader of this blog you might recall a week or so ago I posted a blog on the new releases of Oracle Business Intelligence Enterprise Edition (OBIEE), Oracle Analytics Cloud (OAC) and Oracle Data Visualization Desktop (DVD) and mentioned a new packaging option for their cloud analytics product, Oracle Analytics Cloud Data Lake Edition. I’ve got a particular interest in what this product might be as I used to use the product it replaces, Oracle Big Data Discovery (BDD), fairly extensively in Oracle big data analytics and data lake projects a few years ago.

And Oracle Big Data Discovery was — technically at least — a great product. It combined the search and analytics features of Endeca Information Discovery with the scale and data transformation abilities enabled by Apache Spark and Hadoop, but suffered perhaps by being a bit ahead of the market and by not having any obvious integration with the rest of Oracle’s analytics and data management tools. By contrast Oracle Analytics Cloud Data Lake Edition is one of three packaging options for Oracle Analytics Cloud and includes all of the functionality of OAC Standard Edition (Oracle Data Visualization together with basic data preparation tools) as well as itself being a subset of the wider set of analysis, dashboarding and enterprise data modeling features in OAC Enterprise Edition.

An equivalent product architecture for ingesting, transforming and analyzing my IoT, wearables and social media data in Oracle Cloud would look something like the diagram below, with the following Oracle Cloud Platform-as-a-Service (PaaS) products used for ingest, storage and analysis:

  • Oracle Event Hub Cloud Service: Apache Kafka running either customer or Oracle-managed with full access to Kafka’s REST Proxy and Kafka Connect
  • Oracle Big Data Cloud: Oracle’s new elastically-scalable Hadoop platform running Apache Spark, Ambari and other Hortonworks Data Platform components
  • Oracle Analytics Cloud Data Lake Edition: Oracle Data Visualization combined with more extensive lightweight ETL (“data flow”) components, text analytics and machine learning model training and build capabilities

In this example I’m using Apache Hive and Parquet storage as my column-orientated data platform but of course I’ve now also got Oracle Autonomous Data Warehouse Cloud as an option; I’ll stick with Hive on Oracle Big Data Cloud for now though as this gives me the option to use Apache Spark to transform and wrangle my data and to build machine learning models using SparkML and, via pySpark, Python Pandas. In this first post of the series I’ll look at how the data pipeline is set up, and then in the next I’ll look at Oracle Analytics Cloud Data Lake Edition in detail, focusing on the data transformation, data engineering and data science features it adds beyond OAC Standard Edition.

The development environment I put together for this scenario used the following Oracle Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) components:

  • Oracle Compute Classic and Storage Classic Services
  • Oracle Database Cloud Service, with the 11g database option
  • Oracle Event Hub Cloud Service Dedicated, with Kafka Connect and REST Proxy nodes
  • Oracle Big Data Cloud, single node with Hive, Spark 2.1, Tez, HDFS, Zookeeper, Zeppelin, Pig and Ambari
  • Oracle Analytics Cloud Data Lake Edition with Self Service Data Preparation, Visualization and Smart Discovery (aka Oracle DV)

The screenshot below from the Compute Classic Service Console shows the various PaaS VMs running on the compute IaaS layer, with other tabs in this console showing network, storage and other infrastructure service usage.

The order in which you install the services is important if you want to associate the various products together correctly, and if, like me, you’re using a trial account you’ll need to plan carefully to keep within the various quota limits that Oracle Cloud imposes on trial identity domains; those limits stopped me, for example, from allocating the usual two OCPUs to the main Oracle Event Hub Kafka server if I wanted to run the rest of the stack at the same time.

Associating two services, for example Oracle Event Hub and Oracle Big Data Cloud, connects them together automatically in the identity domain network and makes using them together much simpler. However, unless Event Hub Cloud is provisioned and available when you come to install Big Data Cloud, you can’t go back and associate them afterwards. More annoyingly, if you decide you want to use Event Hub Cloud and associate it with Big Data Cloud, but you’ve since provisioned Oracle Analytics Cloud Data Lake Edition and associated Big Data Cloud with that, you have to unwind the whole provisioning process and start again with Event Hub Cloud if you want to connect them all together properly. And forget deleting that Database Cloud Service you associated with Oracle Big Data Cloud and then forgot about, as you can’t delete services that other services are associated with.

Provisioning each of the services involves giving the service instance a name, assigning storage buckets and OCPU counts to the various cluster nodes you request, and at key points selecting previously provisioned and now-running services for association with the one you’re now provisioning. The screenshots below show the three-stage provisioning process for Event Hub Cloud Service — Dedicated:

Provisioning a new Event Cloud Service — Dedicated cluster

and the order in which I provisioned my data lake services, and the important options I chose to make it all work together and within quota, were as follows:

  1. First ensure you have access to Oracle Cloud, and then the various IaaS services: Oracle Cloud Infrastructure Compute Classic, Oracle Cloud Infrastructure Object Storage Classic and Oracle Identity Cloud Service
  2. Provision Oracle Database Cloud Service to store the various Fusion Middleware RCU schemas; in my case I chose Oracle Database 11gR2 as the database type as it avoids the complexity around CDBs and PDBs you get with the 12c database release
  3. Then provision Oracle Event Hub Cloud — Dedicated with one OCPU for the main Kafka VM, one for the REST Proxy VM and one for the Kafka Connect VM. For a real deployment you’d want at least two OCPUs for the Kafka service VM, but using just the one kept me within my overall OCPU quota limit when installing the rest of the stack
  4. Next step is to provision Big Data Cloud with a single node with the minimum 2 OCPUs, the Full deployment profile and version 2.1 of Apache Spark as that’s the version OAC Data Lake Edition insists on in the next step. When prompted, choose the option to associate Event Hub Cloud and Database Cloud with Big Data Cloud as you won’t get the option to do this again after the initial service provision; once provisioned, open up the TCP port for Ambari (8080) to the public internet so that OAC in the next step can associate with it — provisioning for OAC failed for me every time until I looked through the provisioning logs and spotted this as the issue
  5. Finally, provision Oracle Analytics Cloud and choose Data Lake Edition as the package option, again in my case assigning a single OCPU and selecting Data Lake Edition as the software version

At that point if you then bring up the Cloud Services dashboard and review the services together for the first time, it’ll look something like this:

Oracle Data Lake stack within Cloud Services Dashboard

Now it’s time to ingest some data and land it into Oracle Big Data Cloud.

The streaming IoT, wearables and social media comms data that I’ll be ingesting into Big Data Cloud will be coming in from the public internet over TCP, and I’ll also want to connect to Event Hub Cloud from my desktop using tools such as Kafka Tool, so an additional configuration step I’ll do before setting up anything else is to open up Event Hub Cloud’s Kafka broker endpoint to the public internet using the Access Rules menu item in the Event Hub Cloud console.

Now I can see the Kafka service and the default topics that Event Hub Service created for me in Kafka Tool.

I can either then use Kafka Tool to create a new topic to start receiving the first of my data streams, the IoT device event data coming out of Samsung SmartThings, or create the topic by defining a new Event Hub Cloud service from within the Event Hub Cloud Service — Dedicated console (confusing, but that’s how Kafka topics are named within Event Hub Cloud).

Then it’s just a case of directing the stream of IoT event data to the public Kafka broker endpoint exposed by Event Hub Cloud Service — Dedicated and then, after a short while, checking the metrics for the new Kafka topic that I set up to receive this incoming streaming data.
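On the producer side this is just a JSON payload written to a Kafka topic; the sketch below shows the shape of that step, assuming a kafka-python client, with the broker endpoint and topic name as placeholders rather than the real setup.

```python
import json
from datetime import datetime, timezone

def encode_event(device, metric, raw_value):
    """Serialize a SmartThings-style reading as the JSON bytes a Kafka producer would send."""
    event = {
        "date_time": datetime.now(timezone.utc).isoformat(),
        "device": device,
        "metric": metric,
        "raw_value": raw_value,
    }
    return json.dumps(event).encode("utf-8")

# Sending to the public broker endpoint would then look like this
# (requires the kafka-python package and a reachable broker; the
# host, port and topic name here are placeholders):
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="<event-hub-host>:<port>")
#   producer.send("smartthings-events",
#                 encode_event("kitchen seating area spot 1", "switch", "on"))
#   producer.flush()
```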

Getting the data off the Kafka topic and into a Hive table on the Big Data Cloud instance involved the following steps, using Oracle Cloud Infrastructure Object Storage Classic as the intermediate staging layer together with Event Hub Kafka Connect’s OCS Sink Connector:

  1. Configure the Event Hub Kafka Connect OCS Sink Connector to push topic events to Oracle Cloud Infrastructure Object Storage Classic (OCS)
  2. Using the Zeppelin notebook provided by the Big Data Cloud Console, create a CRON job that copies those events across to HDFS storage
  3. Create Hive external tables with LOCATION clauses that point to the directories I’ve copied the event files into from OCS
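To make steps 2 and 3 a little more concrete, here’s a rough sketch of the copy command and the external-table DDL involved; the table name, column list and directory paths are all hypothetical, and the exact copy mechanics depend on how OCS is exposed to the cluster.

```python
def hdfs_copy_command(staged_dir, hdfs_dir):
    """Build the copy command a scheduled notebook job could run to move
    event files staged in OCS across to HDFS (paths are hypothetical)."""
    return ["hadoop", "fs", "-cp", staged_dir, hdfs_dir]

def hive_external_table_ddl(table, columns, hdfs_dir):
    """Build a CREATE EXTERNAL TABLE statement whose LOCATION clause points
    at the HDFS directory the event files were copied into."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'\n"
        f"LOCATION '{hdfs_dir}'"
    )

print(hive_external_table_ddl(
    "smartthings_events",
    [("date_time", "string"), ("device", "string"),
     ("metric", "string"), ("raw_value", "string")],
    "/user/oracle/events/smartthings"))
```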

Then, when I go and log into OAC Data Lake Edition and connect to the Hive Thrift Server on the Big Data Cloud instance I can see the Hive tables I’ve just created, and the data that’s now streaming through from my two initial sources via the Kafka service and Kafka Connect running on Event Hub Cloud Service — Dedicated.

In the second half of this two-post series I’ll go deeper into OAC Data Lake Edition and see how its additional transformation and analysis capabilities stack up against OAC Standard Edition, and also see how it compares to the Oracle Big Data Discovery tool it’s looking to eventually replace.

Using Looker Data Actions to Make Monzo Spend Analysis More Interactive … and Actionable

The other week I posted a blog about analyzing the bank transaction data Monzo make available to customers via their developer API using Looker, the BI tool I use day-to-day in my role at Qubit as Analytics Product Manager.

One of the dashboards I’ve created shows me how spend looks across a number of merchant categories, compares their average transaction values, highlights the one I visit the most (The Ginger Dog in Brighton, thoroughly recommended) and plots each of them on a map so I can remember where they’re each located.


If I click on one of the map icons on the Spend by Location visualization, Looker then pops up a dialog showing me the individual column values for this particular merchant — the Hardwood Arms in Fulham, one visit and total/average spend amounts of £75 exactly.


As well as tagging your transactions with metadata from the Foursquare web service, Monzo also adds the Twitter handle for each merchant along with their website URL. I’ll start to make this dashboard a bit more interactive, then, by adding these as link parameters to my merchant_name dimension definition in my LookML view definition.

dimension: merchant_name {
    group_label: "Monzo Transactions"
    label: "Merchant Name"
    type: string
    sql: ${TABLE}.merchant_name ;;
    link: {
      label: "Merchant Website"
      url: "https://{{ fluentd_monzo.merchant_metadata_website._value }}"
      icon_url: ""
    }
    link: {
      label: "Merchant Twitter Profile"
      url: "{{ fluentd_monzo.merchant_metadata_twitter_id._value }}"
      icon_url: ""
    }
}

Monzo also provides Google Places API and Foursquare API merchant metadata lookup codes along with each transaction record, so I add two more link parameters to call these two web services.

link: {
  label: "View in Google Maps"
  url: "{{ fluentd_monzo.merchant_address_latitude._value }},{{ fluentd_monzo.merchant_address_longitude._value }}&query_place_id={{ fluentd_monzo.merchant_metadata_google_places_id._value }}"
  icon_url: ""
}
link: {
  label: "View in Foursquare"
  url: "{{ fluentd_monzo.merchant_metadata_foursquare_website._value }}"
  icon_url: ""
}

Now when I click on the name of a merchant within any data visualization using that data point I get a menu of links I can click on for more information.


Clicking on the View in Google Maps link passes the merchant’s latitude, longitude and Google Places ID across to the Google Maps web service, and Looker then opens a new browser tab and gives me all the information I’m looking for using the Google Places web service.


All of this is good to know but it’s only really of value if I can make this data actionable, for example by linking it to the Google Tasks web service I use to record personal, non-project related tasks and reminders. To do this I’m going to use Looker’s Data Actions feature together with Zapier’s integration with Google Tasks to tie it all together.

First, I create a Zapier Zap that listens for a webhook call and then uses the parameter values I provide from the Looker data action to create a new Google Task. This article on the Looker Discourse website goes into more detail on how to set up Zapier webhooks integration with data actions if you’d like to set up something similar yourself.

action: {
  label: "Add Google Task"
  url: ""
  icon_url: ""
  form_param: {
    name: "Task Name"
    type: string
    required: yes
    default: "{{ merchant_name }}"
  }
  form_param: {
    name: "Due Date"
    type: string
    required: yes
  }
  form_param: {
    name: "Action"
    type: textarea
    default: "Book table at {{ merchant_name }}"
  }
  form_param: {
    name: "Urgency"
    type: select
    option: {
      name: "Low"
      label: "Low"
    }
    option: {
      name: "High"
      label: "High"
    }
  }
}

The action parameter above, added into my merchant_name LookML dimension definition, defines a form with default values, lets me amend those values if needed, and then HTTP POSTs them as the payload to the Zapier web service.
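Under the covers a data action is just an HTTP POST of the form values to the url in the action definition; roughly, the body handed to the Zapier webhook can be sketched like this (the keys mirror the form_param names in the action, and the payload shape Looker actually sends is simplified here):

```python
import json

def zapier_payload(form_values):
    """Shape the data-action form values into a JSON body for the Zapier
    webhook; the keys on the right are the form_param names from the action."""
    return json.dumps({
        "task_name": form_values["Task Name"],
        "due_date": form_values["Due Date"],
        "notes": form_values["Action"],
        "urgency": form_values.get("Urgency", "Low"),
    })

# POSTing it (urllib shown to keep the sketch dependency-free; the
# webhook URL is a placeholder, not a real endpoint):
#
#   from urllib.request import Request, urlopen
#   req = Request("https://hooks.zapier.com/hooks/catch/<id>/",
#                 data=zapier_payload(values).encode("utf-8"),
#                 headers={"Content-Type": "application/json"})
#   urlopen(req)
```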

Now, another menu link item is displayed to allow me to create Google Task items when I click on merchants in a Monzo transactions dashboard, like this:


When I click on that link a dialog is then shown, giving me the option to amend the task title and details and add a due date and urgency to the task I’m about to define.


Finally, I press Submit and then go over to the Google Tasks web interface to see my new task recorded and ready for me to action tomorrow morning.

Updates to Oracle Analytics Cloud, Oracle BIEE 12c and Oracle DV Desktop

Late last year I covered the v3 release of Oracle Analytics Cloud (OAC) on this blog; since then the product has had a major UI refresh with OAC v4, so I thought it’d be worth covering this new release, along with updates to OBIEE12c and DV Desktop, in a follow-up to last year’s post.

OAC’s original look and feel was based on OBIEE12c v1, which itself contained the first iteration of Oracle Data Visualisation (DV). Whilst the modern flat UI that release introduced was an improvement on the 11g release before it, Oracle have since iterated rapidly with Oracle DV Desktop and DV Cloud Service, and the UI from those products is now the primary user interface for Oracle Analytics Cloud v4.

If you’ve not used Oracle Analytics since OAC v3 and the initial OBIEE12c versions, you’ll be surprised how much the user experience has changed; when OBIEE12c first came out most of us still considered Answers and Dashboards to be the primary reporting UI, with the BI Repository being the central, carefully governed and dimensionally-modelled source of all reporting data.

Since then the world has moved on, and analytics is now all about empowering the end-user through self-service, easy-to-use tools that enable you to do most tasks without having to involve the IT department or do lots of up-front data modeling and data onboarding. Oracle DV Desktop started meeting this need by introducing basic data preparation features into the product, so users could upload spreadsheets and other data and do some basic formatting and tidying before analyzing it on their desktop, and those data upload and preparation features are now available for OAC users in the v4 release. To upload a spreadsheet file into OAC you now just drop it onto the web page; you’re then given a preview and the option to upload its contents to the analytics instance.

After that you can trim, split, derive and change the datatypes of data you upload so that it works in the most optimal way with the analysis features of the product, for example by turning dates in string formats not recognized by Oracle DV into proper date datatypes that can then be used in time-series analysis.

For data transformation needs that go beyond basic formatting and datatype changes you can now build ETL-style routines that filter, aggregate, transform and add external datasets all using simple point-and-click user tools, and if you’ve licensed the new OAC Data Lake Edition these transformations then extend to include model training, forecasting and sentiment analysis. In the example below I’m using OAC Data Lake Edition to forecast weight forward from a set of smart scale readings that’s then executed at the BI Server level using Python maths and statistics libraries.
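As a back-of-the-envelope illustration of the kind of forecast involved, a least-squares linear trend extrapolated past the last reading looks like this in plain Python; the sample readings are made up, and this says nothing about the actual algorithms or libraries OAC uses under the covers.

```python
def linear_forecast(readings, steps_ahead):
    """Fit a least-squares straight line to (day, weight) readings and
    extrapolate it steps_ahead days beyond the last observation."""
    n = len(readings)
    xs = [x for x, _ in readings]
    ys = [y for _, y in readings]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in readings)
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (xs[-1] + k) for k in range(1, steps_ahead + 1)]

# e.g. daily smart-scale readings trending down by ~0.1kg/day
print(linear_forecast([(0, 80.0), (1, 79.9), (2, 79.8), (3, 79.7)], 3))
```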

I’ll actually be talking about OAC Data Lake Edition in my presentation at ODTUG KScope’18 at Walt Disney World Florida this June, where I’ll cover some of these extended data flow features along with Oracle Big Data Cloud and Oracle Event Hub Cloud Service in my session “BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition”.

Cloud of course is great, but not every Oracle Analytics customer has made the move yet; or, like myself, you might have a cloud instance you can spin up as-and-when you need it, but use regular on-premises OBIEE12c in a VM for your day-to-day learning and experimentation.

With Oracle’s cloud-first strategy and its focus on Oracle DV as the main innovation path for analytics features, my OBIEE12c instance running in a Google Compute Engine VM was starting to lag behind OAC v4 and DV Desktop in terms of look-and-feel and all of these new features, so I was pleased to note the other day that the latest (aka v4) release of OBIEE12c is now available for download on OTN, including many of the features currently available in OAC v4 and DV Desktop (but not the advanced data flow features in OAC Data Lake Edition, for example)

Oracle Data Visualization Desktop has also been updated recently to connect to Oracle Autonomous Data Warehouse Cloud and I’ll be covering that new capability in a blog post in the near future; but more importantly for me at least….

… it also now supports Mac OS X 10.13 High Sierra, great news for us Mac users who had to stop using DV Desktop on our laptops late last year when that most recent update to OS X came out but DV Desktop didn’t yet support it.

Timeline Charts, Derived Tables and Analytic Functions in Looker 5

One of the data streams that comes into my personal Google BigQuery data warehouse is a record of all the Hue light switch settings and illuminance levels, Nest thermostat settings and motion sensor readings coming in via the Samsung Smart Things hub under the stairs back home. For the Hue lightbulbs I get a stream of 1s and 0s that initially gets stored in a four-column table in BigQuery, like this:

I then create a SQL view over that table that joins to a lookup table containing the device type and room allocated for each of the devices in the data stream, and sets a metric type for each reading so that I can analyze all the thermostat readings together, bring together all the light switch settings, and so on.

It’s then quite straightforward to bring this SQL view into Looker and add it to the rest of the data streams I’ve brought together there, so I can see, for example, whether the kitchen lights get turned off after everyone’s gone upstairs after breakfast, or whether I should try and automate those as well as my kettle.

Looker 5 introduced another way to visualize data with defined start and end events in the form of the new Timeline Chart, a feature I tweeted about the other day when I brought data on tasks from Asana into my BigQuery data warehouse via a Fivetran integration.

It’d be interesting to use this type of chart to show how one particular light switch changed from on to off and then on again over a period of time, or to take all the Hue lights for a given part of the house and see whether anybody ever turns the lights off … or whether they need to be linked to the motion sensors and automatically turned off after a period of inactivity in each room.

To use this type of chart I first need to create an “end date” field for each of my event records, so that each timeline has a start and end date, with the label and value fields being optional. Looking at the list of events for the light switch I looked at earlier, ordered by the time they were recorded, you can see the information I need is there:

device = 'kitchen seating area spot 1'
AND raw_value IN ('off','on')

What this calls for is the LEAD BigQuery Standard SQL analytic function, which returns a value from a subsequent row relative to the one we’re working with; I can combine that with the TIMESTAMP_DIFF function to compute the time between the start and end event dates, as that would be useful to know too. The SQL expressions for the two new fields would therefore be:

LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) as end_date_time,
TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, second) as value_duration_seconds
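To see what LEAD is doing, here’s the same partition-then-shift logic sketched over a small in-memory sample in Python, with date_time held as epoch seconds for brevity (the rows are illustrative, not the real BigQuery table):

```python
from itertools import groupby

def add_end_times(rows):
    """Emulate LEAD(date_time) OVER (PARTITION BY device, metric ORDER BY date_time):
    each event's end time is the next event's start time within its partition,
    and the duration is the difference between the two."""
    out = []
    key = lambda r: (r["device"], r["metric"])
    ordered = sorted(rows, key=lambda r: (r["device"], r["metric"], r["date_time"]))
    for _, grp in groupby(ordered, key=key):
        grp = list(grp)
        for cur, nxt in zip(grp, grp[1:] + [None]):
            end = nxt["date_time"] if nxt else None  # LEAD returns NULL on the last row
            out.append({**cur, "end_date_time": end,
                        "value_duration_seconds": end - cur["date_time"] if end is not None else None})
    return out
```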

I could add these column expressions into the SQL view definition in BigQuery and order the rows by device, metric and timestamp, like this:

select
  date_time as date_time,
  device as device,
  LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) as end_date_time,
  TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, second) as value_duration_seconds
from `aerial-vehicle-148023.personal_metrics.smartthings_readings`
where device = 'kitchen seating area spot 1'
and metric = 'switch'
and raw_value in ('on','off')
order by 2,3,1

Executing that SQL shows the logic is working, but I’d then have to maintain that view within BigQuery, and that might not be the most convenient place to add new code.

Instead, I could just go into Looker and create a new view there based on a derived table SQL expression and do the work there. Right now my LookML model looks like the excerpt below, where you can see the sensor readings view joined into the rest of the explore so all my data can be analyzed together.

connection: "rittman_bigquery"
include: "*.view.lkml"         # include all views in this project
include: "*.dashboard.lookml"  # include all dashboards in this project

explore: date_dim {
  case_sensitive: no
  label: "Data Warehouse"
  join: fluentd_transactions {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
    relationship: many_to_many
  }
  join: smartthings_readings {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
    relationship: one_to_many
  }
  join: fluentd_uber_rides {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
    relationship: many_to_many
  }
}

I now create a new LookML view that uses a derived table SQL query as the view definition, rather than simply referencing an existing BigQuery table. Note how I’ve used the same view_label as the main LookML view containing my event data, so that the dimensions and measure I define here appear alongside all the other smart device fields in the same explore view.

view: device_event_end_and_timespan {
  view_label: "6 - Smart Devices"
  derived_table: {
    sql: select
           date_time as date_time,
           device as device,
           metric as metric,
           LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) as end_date_time,
           TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, second) as value_duration_seconds
         from `aerial-vehicle-148023.personal_metrics.smartthings_readings`
         order by 2,3,1 ;;
  }
  dimension: date_time {
    type: date_time
    hidden: yes
    sql: ${TABLE}.date_time ;;
  }
  dimension: device {
    type: string
    hidden: yes
    sql: ${TABLE}.device ;;
  }
  dimension: metric {
    type: string
    hidden: yes
    sql: ${TABLE}.metric ;;
  }
  dimension_group: end_date_time {
    group_label: "End Date"
    label: "End"
    type: time
    timeframes: [raw, time, date, week, month]
    sql: ${TABLE}.end_date_time ;;
  }
  dimension: pk {
    primary_key: yes
    hidden: yes
    type: string
    sql: concat(cast(${TABLE}.date_time as string), ${TABLE}.device) ;;
  }
  measure: value_duration_seconds {
    type: average
    sql: ${TABLE}.value_duration_seconds ;;
  }
}

Then I join this derived table view back into the explore within my LookML model, as below:

connection: "rittman_bigquery"
include: "*.view.lkml"         # include all views in this project
include: "*.dashboard.lookml"  # include all dashboards in this project

explore: date_dim {
  case_sensitive: no
  label: "Data Warehouse"
  join: fluentd_transactions {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
    relationship: many_to_many
  }
  join: smartthings_readings {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
    relationship: one_to_many
  }
  join: device_event_end_and_timespan {
    type: left_outer
    sql_on: ${smartthings_readings.date_time} = ${device_event_end_and_timespan.date_time} and
            ${smartthings_readings.device} = ${device_event_end_and_timespan.device} and
            ${smartthings_readings.metric} = ${device_event_end_and_timespan.metric} ;;
    relationship: one_to_one
  }
  join: fluentd_uber_rides {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
    relationship: many_to_many
  }
}

and now I can create timeline charts that show me which lights were on over a particular period of time, like this:

or more importantly, work out why the bathroom never seems to be free when you’ve got two teenage kids in the house.