Introducing MJR Analytics … and How Two Years Go So Fast When You’re Learning Something New

Today I’m excited to be launching MJR Analytics, a new consulting company focusing on modern, cloud analytics projects using technology from Looker, Qubit, Fivetran, Oracle and Snowflake and development techniques learnt from my time working as an analytics product manager at a startup in London.


Our new website (and no that’s not me sitting in the chair)

So what have I been up to in the two years since I left my old consulting company, and how has that experience and the way I’ve been working with analytics technologies over that time inspired me to start another one?

Two years ago I announced on Twitter that I’d left the company I’d co-founded back in 2007 and intended to now take on a new challenge, and then spent the rest of the week at Oracle Open World cycling over the Golden Gate Bridge and coming back on the ferry and trying to decide what that challenge might actually be.


For most of my time in the IT industry I’d been working on projects implementing products from vendors such as Oracle and I’d always been interested in how these products came to market, how software vendors came up with a strategy and roadmap for those products how the team behind those products worked with the engineers who built them.

I’d also become increasingly interested in the startup world and towards the end of time time at Rittman Mead had taken-on an informal role advising Gluent, Tanel Poder and Paul Bridger’s product startup who were building software that enabled big enterprise customers to offload their data warehousing workloads from expensive proprietary databases onto to cheap, flexible storage and processing running on Hadoop clusters.

What appealed to me about working more formally with Gluent was the opportunity it gave me to work with two smart founders and an even smarter development team developing a product built entirely on big data technology I’d until then only scratched the surface with on consulting gigs. The product marketing role I took on was all about establishing what market that product was primarily intended for, how we went about positioning the product to appeal to that market and how we then brought that product to market.

Understanding these four things are crucial if you’re going to actually get customers to buy your startup’s product:

  • who is the buyer

  • what problem does your product solve

  • what is the value solving that problem, and

  • why you’re the first product to solve it for them

otherwise you’ll spend your time building a solution to a problem that nobody actually has, and that’s the reason the majority of tech startups end-up failing. Solving a problem for a market that’s willing to pay you money to solve is called “product/market fit” and if your product has it, and you’ve built your business such that it scales linearly as more customers discover your product, then you’re going to make a lot more money than a consultancy constrained by how many hours in the week your consultants can work and the upper limit on how much you can charge for a single person’s time.

I also learnt the distinction between product marketing, product management and product development in my time at Gluent. Going back to my time as a consultant attending product roadmap sessions at conferences I never quite knew which parts of the product team those speakers came from, but in summary:

  • Product Marketing is about taking a product that’s typically already built and then deciding the product’s positioning and messaging, then launching the product and ensuring the sales team, sales engineering and customers understand how it works and what it does; as such, this is a marketing role with a bit of technical evangelism thrown in

  • ProductDevelopment is the actual building of the product you’re looking to sell, and requires an engineering skillset together with the inspiration that typically came up with the product idea in the first place along and an entrepreneurial side that made you want to build a company around it

  • Product Management is more of a customer-facing role and is about understanding what your customers want and what their problems and use-cases are, and then creating a strategy, roadmap and feature definition for a product that will meet those needs

Despite my undoubted product marketing skills based around PowerPoint and internet memes:


In product marketing, it’s never too soon to put a Santa hat on a photo of the founder

in the end it I realised that it was product management that interested me the most and, after a couple of meetings with an old friend who used to run product management at Oracle for their business analytics product line and who had recently moved to London and now lead the product team team at Qubit, a technology startup created by four ex-Googlers building marketing technology products based-around Google’s big data and cloud technology, I joined their team later in 2016 as product manager responsible for the analytics features on their platform.

I spoke about the Qubit and the partnership we established with Looker back in May last year at a presentation at Looker’s JOIN 2017 conference in San Francisco and the slide deck below from that event goes into the background to the product and the problem it solves, helping customers using Qubit’s personalization platform make more effective use of the data we collected for them.

The product and data engineering teams at Qubit did an excellent job bringing together the features for this product and in hindsight, the bits I was most proud of included:

  • The business metadata layer we created on Google BigQuery and Google Cloud Platform to translate an event-level normalized many-to-many data model designed for fast data ingestion into a denormalized, dimensional data model designed for easy use with BI and ETL tools

  • Additional integration we created for the Looker BI tool including a set of industry vertical-specific Looker models and dashboards we then made available on the Looker Block Directory and in a Github public repo


Screenshot from Personalization Analytics Block for Looker by Qubit

  • The multi-tenant data warehouse and hosted Looker instance we then put together to enable customers without their own Looker instance to make use of their data in Google BigQuery, doing so in a way that supported per-tenant extensions and customizations by the customer or their implementation partner.


Technical Architecture for Live Tap as presented at Looker JOIN 2017

What I’ll take-away from my time at Qubit though was the incredible amount that I learnt about product management, product engineering, how to build and run a successful startup and team who are still highly-motivated seven years in and the introduction it gave me to the analytics and data-led world of digital marketing, eCommerce and modern data analytics platforms.

Consulting is a popular route into product management and the experience I brought to the role in areas such as business metadata models, analytical techniques and the needs of BI and ETL developers proved invaluable over the eighteen months I worked as part of Qubit’s product and engineering teams, but moving into product management within a young, technology-led startup founded by ex-Googlers and working with some of the smartest and most innovative people I’ve ever met involved learning a whole new set of skills including:

  • Developing on a new technology platform (Google Cloud Platform) within a new industry (eCommerce and digital marketing) and understanding a whole new set of analytics use-cases and customer roles (A/B testing, stats models and event-based analytics used by analysts and strategists within eCommerce businesses) that I described in a presentation at last year’s UK Oracle User Group Tech Conference in Birmingham:

  • Working as part of a team rather than directing that team, and managing -up as well as down, a technique I had to relearn pretty quickly in my first few months in the role

  • Learning to achieve my goals through influence rather than in the top-down way I’d been used to getting things done leading customer projects, and as CTO and owner of the company that team worked for

  • Saying no to customers rather than yes as you did as a consultant, as your objective is to build a product that solves the most important customer needs but doesn’t burden you with so many features addressing niche use-cases that you end up with Homer’s car and can’t innovate the product in future releases

  • How to take a product through its lifecycle from identifying a need that makes sense for your company to meet, through prototyping, alpha and beta releases to successful first launch and then creating a strategy and roadmap to manage that product over its complete lifecycle

  • How to use a new generation of modern, cloud-native data analytics tools such as Looker together with products such as FiveTran, Google Cloud Platform, Qubit, Snowflake DB and Snowplow Analytics that were increasingly also being adopted by the FinTech, MarTech and B2C startups clustering in London and other European/North American tech hubs

I learnt so much from my colleagues at Qubit about products, engineering and building a successful and motivated team that put up with my jokes and built the most technologically-advanced marketing personalization platform on the market.

But what my time at Qubit also made clear to me was that, when it came down to it, what really motivated me to get up in the morning, learn all these new technologies and still be wildly excited to come into work in the morning twenty years later was:

  • using data and analytics to find new insights and uncover new opportunities in a customer’s data set

  • working with that individual clients, over time, to enable them to find more of those insights and opportunities themselves

  • find new innovations in analytics technologies and how we deliver projects to make this process cheaper, faster and more likely to be successful

  • and building a team, and crucially a business, to do all of this at scale and offer a full set of analytics-related consulting services built around modern analytics tools and delivery techniques

Which is why after two years away from the consulting business and two enjoyable, rewarding and enlightening years working on the other side of the data and analytics industry I’m now launching my new consulting company, MJR Analytics; and I hope to be working with many of you as clients or members of our team over the coming months and years.

Date Partitioning and Table Clustering in Google BigQuery (and Looker PDTs)

Google BigQuery is a data warehousing-orientated “table-as-a-service” product from Google Cloud Platform that, like Oracle Exadata, optimizes for full-table scans rather than selective row access and stores data organized into columns, rather than rows, to align better with filtering and aggregation workloads associated with data warehousing workloads.

BigQuery charges by amount of data stored in your tables and the data you’ve read in executing SQL queries against those tables, so BI tools such as Looker that work efficiently with BigQuery only request the columns they need when querying BigQuery datasets rather than running a SELECT(*) and throwing away what’s not needed.

To illustrate this let’s run a query that requests all columns (“SELECT (*) FROM …”) from a BigQuery table, and as you can see from the screenshot below it’s reading through all 1.47GB of table data to return that full set of columns to the user.

If the users’ query only really needed just two of those columns, requesting just those brings down the amount of data read to just under 10MB as most of of that table’s data is stored in columns of data that aren’t needed for the query.

BigQuery historically has supported table partitioning based on the date you loaded data into the table which sounds great until you realise that it’s the transaction date, not the ingest date, that most of your user’s queries filter against.

You could also use table decorators in Legacy SQL to point to the particular day partition your data was stored within but this only went back for a maximum of seven days and required your query tool to support this non-standard feature of SQL; earlier in this year though Google introduced a more flexible form of date partitioning as a beta release feature that allows you to choose the date column your table would be partitioned by, and more recently introduced a feature called table clustering that stores data within a table sorted by the columns you most commonly filter on when running queries against it.

To show how date partitioning and table clustering work, I’ll start by running a query to return just a month’s data from the five years of data held within my table; as you can see in the screenshot below, BigQuery performs a full table scan and reads through all 1.37 GB of data in the table to return just the month of data my query requested.

Standard SQL now supports DDL commands such as CREATE TABLE and CREATE TABLE … AS SELECT, along with a PARTITION BY clause that lets you specify a timestamp or date column to partition the table by. I’ll use these new features to create a copy of that same table, this time partitioned by the timestamp column I’m filtering on in my query predicate …

… and the DDL statement fails. What’s happened there then?

Turns out that BigQuery tables are limited to 2500 partitions for the moment, with any one single load operation limited to 2000 (hence the error) and with partitioning limited to just date and timestamp columns and partitions a day in length it means any one table can only hold around five years worth of data, beyond that you’ll need to create multiple date partitioned tables and UNION them together through a SQL view.

For now though I load my table with just five years of data and then re-run the query that requests a single day from that overall five years; now BigQuery has only read and processed 57 MB of data and it’d be a fraction of that if I only requested the two columns I needed, not all columns from that table.

But what about queries that filter against the other columns in this table? We can’t set up table partitioning on STRING, INTEGER or any other type of column datatype so my original query if re-run against the date partitioned table reads just as much data as it did before.

What we could do is re-create the table with its data pre-sorted by those two particular columns using another new feature called table clustering, so that queries that filter against those columns find the blocks of data they’re looking for faster and can skip completely the ones that don’t.

If like me you’re more familiar with databases such as Oracle, “table clustering” is all about storing data from tables sharing a common cluster key together in the same data blocks, so that queries against that group of tables filtering on that key return data faster.

Table clustering in BigQuery is more analogous to loading regular Oracle tables using data from a pre-sorted file and comes with the same benefits and limitations; in BigQuery’s case it takes care of the pre-sorting and table configuration for you but the same limitations still apply around how you filter the table and what happens when you load more data afterwards.

Let’s set-up a clustered table now that stores its data ordered by the two columns used in the query I ran a few moments ago.

Now when I run a query filtering on those columns against this partitioned, clustered table the amount of data read goes down compared to before, and results are returned a bit faster; if I included the partition key column in the query as well, returning just a few days’ data, it’d be faster still.

But queries that filter using any of the other columns don’t benefit from this clustering and moreover, the benefits of sorting the data initially loaded into a clustered table are lost over time as you load more (unsorted) data into it, meaning that to really benefit from clustering a table you have to rebuild it regularly using a CREATE TABLE … AS SELECT.

Table clustering in BigQuery is nice to have but similar to pre-sorting a direct path load into Oracle database tables, it’ll take a lot of rebuilding and careful querying to get the most benefit from and with the size of most BigQuery tables, I doubt that rebuilding will happen much in-practice.

BI tools such as Looker can make use of table partitioning and clustering in queries straight away as no changes are required in the query SQL you write, everything is handled in the table definition. Where you might want to set up partitioning yourself as a Looker developer is for the persistent derived tables (PDTs) that Looker can create to materialize the results of a view you define in your Looker model to derive a dimension attribute using a subquery, for example to calculate the sequence number of a users’ order for retention analysis for an eCommerce dashboard as shown in the screenshot below.

Looker has for some time come with database-specific settings for particular source database types such those used for Redshift used in the example above, and now supports date partitioning for persistent derived tables through a new partition_keys setting as announced in this recent Looker forum post.

Finally, if you’re interested in how partitioning is developing as a feature within BigQuery, and some of the edge-cases and requests for additional functionality that users of partitioning are asking for, this feature request on the BigQuery issue tracker is useful to read-through.

Extending support to more than 2500 partitions seems to be the most popular request along with allowing integer and string datatype columns to be used for the partition key, but also look out for issues around re-loading data into historic partitions and the cost and work involved in rebuilding large tables to re-cluster its data or apply partitioning for the first time.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.3

In this series of three blogs on Oracle Analytics Cloud Data Lake Edition I’ve setup an object store data lake in Oracle Cloud using Oracle Big Data Cloud and Oracle Storage Cloud, and ingested streams of real-time event data from IoT and social media sources into Oracle Cloud’s object storage service using Oracle Event Hub Cloud Service.

The event-stream data I staged into Storage Cloud was then copied into parquet files on HDFS and then presented out to BI and ETL tools through Big Data Cloud’s Thrift Server interface, so that now I’m ready, after a short diversion into defining the data engineer role that would typically work with this new product edition, to start exploring some of Oracle Analytics Cloud Data Lake Edition’s new data flow and predictive model preparation features.

The diagram below shows where OAC Data Lake Edition fits into my project architecture, performing the tasks of transforming and enriching the incoming dataset and then presenting my at-scale data out to end-users for analysis using OAC Data Lake Edition’s Data Visualization features.

Looking at the homepage within OAC Data Lake Edition I can see my two Hive tables listed within the dataset catalog, alongside other datasets I’d uploaded directly into OAC. This visual catalog of available datasets is also the new homepage interface that OBIEE12c now adopts, with both cloud and on-premises versions of Oracle’s BI tools now relegating the old “Answers” homepage to something you have to dig around and specifically look for in favour of this more self-service Data Visualization starting page.

I’ll have to write an article on Answers and how powerful its interface is, and the full dimensional model it exposes from the Oracle BI Repository, in a blog post sometime in the future as it’s almost in danger of getting forgotten about.

Moving on though, the first transformation I need to do on all the incoming datasets is to take the timestamp column in each table and convert it to a format that OAC recognises as a valid TIMESTAMP datatype format, then convert those columns to TIMESTAMPs so that DV can automatically enable time-series analysis by day, month, quarter, hour and so on. I do that using a feature that’s also present in OAC Standard Edition, the lightweight data preparation interface that’s presented to users when they first add a new data source into OAC’s dataset catalog, shown in the screenshots below.

Where OAC Data Lake Edition gets really interesting right now both in terms of differences vs. the on-premises versions of OBIEE I used to use, and in terms of it’s “data engineering” potential, is with a feature called Data Flows.

Most self-service BI tools now have a basic data loading and data preparation capability today with Tableau Data Prep being one of the latest examples. Designed to handle more complex data prep use-cases than basic datatype changes and field-splitting, they give end-users the ability to do this type of work themselves rather than trying to do it in Excel or handing the work off to the IT department and having to wait days or weeks to get that data back.

Data Flows are a feature that’s been introduced since the original on-premises version of OBIEE12c that I last used when working out in consulting, and provide you with what’s effectively a lightweight, multi-step ETL tool that executes transformations using the BI Server’s Model Extension feature, introduced back when OBIEE12c first came out as the mechanism to enable on-the-fly data mashups between server-side and user-uploaded datasets.

Looking at the transformation operators available in OAC Data Lake Edition v4 there’s quite a few that apply to data lake and data engineering-type workloads including running Python statistical analysis scripts and predictive model training and model build; there’s also an operator for creating an Essbase Cube, with Essbase in this instance positioned as a fast ad-hoc analysis back-end for use with the data visualization part of OAC.

For now though there’s two transformation tasks I want to do with my Hive datasets; first, enrich the incoming social media data by analyzing the sentiment in each tweet and then writing the data plus this sentiment tagging back to the Oracle Big Data Cloud environment, so that I can then turn those sentiment tags into a score and create a data visualization showing who sends me the most tweets and how crazy they are overall.

The second data enrichment I wanted was on some Strava cycling workout data I’d uploaded directly into OAC using the CSV file upload facility; using the model train and build Data Flow operators I defined a model to predict how many “kudos”, the Strava equivalent to Facebook “likes”, I’d get for a given cycle workout with a number of different variables available to the model in order to make the prediction — for example, distance and elevation gain, map location, effort expended and so on.

Then, after running the model build step and looking at the predicted values and the actual ones for the remainder of the dataset not used for model training, you can see the predicted kudos values are fairly in-line with the ones I actually recorded for those rides.

Another feature that’s now in Oracle Analytics Cloud is automated data diagnostics, or Explain. Explain uses machine-learning libraries and that same model extension/XSA BI Server framework to help users quickly understand the value distribution and statistically correlated driving factors for a particular dataset, and learn which segments or cohorts have the highest predictive significance. Enabled by a bunch of extensions to BI Server logical SQL I used the feature first on the sentiment scoring I’d performed earlier on, and then on the steps data I’d brought into Oracle Big Data Cloud from my Fitbit device, after converting the numeric step counts into a text attribute by bucketing its values into low, medium and extreme bucket values.

This is pretty powerful stuff, with automated understanding and context-gaining about new datasets being one of the most user-enabling features I’ve seen arrive recently in BI tools with the best example of this being BeyondCore, now part of Salesforce Einstein. OAC lets the user pick the most useful of the Explain facts and driver insights and publish them to a Data Visualization dashboard like the one below, showing the most predictive and significant variables in my dataset that influence the steps I take each day.

Which leads neatly to the final “data at-scale” feature in OAC, the Data Visualization feature that in my case is querying the ingested, transformed and now enriched datasets I’ve got running on my Oracle Big Data Cloud instance alongside Oracle Event Hub Cloud and Oracle Analytics Cloud Data Lake Edition.

Thank you once again to the Oracle ACE Director program for providing access to Oracle Analytics Cloud Data Lake Edition, Oracle Big Data Cloud and Oracle Event Hub Cloud services over the past few weeks. If you’re looking to try these new services out there’s free trials available for most of Oracle’s Cloud products and many of the new features are also available in Oracle Data Visualization Desktop 12c and Oracle Business Intelligence 12c, both of which can be downloaded for training and evaluation under the OTN license scheme.

Wrapping-up this three part series on Oracle Analytics Cloud Data Lake Edition and Oracle Big Data Cloud I’d like to go back to the two (serious) questions I asked myself at the end of the previous post:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product Oracle Big Data Discovery, officially still available and not quite dead yet but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask and he’s actually gone to Cloudera

So has Oracle Analytics Cloud Data Lake Edition got much to do with actual “data lakes”? Well … it integrates with Oracle Big Data Cloud and apparently comes with an option to run those data flow transformation in Big Data Cloud’s Apache Spark environment, though to be fully-transparent I didn’t see that as an option when doing my evaluation so can’t comment on how well or not that works.

Like Oracle Big Data Discovery before it, OAC Data Lake Edition makes you structure your incoming event stream data with Hive table metadata before you can work with it, but that’s actually fairly standard practice with most data visualization tools that work with Hadoop and data lake environments.

Having Essbase in this product package, alongside the data lake functionality, did make me scratch my head a bit and wonder, “why?” — data lakes and Essbase are about as opposite as you can get in terms of target users and use-cases and I think this Data Lake Edition is as much about creating a product package and price point that’s mid-way between the OAC Standard and Enterprise Edition.

But there is some logic to having Essbase in this edition; it provides a set of easy-to-use loading and preparation tools for Essbase making it easier for customers new to that product to start using it, and Essbase with its latest hybrid ASO/BSO storage format is surprisingly scalable and blindingly-fast to query, a great potential back-end for enabling data analysis “at-scale” using Oracle Analytics Cloud’s data visualization features.

I also get the feeling that this initial v4 version of OAC Data Lake Edition is more of an initial first-cut release to get something out to customers, establish the product package and validate the roadmap and market assumptions. Oracle Analytics Cloud v5 isn’t too far off and I’d expect incremental improvements and new features in areas such as natural language processing and machine learning built-into the developer experience; I wouldn’t be surprised to see Oracle Big Data Preparation Cloud making its way into the product given its obvious fit and overlap with Data Lake Edition’s data prep features.

But where I really see an interesting future for OAC Data Lake Edition is when it starts to integrate product features and the development team from Oracle’s recent acquisition of Sparkline Data.

I came across SNAP, Sparkline‘s platform for building OLAP-style dimensional models over data lakes and cloud object storage layers about a year ago when researching analytics platforms at Qubit and quite frankly, it’s as revolutionary in terms of todays data lake analytics market as OBIEE (or the nQuire Server) was back in 1995 with its virtual data warehouse over application and data warehouse sources.

Take these two slides from the Sparkline website and imagine them as the future of Oracle Analytics Cloud analyzing event-streams and big data cloud-hosted datasets…

and you can see why I’m keen to see where Oracle Analytics Cloud Data Lake Edition goes over the next couple of years. I’m speaking on OAC Data Lake Edition at ODTUG KScope’18 in Orlando in just a couple of weeks time so come along if you’re there, it should be an interesting talk.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.2

A couple of weeks ago I posted a blog on Oracle Analytics Cloud Data Lake Edition and the Oracle Cloud big data products you’d typically use to create a cloud-hosted, object storage data lake. In this and the final post in the series I’m going to look at some of the data transformation and predictive model building features that come with the Data Lake Edition of OAC and extend the base capabilities in this area that OAC Standard Edition comes with, and the data engineering role that uses the data lake this Oracle Analytics Cloud product packaging option is designed for.

As a reminder, OAC Standard Edition is more-or-less Oracle Data Visualization running as a single-tenant cloud-hosted VM (cluster) that’s either customer managed, or Oracle-managed if you go for the new Autonomous Analytics Cloud that interestingly is charged at a lower hourly rate than the customer-managed version; presumably it costs Oracle less to run OAC when it’s doing all the administration rather than letting customers control the settings, which makes sense.

It does make you wonder when all this VM nonsense will go away though as well as the distinction between OAC and the DVCS/BICS managed service cloud analytics products, with all of these services just becoming packaging options for a multitenant service partitioned by customer instance as the likes of Looker and Google Cloud Platform do today.

But for now, Oracle Analytics Cloud Data Lake Edition is described on the product homepage as having the following high-level capabilities compared to OAC Standard Edition:

  • Accelerated analysis via Essbase
  • and support for creating scalable, shareable high performance data flows.

whilst the product documentation states that it includes all the OAC Standard Edition features, plus:

  • Explore and replicate data
  • Create scalable data flows including ingestion, transformation, and persistence
  • Execute data flows by using Apache Spark or Oracle Database Cloud

with two edition-specific scenarios listed:

  • Explore Data Lakes, described as “You can create data flows to visualize data in your data lakes”, and
  • “Replicate Data in Data Lakes”, explained as “You can replicate data from Oracle Service Cloud (RightNow) into Oracle Big Data Cloud Service, Oracle Database Cloud, and Oracle Database. For example, you can replicate data from customer service records in Oracle Service Cloud into your data lake on Oracle Big Data Cloud Service and then perform further data preparation and analysis with data flows and data visualization projects.”

I’m guessing most readers won’t have heard of Oracle Service Cloud but for Oracle Cloud customers it’s quite a useful initial integration, as Service Cloud is Oracle’s cross-channel digital customer call centre application that brings together customer service interactions across web, social media and offline service channels. More interestingly though and you may have missed it back at the start but Data Lake Edition is the edition that ships Essbase, Oracle’s multi-dimensional OLAP server, in what almost seems like an afterthought right now but will become, I suspect, a more significant defining feature of this product package as time goes on.

For the purposes of this series of blog posts though, what I’m really interested in are the data engineering features in OAC Data Lake Edition — but before I start looking into what they might be, I guess I’d better explain what I mean by a data engineer.

I first came across the term data engineer in a blog written by Curt Monash — still the #1 person I’d love to have come on the Drill to Detail Podcast, if you happen to read this Curt — where he wrote:

“I learned some newish terms on my recent trip. They’re meant to solve the problem that “data scientists” used to be folks with remarkably broad skill sets, few of whom actually existed in ideal form. So instead now it is increasingly said that:

- “Data engineers” can code, run clusters, and so on, in support of what’s always been called “data science”. Their knowledge of the math of machine learning/predictive modeling and so on may, however, be limited.

- “Data scientists” can write and run scripts on single nodes; anything more on the engineering side might strain them. But they have no-apologies skills in the areas of modeling/machine learning.”

I think it’s probably true to say that the term really started to enter people’s consciousness, at least the sorts of people I hang around with, when Maxime Beauchemin from Airbnb wrote his seminal blog “The Rise of the Data Engineer” where he introduced this new discipline as:

“Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization.

Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.

In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.

In smaller companies — where no data infrastructure team has yet been formalized — the data engineering role may also cover the workload around setting up and operating the organization’s data infrastructure. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like.”

Maxime’s blog struck a chord with me as that’s pretty much how I’d describe what I do, or at least it’s the work I most enjoy doing; building big data systems to support BI and data warehousing workloads where my knowledge of how Hadoop, cloud and distributed data platforms complements the knowledge that other colleagues have around data science, machine learning and statistical analysis though importantly, I know enough about those areas to be able to deploy the models they build (or as we say over in the UK at least, I know enough about them to be dangerous).

I interviewed Maxime about his blog and the topic of data engineering on the Drill to Detail Podcast last May and if you’ve not done so already I’d recommend you give the episode a listen; for me though, having just spent the last eighteen months working in the London tech startup scene, my definition of a data engineer would be someone who’s just out of university and has an ironic beard, thinks he’s made ETL tools obsolete but is slowly, painfully, recreating one and is now just on the brink of discovering code templating, deployment contexts, change control and data lineage…

… but earns twice what you earn and gets all the girls. Because, data engineers are even more essential than data scientists if you’re serious about building a data lake, and data lakes are the new data warehouses.

Which bring us back to Oracle Analytics Cloud Data Lake Edition and these three blog posts. The first part of this three part series talked about Oracle Big Data Cloud and products such as Oracle Event Hub Cloud — Dedicated (who names these products at Oracle? Seriously?) and in this second I seem to have got rather diverted by trying to define what Oracle Analytics Cloud Data Lake Edition actually is and then even more so by defining the supporting role of a data engineer; in the final post tomorrow I’ll take a look at some of the data transformation, model definition and deployment and data engineering features in OAC Data Lake Edition and then, finally, conclude by asking myself two questions — OK, three:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product Oracle Big Data Discovery, officially still available and not quite dead yet but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask and he’s actually gone to Cloudera
  3. Will the ironic beard and quiff be enough, or do I need to go full ginger if I’m going to really be taken seriously as a data engineer?

Final part in the series on Oracle Analytics Cloud Data Lake Edition comes later this week.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.1

Some time ago I posted a blog on what analytics and big data development looked like on Google Cloud Platform using Google BigQuery as my data store and Looker as the BI tool, with data sourced from social media, wearable and IoT data sources routed through a Fluentd server running on Google Compute Engine. Overall, the project architecture looked like the diagram below…

… and I’ve got more-or-less the same setup running right now, with an additional GCE VM running Confluent Open Source to feed a subset of the event streams into a Druid cluster that I’m using to test out Looker, Superset and Imply for sub-second ad-hoc query analysis use-cases. More on that soon.

If you’re a regular reader of this blog you might recall a week or so ago I posted a blog on the new releases of Oracle Business Intelligence Enterprise Edition (OBIEE), Oracle Analytics Cloud (OAC) and Oracle Data Visualization Desktop (DVD) and mentioned a new packaging option for their cloud analytics product, Oracle Analytics Cloud Data Lake Edition. I’ve got a particular interest in what this product might be as I used to use the product it replaces, Oracle Big Data Discovery (BDD), fairly extensively in Oracle big data analytics and data lake projects a few years ago.

And Oracle Big Data Discovery was — technically at least — a great product. It combined the search and analytics features of Endeca Information Discovery with the scale and data transformation abilities enabled by Apache Spark and Hadoop, but suffered perhaps by being a bit ahead of the market and by not having any obvious integration with the rest of Oracle’s analytics and data management tools. By contrast Oracle Analytics Cloud Data Lake Edition is one of three packaging options for Oracle Analytics Cloud and includes all of the functionality of OAC Standard Edition (Oracle Data Visualization together with basic data preparation tools) as well as itself being a subset of the wider set of analysis, dashboarding and enterprise data modeling features in OAC Enterprise Edition.

An equivalent product architecture for ingesting, transforming and analyzing my IoT, wearables and social media data in Oracle Cloud would look something like the diagram below, with the following Oracle Cloud Platform-as-a-Service (PaaS) products used for ingest, storage and analysis:

  • Oracle Event Hub Cloud Service: Apache Kafka running either customer or Oracle-managed with full access to Kafka’s REST Proxy and Kafka Connect
  • Oracle Big Data Cloud: Oracle’s new elastically-scalable Hadoop platform running Apache Spark, Ambari and other Hortonworks Data Platform components
  • Oracle Analytics Cloud Data Lake Edition: Oracle Data Visualization combined with more extensive lightweight ETL (“data flow”) components, text analytics and machine learning model training and build capabilities

In this example I’m using Apache Hive and Parquet storage as my column-orientated data platform but of course I’ve now also got Oracle Autonomous Data Warehouse Cloud as an option; I’ll stick with Hive on Oracle Big Data Cloud for now though as this gives me the option to use Apache Spark to transform and wrangle my data and for building machine learning models using SparkML and, via pySpark, Python Pandas. In what’s the first of two posts in this short series I’ll be looking at how the data pipeline is set up, and then in the second post I’ll look at Oracle Analytics Cloud Data Lake Edition in detail focusing on the data transformation, data engineering and data science features it adds beyond OAC Standard Edition.

The development environment I put together for this scenario used the following Oracle Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) components:

  • Oracle Compute Classic and Storage Classic Services
  • Oracle Database Cloud Service, with the 11g database option
  • Oracle Event Hub Cloud Service Dedicated, with Kafka Connect and REST Proxy nodes
  • Oracle Big Data Cloud, single node with Hive, Spark 2.1, Tez, HDFS, Zookeeper, Zeppelin, Pig and Ambari
  • Oracle Analytics Cloud Data Lake Edition with Self Service Data Preparation, Visualisation and Smart Discovery (aka Oracle DV)

The screenshot below from the Compute Classic Service Console shows the various PaaS VMs running on the compute IaaS layer, with other tabs in this console showing network, storage and other infrastructure service usage.

The order in which you install the services is important if you want to associate the various products together correctly, and if like me you’re using a trial account you’ll need to plan carefully to ensure you keep with the various quota limits that Oracle Cloud imposes on trial identity domains that stopped me, for example, allocating the usual two OCPUs to the main Oracle Event Hub Kafka server if I wanted to run the rest of the stack at the same time.

Associating two services together, for example Oracle Event Hub and Oracle Big Data Cloud, connects the two services together automatically in the identity domain network and makes using them together much simpler but unless Event Hub Cloud is provisioned and available when you come to install Big Data Cloud you can’t go back and associate them afterwards, but more annoyingly if you decide you want to use Event Hub Cloud and associate it with Big Data Cloud but since then you’ve provisioned Oracle Analytics Cloud Data Lake Edition and associated Big Data Cloud with that you have to unwind the whole provisioning process and start again with Event Hub Cloud if you want to connect them all together properly. And forget deleting that Database Cloud Service you associated with Oracle Big Data Cloud and then forgot about as you can’t delete services that other services are associated with.

Provisioning each of the services involves giving the service instance a name, assigning storage buckets and OCPU counts to the various cluster nodes you request, and at key points selecting previously provisioned and now running services for association with the one you’re now provisioning. The screenshots below show the three-stage provisioning service for Event Hub Cloud Service — Dedicated:

Provisioning a new Event Cloud Service — Dedicated cluster

and the order in which I provisioned my data lake services, and the important options I chose to make it all work together and within quota, were as follows:

  1. First ensure you have access to Oracle Cloud, and then the various IaaS services: Oracle Cloud Infrastructure Compute Classic, Oracle Cloud Infrastructure Object Storage Classic and Oracle Identity Cloud Service
  2. Provision Oracle Database Cloud Service to store the various Fusion Middleware RCU schemas; in my case I chose Oracle Database 11gR2 as the database type as it avoids the complexity around CDBs and PDBs you get with the 12c database release
  3. Then provision Oracle Event Hub Cloud — Dedicated with one OCPU for the main Kafka VM, one for the REST Proxy VM and another one OCPU for the Kafka Connect VM. For a real deployment you’d want at least two OCPUs for the Kafka service VM but using just the one kept me within my overall OCPU quota limit when installing the rest of the stack
  4. Next step is to provision Big Data Cloud with a single node with the minimum 2 OCPUs, the Full deployment profile and version 2.1 of Apache Spark as that’s the version OAC Data Lake Edition insist on in the next step. When prompted, choose the option to associate Event Hub Cloud and Database Cloud with Big Data Cloud as you won’t get the option to do this again after the initial service provision; once provisioned, open-up the TCP port for Ambari (8080) to the public internet so that OAC in the next step can associate with it — provisioning for OAC failed for me every time until I looked through the provisioning logs and spotted this as the issue
  5. Finally, provision Oracle Analytics Cloud and choose Data Lake Edition as the package option, again in my case assigning a single OCPU and selecting Data Lake Edition as the software version

At that point if you then bring up the Cloud Services dashboard and review the services together for the first time, it’ll look something like this:

Oracle Data Lake stack within Cloud Services Dashboard

Now it’s time to ingest some data and land it into Oracle Big Data Cloud.

The streaming IoT, wearables and social media comms data that I’ll be ingesting into Big Data Cloud will be coming in from the public internet over TCP, and I’ll also want to connect to Event Hub Cloud from my desktop using tools such as Kafka Tool so an addition configuration step I’ll do before setting up anything else is to open-up Event Hub Cloud’s Kafka broker endpoint to the public internet using the Access Rules menu item in the Event Hub Cloud console.

Now I can see the Kafka service and the default topics that Event Hub Service created for me in Kafka tool.

I can either then use Kafka tool to create a new topic to start receiving the first of my data streams, the IoT device event data coming out of Samsung SmartThings, or create the topic as by defining a new Event Hub Cloud service from within the Event Hub Cloud Service — Dedicated console (confusing, but that’s how Kafka topics are named within Event Hub Cloud)

Then it’s just a case of directing the stream of IoT event data to the public Kafka broker endpoint exposed by Event Hub Cloud Service — Dedicated and then, after a short while, checking the metrics for the new Kafka topic that I setup to receive this incoming streaming data.

Getting the data off the Kafka topic and into a Hive table on the Big Data Cloud instance involved the following steps, using Oracle Cloud Infrastructure Object Storage Classic as the intermediate staging layer together with Event Hub Kafka Connect’s OCS Sink Connector:

  1. Configure Event Hub Kafka Connect OCS Sink Connector to push topic events to Oracle Cloud Infrastructure Object Storage Classic (OCS)
  2. Using Zeppelin notebook provided by Big Data Cloud Console, create a CRON job that copies those events across to HDFS storage
  3. Create Hive external tables with location clauses that point to the directories I’ve copied the event files into from OCS

Then, when I go and log into OAC Data Lake Edition and connect to the Hive Thrift Server on the Big Data Cloud instance I can see the Hive tables I’ve just created, and the data that’s now streaming through from my two initial sources via the Kafka service and Kafka Connect running on Event Hub Cloud Service — Dedicated.

In the second half of this two-post series I’ll go deeper into OAC Data Lake Edition and see how its additional transformation and analysis capabilities stack-up against OAC Standard Edition, and also see how it compares to the Oracle Big Data Discovery tool its looking to eventually replace.

Using Looker Data Actions to Make Monzo Spend Analysis More Interactive … and Actionable

The other week I posted a blog about analyzing the bank transaction data Monzo make available to customers via their developer API using Looker, the BI tool I use day-to-day in my role at Qubit as Analytics Product Manager.

One of the dashboards I’ve created shows me how spend looks across a number of merchant categories, compares their average transaction values and higlights the one I visit the most (The Ginger Dog in Brighton, thoroughly recommended) and plots each of the them on a map so I can remember where they’re each located.

If I click on one of the map icons on the Spend by Location visualization Looker then pops-up a dialog showing me the individual column values for this particular merchant — the Hardwood Arms in Fulham, one visit and total/average spend amounts of £75 exactly.

As well as tagging your transactions with metadata from the Foursquare web service Monzo also adds the Twitter handle for each merchant along with their website URL; I’ll start to make this dashboard a bit more interactive then by adding these as link parameters to my merchant_name dimension definition in my LookML view definition.

dimension: merchant_name {
group_label: "Monzo Transactions"
label: "Merchant Name"
type: string
sql: ${TABLE}.merchant_name ;;

link: {label:"Merchant Website"
url:"https://{{ fluentd_monzo.merchant_metadata_website._value }}"
icon_url: ""
link: {label:"Merchant Twitter Profile"
url:"{{ fluentd_monzo.merchant_metadata_twitter_id._value }}"
icon_url: ""

Monzo also provides Google Places API and Foursquare API merchant metadata lookup codes along with each transaction record, so I add two more link parameters to call these two web services.

link: {label:"View in Google Maps"
url: "{{ fluentd_monzo.merchant_address_latitude._value }},{{ fluentd_monzo.merchant_address_longitude._value }}&query_place_id={{ fluentd_monzo.merchant_metadata_google_places_id._value }}"
icon_url: ""
link: {label:"View in Foursquare"
url: "{{ fluentd_monzo.merchant_metadata_foursquare_website._value }}"
icon_url: ""

Now when I click on the name of a merchant within any data visualization using that data point I get a menu of links I can click on for more information.

Clicking on the View in Google Maps link passes across the merchant’s latitude, longitude and Google Places API key to the Google Maps web service, and Looker then opens a new browser tab and gives me all the information I’m looking for using the Google Places web service.

All of this is good to know but it’s only really of value if I can make this data actionable, for example by linking it to the Google Tasks web service I use to record personal, non-project related tasks and reminders. To do this I’m going to use Looker’s Data Actions feature together with Zapier’s integration with Google Tasks to tie it all together.

First, I create a Zapier Zap that listens for a webhook call and then uses the parameter values I provide from the Looker data action to to create a new Google Task. This article on the Looker Discourse website goes into more detail on how to set up Zapier webhooks integration with data actions if you’d like to set up something similar yourself.

Then I add an action parameter into my merchant LookML dimension definition that defines a form with default values, allows me to amend those values if needed and then HTTP POSTs them as the payload to the Zapier web service.

action: {
label: "Add Google Task"
url: ""
icon_url: ""
form_param: {
name: "Task Name"
type: string
required: yes
default: "{{ merchant_name }}"
form_param: {
name: "Due Date"
type: string
required: yes
form_param: {
name: "Action"
type: textarea
default: "Book table at {{ merchant_name }}"
form_param: {
name: "Urgency"
type: select
option: {
name: "Low"
label: "Low"
option: {
name: "High"
label: "High"

Now, another menu link item is displayed to allow me to create Google Task items when I click on merchants in a Monzo transactions dashboard, like this:

When I click on that link a dialog is then shown, giving me the option to amend the task title and details and add a due date and urgency to the task I’m about to define.

Then finally, I press Submit and then go over to the Google Tasks web interface to see my new task recorded and ready for me to action tomorrow morning.

Updates to Oracle Analytics Cloud, Oracle BIEE 12c and Oracle DV Desktop

Late last year I covered one of the first releases of Oracle Analytics Cloud (OAC) on this blog covering the v3 release of OAC; since then the product has had a major UI refresh with OAC v4 and so I thought it’d be worth covering this new release along with updates to OBIEE12c and DV Desktop in a follow-up to last year’s post.

OAC’s original look and feel was based on OBIEE12c v1 that itself contained the first iteration of Oracle Data Visualisation (DV) and whilst the new modern flat UI that release introduced was an improvement on the 11g release before it since then Oracle have rapidly iterated with Oracle DV Desktop and DV Cloud Service and the UI from those products is now the primary user interface for Oracle Analytics Cloud v4.

If you’ve not used Oracle Analytics since the OAC v3 and the initial OBIEE12c versions you’ll be surprised how much the user experience has changed since then; when OBIEE12c first came out most of us still considered Answers and Dashboards to be the primary reporting UI with the BI Repository being the central carefully governed and dimensionally-modelled source of all reporting data.

Since then the world has moved-on and analytics is now all about empowering the end-user through self-service, easy-to-use tools that enable you to do most tasks without having to involve the IT department or do lots of up-front data modeling and data onboarding. Oracle DV Desktop started meeting this need by introducing basic data preparation features into the product so users could upload spreadsheets and other other data and do some basic formatting and tidying before analyzing it on their desktop, and those data upload and preparation features are now available for OAC users in the v4 release. To upload a spreadsheet file into OAC you now just drop it onto the web page and you’re then given a preview and the option to upload its contents to the analytics instance.

After that you can trim, split, derive and change the datatypes of data you upload so that it works in the most optimal way with the analysis features of the product, for example by turning dates in string formats not recognized by Oracle DV into proper date datatypes that can then be used in time-series analysis.

For data transformation needs that go beyond basic formatting and datatype changes you can now build ETL-style routines that filter, aggregate, transform and add external datasets all using simple point-and-click user tools, and if you’ve licensed the new OAC Data Lake Edition these transformations then extend to include model training, forecasting and sentiment analysis. In the example below I’m using OAC Data Lake Edition to forecast weight forward from a set of smart scale readings that’s then executed at the BI Server level using Python maths and statistics libraries.

I’ll actually be talking about OAC Data Lake Edition in my presentation at ODTUG KScope’18 at Walt Disney World Florida this June, where I’ll cover some of these extended data flow features along with Oracle Big Data Cloud and Oracle Event Hub Cloud Service in my session “BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition”.

Cloud of course is great but not every Oracle Analytics customers has made the move yet, or like myself you might have a cloud instance you can spin-up as-and-when you need it but then use regular on-premises OBIEE12c in a VM for your day-to-day learning and experimentation.

With Oracle’s cloud-first strategy and focus on Oracle DV as the main innovation path for analytics features this meant that my OBIEE12c instance running in a Google Compute Engine VM was starting to lag behind OAC v4 and DV Desktop in-terms of look-and-feel and all of these new features, and so I was pleased to note the other day that the (aka v4) release of OBIEE12c is now available for download on OTN that includes many of the features currently available in OAC v4 and DV Desktop (but not the advanced data flow features in OAC Data Lake Edition, for example)

Oracle Data Visualization Desktop has also been updated recently to connect to Oracle Autonomous Data Warehouse Cloud and I’ll be covering that new capability in a blog post in the near future; but more importantly for me at least….

… it also now supports Mac OS X 10.13 High Sierra, great news for us Mac users who had to stop using DV Desktop on our laptops late last year when that most recent update to OS X came out but DV Desktop didn’t yet support it.

Connecting Looker to Oracle Autonomous Data Warehouse Cloud

Earlier in the week I wrote-up my first impressions of Oracle Autonomous Data Warehouse Cloud (ADWC) on this blog, and said at the end I’d follow with another post on how to connect ADWC to Looker, the cloud-based BI tool I usually query Google BigQuery with in my role as Analytics Product Manager at Qubit.

For this initial example I uploaded a few tables of workout and other data into ADWC using SQL Developer, exporting the relevant tables out of BigQuery in CSV format and then uploading them into a user account I created for the purpose in AWDC. The table I’ll concentrate on for this example is the STRAVA_RIDES table, each row detailing an individual cycle workout as recorded through the Strava smartphone app.

The first step in connecting-up Looker to ADWC, or indeed any Oracle Database, is to run some scripts that set up Oracle as Looker expects to find it. Unlike most other BI tools I’ve used with Oracle, Looker expects to connect through a specific user login (“LOOKER”) that is then granted SELECT access to any tables and views you want to report on in other schemas. This user login also needs to have some views and synonyms created to give it access to the V$ system views within Oracle that report on active sessions, and an ability to kill long-running sessions through a PL/SQL package that calls ALTER SYSTEM … KILL SESSION.

The commands to run for regular Oracle databases are detailed on the Looker website but you need to alter them slightly to use ADWC’s superuser account name (“ADMIN”) instead of SYS when initially connecting and when creating the LOOKER_HASH function for symmetric aggregate handling, along with various other changes due to differences in how various objects are named in AWDS vs. regular Oracle Database.

I’ve listed the commands I ran on my ADWC instance below, they should work for you but if not then check out the “Autonomous Data Warehouse Cloud for Experienced Oracle Database Users” section in Using Autonomous Data Warehouse Cloud that explains the differences between the new autonomous and regular Oracle Database server versions.

First create the Looker account and grant the relevant roles and priviledges:

connect ADMIN/<<your_admin_password>>
create user LOOKER identified by <<new_looker_account_password>>;
alter user LOOKER
default tablespace DATA
temporary tablespace TEMP
account unlock;

alter user LOOKER quota unlimited on DATA;
alter user LOOKER default role RESOURCE;
grant select on -- <all tables that will be used by looker>;

Now create the views that Looker uses to understand what sessions are active and the SQL that’s currently being executed to provide data for looks and dashboard tiles:

create or replace view LOOKER_SQL 
from V$SQL sql ,v$session sess
create or replace synonym LOOKER.LOOKER_SQL for LOOKER_SQL;
create or replace view LOOKER_SESSION as 

Next, create the Oracle PL/SQL function that Looker uses as part of symmetric aggregate handling, and a function that Looker can use to “kill” runaway database queries that are taking too long to return results back to you.

create or replace function LOOKER_HASH(bytes raw, prec number)   return raw as   
return(DBMS_CRYPTO.HASH(bytes, prec));
create or replace synonym LOOKER.LOOKER_HASH for LOOKER_HASH;
grant execute on LOOKER.LOOKER_HASH to LOOKER;  
grant execute on ADMIN.LOOKER_HASH to LOOKER;
create or replace procedure LOOKER_KILL_QUERY(P_SID in VARCHAR2,
CURSOR_NAME pls_integer default dbms_sql.open_cursor;
IGNORE pls_integer;
select COUNT(*) into IGNORE
and SID = P_SID
if (IGNORE = 1)
'alter system kill session '''
|| P_SID || ',' || P_SERIAL# || '''',
IGNORE := dbms_sql.execute(CURSOR_NAME);
'You do not own session ''' ||
P_SID || ',' || P_SERIAL# ||
end if;
create or replace synonym LOOKER.LOOKER_KILL_QUERY 

Next over to the Looker configuration. You’ll need to be on the Looker 5.12.12 or higher release with an instance hosted in the US to get the integration working as of the time of writing so that “ADWC” is listed as a connection type and the ADWC wallet integration works; if you’re running Looker as a hosted instance you’ll also need to speak with support to have them copy across the wallet files to the correct location on the Looker server.

To create the connection, enter the following details:

  • Name : Name of your connection, e.g. “rittman_adwc”
  • Dialect : Oracle ADWC (only appears with Looker 5.12.12+)
  • Host:Port : from the TNSNAMES.ORA file in your ADWC wallet zip file
  • Username : LOOKER (as per the account setup in previous steps)
  • Password : password of LOOKER account
  • Temp Database : LOOKER (as per previous steps)
  • Persistent Derived Tables : checked
  • Service Name : From TNSNAMES.ORA in your ADWC wallet zip file
  • Additional Database Params : TNSNAMES.ORA SSL Server Cert DN

To show the Service Name and Additional Database Params fields you first have to save the connection, then tick the “Use TNS” checkbox to reveal the fields. To find your host:port, service name and SSL Server Cert DN values first download the wallet zip file for your ADWC instance from the ADWC Service Console, unzip the archive and then locate the details you need in the TNSNAMES.ORA file as shown below. In my case I chose to use the “medium” ADWC instance type for my connection settings.

Then, save and test your connection. The step that checks that persistent derived tables will probably fail if you try this around the time of my writing as there’s a known bug in the step that checks this feature, it’ll no doubt be fixed soon but if the rest of the checks pass you should be good.

Finally, it’s just then a case of importing your table metadata into Looker and creating explores and a model as you’d do with any other data source, like this:

In this instance I’ve updated the Strava Rides LookML view to turn the relevant metric fields into measures, define a primary key for the view and remove or hide fields that aren’t relevant to my analysis, like this:

Now I can start to analyze my Strava workout data I previously uploaded to Oracle Autonomous Data Warehouse, starting with average cadence and speed along with total distance and Strava’s “suffer score” for each of my workouts:

and then looking to see how much correlation there is between distance and personal strava records being broken on my five longest rides.

In the background, Looker is sending Oracle and ADWC-specific SQL to Oracle Autonomous Data Warehouse Cloud, with the SQL tab in the Explore interface showing me the actual SQL for each query as its sent.

Should I wish to check how much of the storage and CPU capacity available for my ADWC instance is being used, I can do this from ADWC’s Service Console.

So there you have it — Looker powered by an Oracle Autonomous Data Warehouse Cloud, and no need for an Oracle DBA to get it all running for you.

First Impressions of Oracle Autonomous Data Warehouse Cloud

Regular readers of this blog will know that my day-to-day work involves ingesting, storing and then analyzing petabytes of data using Google BigQuery, one of a new generation of fully-managed, cloud-hosted data warehouse platforms-as-a-service. Services like BigQuery take care of all the database infrastructure management and capacity planning leaving you just to define tables, load data into them and then run queries using SQL or your favourite BI tool.

Oracle recently announced Oracle Autonomous Data Warehouse Cloud (ADWC) that competes in this market and given my background in Oracle technology I was interested to see how it stacked-up compared to BigQuery and similar services from the likes of Amazon AWS and Snowflake. To do this I took advantage of Oracle’s free $300 service credit offer and signed-up for an ADWC instance – note that at the time of writing the ADWC service is only available in Oracle Public Cloud’s North American datacenter region as opposed to the European ones so make sure you choose the former when creating your Oracle Cloud account.

Setting up an Oracle on-premises database was something once-upon-a-time I’d spent months and thousands on course fees studying for certifications to be able to do confidently, so it was nice to see how things had got a lot simpler with setting up an ADWC instance with the first step being to select ADWC from the Oracle Cloud Services menu, like this:

Then answer a few simple questions: the name to give the instance, number of CPUs and amount of storage to set aside initially, and the password for the administrator account.

Setup of an instance takes a few minutes, then its ready. Not bad.

The next big change you notice compared to Oracle’s customer-managed Database Cloud Service is the replacement of Enterprise Manager Database Express 12c and DBaaS Console with a single Service Console, accessed not through the usual SYS and SYSTEM superuser accounts but through a new superuser account called ADMIN, presumably adopting a naming standard more familiar to users of other SaaS and PaaS platforms.

The Service Console contains a couple of performance-tracking pages with the emphasis on read-only viewing of those stats vs. giving you controls to fine-tune database parameters or table structures, and a page where you can download a zip file containing a wallet directory with TNSNAMES.ORA and encrypted connection details to your ADWC instance.

I seem to remember at the time of Autonomous Data Warehouse Cloud’s launch that many in the Oracle community (myself included, most probably) said that all that the “autonomous” part of Autonomous Data Warehouse Cloud did was automate the Oracle Database maintenance tasks that more recent data warehousing platforms-as-a-service handled for you automatically. Another common critique at that time was that under the covers ADWC was more scripting and use of existing “advisors” than true autonomous artificial intelligence-style database management.

To be honest though, it doesn’t matter; compared to running a regular Oracle database you just fill in a short form and the platform runs without any further intervention from you, you don’t need to know how it works, job done in terms of competing with BigQuery and Redshift for ease-of-use and maintenance.

Connecting to ADWC is a bit different to regular Oracle Database connections in that it’s easier — no need to specify host, port and SID/service name — but this involves downloading a ZIP file wallet and the tool you wish to connect having support for that wallet. Oracle SQL Developer in its most recent 18c version has that support making it the obvious tool to define new users, create tables and upload data into them and then run simple SQL queries against your data.

Oracle ADWC is based on Oracle Exadata Database Server technology and supports PL/SQL, full INSERT/UPDATE/DELETE support in SQL as well as SELECT, parallel query (enabled by default) and hybrid columnar compression.

What you don’t get are indexes and materialized views as these generally aren’t considered necessary when running on Exadata hardware that’s optimized for full table scans of data arranged in compressed columnar format, and as ADWC charges per CPU per hour there’s no penalty in scanning lots of detail-level data unnecessarily as you get with BigQuery.

Also missing are features such as the Oracle OLAP Option, Oracle Spatial and Graph and Oracle Text that I’m guessing might get implemented at a later date, along with management features such as Oracle Tuning Pack that are no longer needed when the autonomous features of the platform do the database tuning for you.

Price-wise, although as I said a moment ago you do get charged by the amount of data you scan (and a bit for storage) with Google BigQuery, Oracle AWDC seems to be more expensive to run at least at the data volumes I’d be using as a developer initially evaluating the product.

If you commit to monthly charging then ADWC costs around $1200/cpu/month whereas Google charge $5/TB for queries with the first 1TB each month free plus a token amount for storage and streaming inserts, which meant my bill last month for BigQuery came to just $1.53 and my total bill for all Google Cloud Platform services including multiple hosted VMs and extensive usage of ther NLP and Geocoding APIs took the total up to just over $150. Clearly Google are aiming more at the startup market whereas Oracle are serving enterprise customers, but don’t make the assumption that just because AWDC is elastic it’s also super-cheap for hobbyist use, long-term.

So now its a case of loading some sample data into ADWC using SQL Developer and then running a couple of test queries to see it working in action with my data.

Tomorrow, I’ll show how Looker be connected using the newly announced support it has for Oracle ADWC and later on in the week do the same for Oracle Analytics Cloud and DV Desktop.

Timeline Charts, Derived Tables and Analytic Functions in Looker 5

One of the data streams that comes into my personal Google BigQuery data warehouse is a record of all the Hue light switch settings and illuminance levels, Nest thermostat settings and motion sensor readings coming in via the Samsung Smart Things hub under the stairs back home. For the Hue lightbulbs I get a stream 1s and 0s that initially get stored in a four-column table in BigQuery, like this:

I then create a SQL view over that table that joins to a lookup table containing device types and room allocated for each of the devices in the data stream, and sets a metric type for each recording so that I can analyze all the thermostat readings together, bring together all the light switch settings and so on.

Its then quite straightforward to bring this SQL view into Looker and add it to the rest of the data streams I’ve brought together there so I can see, for example, whether the kitchen lights get turned-off after everyone’s gone upstairs after breakfast or whether I should try and automate those as well as my kettle.

Looker 5 introduced another way to visualize data with defined start and end events in the form of the new Timeline Chart, a feature I tweeted about the other day when I brought data on tasks from Asana into my BigQuery data warehouse via a Fivetran integration.

It’d be interesting to use this type of chart to show how one particular light switch changed from on to off and then on again over a period of time, or take all the Hue lights for a given part of the house and see whether anybody turns the lights off ever … or whether they need to be linked to the motion sensors and automatically turned-off after a period of inactivity in each room.

To use this type of chart I first need to create an “end date” field for each of my event records, so each timeline has a start and end date with other fields being optional for label and value. Looking at the list of events for the light switch I looked at earlier and ordering the events by time recorded you can see the information I need is there:

device = 'kitchen seating area spot 1'
AND raw_value IN ('off','on')

What this calls for is the LEAD BigQuery Standard SQL analytic function that returns a value based on a subsequent row to the one we’re working with, and I can combine that with the TIMESTAMP_DIFF function to compute the time between the start and end event dates as that would be useful to know too. The SQL expressions for the two new fields would therefore be:

LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) end_date_time,
timestamp_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time),date_time,second) as value_duration_seconds

I could add these column expressions into the SQL view definition in BigQuery and order the rows by device, metric and timestamp, like this:

date_time as date_time,
device as device,
LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) as end_date_time,
timestamp_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time),date_time,second) as value_duration_seconds
from `aerial-vehicle-148023.personal_metrics.smartthings_readings`
where device = 'kitchen seating area spot 1'
and metric = 'switch'
and raw_value in ('on','off')
order by 2,3,1

Executing that SQL shows the logic is working, but I’d then have to maintain that view within BigQuery and that might not be the most convenient place to add new code.

Instead, I could just go into Looker and create a new view there based on a derived table SQL expression and do the work there. Right now my LookML model looks like the excerpt below, where you can see the sensor readings view joined into the rest of the explore so all my data can be analyzed together.

connection: "rittman_bigquery"
include: "*.view.lkml" # include all views in this project
include: "*.dashboard.lookml" # include all dashboards in this project
explore: date_dim {
case_sensitive: no
label: "Data Warehouse"
join: fluentd_transactions {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
relationship: many_to_many
join: smartthings_readings {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
relationship: one_to_many
join: fluentd_uber_rides {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
relationship: many_to_many

I now create a new LookML view that uses a derived table SQL query as the view definition rather than simply referencing an existing BigQuery table. Note how I’ve used the same view_label as the main LookML view containing my event data, so that the dimensions and metric I define here appear alongside all the other smart device fields the same explore view.

view: device_event_end_and_timespan {
view_label: "6 - Smart Devices"
derived_table: {
date_time as date_time,
device as device,
metric as metric,
LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) as end_date_time,
timestamp_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time),date_time,second) as value_duration_seconds
from `aerial-vehicle-148023.personal_metrics.smartthings_readings`
order by 2,3,1;;
dimension: date_time {
type: date_time
hidden: yes
sql: ${TABLE}.date_time ;;
dimension: device {
type: string
hidden: yes
sql: ${TABLE}.device ;;
dimension: metric {
type: string
hidden: yes
sql: ${TABLE}.metric ;;
dimension_group: end_date_time {
group_label: "End Date"
label: "End"
type: time
timeframes: [
sql: ${TABLE}.end_date_time ;;
dimension: pk {
primary_key: yes
hidden: yes
type: string
sql: concat(cast(${TABLE}.start_date_time as string), ${TABLE}.device) ;;
measure: value_duration_seconds {
type: average
sql: ${TABLE}.value_duration_seconds ;;

Then I join this derived table view back into the explore within my LookML model, as below:

connection: "rittman_bigquery"
include: "*.view.lkml" # include all views in this project
include: "*.dashboard.lookml" # include all dashboards in this project
explore: date_dim {
case_sensitive: no
label: "Data Warehouse"
join: fluentd_transactions {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
relationship: many_to_many
join: smartthings_readings {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
relationship: one_to_many
join: device_event_end_and_timespan {
type: left_outer
sql_on: ${smartthings_readings.date_time} = ${device_event_end_and_timespan.date_time} and
${smartthings_readings.device} = ${device_event_end_and_timespan.device} and
${smartthings_readings.metric} = ${device_event_end_and_timespan.metric};;
relationship: one_to_one
join: fluentd_uber_rides {
type: left_outer
sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
relationship: many_to_many

and now I can create timeline charts that show me which lights were on over a particular period of time, like this:

or more importantly, work out why the bathroom never seems to be free when you’ve got two teenage kids in the house.