Posts tagged Big Data
Date Partitioning and Table Clustering in Google BigQuery (and Looker PDTs)

Google BigQuery is a data warehousing-oriented “table-as-a-service” product from Google Cloud Platform that, like Oracle Exadata, is optimized for full-table scans rather than selective row access, and stores data organized into columns rather than rows to better suit the filtering and aggregation patterns typical of data warehousing workloads.

BigQuery charges for the amount of data stored in your tables and for the data read when executing SQL queries against those tables, so BI tools such as Looker that work efficiently with BigQuery request only the columns they need when querying BigQuery datasets, rather than running a SELECT * and throwing away what’s not needed.

To illustrate this let’s run a query that requests all columns (“SELECT * FROM …”) from a BigQuery table; as you can see from the screenshot below it reads through all 1.47GB of table data to return that full set of columns to the user.


If the user’s query only really needed two of those columns, requesting just those brings the amount of data read down to just under 10MB, as most of that table’s data is stored in columns that aren’t needed for the query.
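To make the difference concrete, here’s a minimal sketch of the two queries; the table and column names are placeholders rather than the actual dataset shown in the screenshots.

-- Scans every column in the table (all ~1.47GB in the example above)
SELECT *
FROM `mydataset.pageviews`;

-- Scans only the two columns the query actually needs (a few MB)
SELECT user_id, pageview_ts
FROM `mydataset.pageviews`;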


BigQuery has historically supported table partitioning based on the date you loaded data into the table, which sounds great until you realise that it’s the transaction date, not the ingest date, that most of your users’ queries filter against.

You could also use table decorators in Legacy SQL to point to the particular day partition your data was stored within, but this only went back a maximum of seven days and required your query tool to support this non-standard feature of SQL. Earlier this year, though, Google introduced a more flexible form of date partitioning as a beta release feature that allows you to choose the date column your table is partitioned by, and more recently introduced a feature called table clustering that stores data within a table sorted by the columns you most commonly filter on when running queries against it.

To show how date partitioning and table clustering work, I’ll start by running a query to return just a month’s data from the five years of data held within my table; as you can see in the screenshot below, BigQuery performs a full table scan and reads through all 1.37 GB of data in the table to return just the month of data my query requested.


Standard SQL now supports DDL commands such as CREATE TABLE and CREATE TABLE … AS SELECT, along with a PARTITION BY clause that lets you specify a timestamp or date column to partition the table by. I’ll use these new features to create a copy of that same table, this time partitioned by the timestamp column I’m filtering on in my query predicate …
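In sketch form, with placeholder table and column names rather than my actual ones, the statement looks something like this:

-- Illustrative names; partitions the copy by the day of each event timestamp
CREATE TABLE `mydataset.pageviews_partitioned`
PARTITION BY DATE(pageview_ts)
AS
SELECT *
FROM `mydataset.pageviews`;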


… and the DDL statement fails. What’s happened there then?

It turns out that BigQuery tables are currently limited to 2,500 partitions, with any single load operation limited to 2,000 (hence the error). With partitioning limited to date and timestamp columns, and partitions a day in length, any one table can only hold around five years’ worth of data; beyond that you’ll need to create multiple date-partitioned tables and UNION them together through a SQL view.
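That workaround is just a UNION ALL over the individual tables, saved as a view; a minimal sketch with illustrative table names would be:

-- Each underlying table holds up to roughly five years of day partitions
SELECT * FROM `mydataset.events_2008_to_2012`
UNION ALL
SELECT * FROM `mydataset.events_2013_to_2017`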

For now though I load my table with just five years of data and then re-run the query that requests a single day from that overall five years; now BigQuery has only read and processed 57 MB of data and it’d be a fraction of that if I only requested the two columns I needed, not all columns from that table.


But what about queries that filter against the other columns in this table? We can’t set up table partitioning on STRING, INTEGER or any other column datatype, so my original query, if re-run against the date-partitioned table, reads just as much data as it did before.


What we could do is re-create the table with its data pre-sorted by those two particular columns using another new feature called table clustering, so that queries filtering against those columns find the blocks of data they’re looking for faster and can skip entirely the blocks that don’t contain relevant data.

If, like me, you’re more familiar with databases such as Oracle, “table clustering” is all about storing data from tables sharing a common cluster key together in the same data blocks, so that queries against that group of tables filtering on that key return data faster.

Table clustering in BigQuery is more analogous to loading regular Oracle tables using data from a pre-sorted file and comes with the same benefits and limitations; in BigQuery’s case it takes care of the pre-sorting and table configuration for you but the same limitations still apply around how you filter the table and what happens when you load more data afterwards.

Let’s set up a clustered table now that stores its data ordered by the two columns used in the query I ran a few moments ago.
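In DDL terms that looks something like the sketch below, again with placeholder table and column names; at the time of writing clustering is only available on date-partitioned tables, with the CLUSTER BY clause following PARTITION BY.

-- Illustrative names; data is stored partitioned by day and sorted by the two cluster columns
CREATE TABLE `mydataset.pageviews_clustered`
PARTITION BY DATE(pageview_ts)
CLUSTER BY device_type, country
AS
SELECT *
FROM `mydataset.pageviews`;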


Now when I run a query filtering on those columns against this partitioned, clustered table the amount of data read goes down compared to before, and results are returned a bit faster; if I included the partition key column in the query as well, returning just a few days’ data, it’d be faster still.


But queries that filter using any of the other columns don’t benefit from this clustering; moreover, the benefit of the sorting applied when a clustered table is first loaded is lost over time as you load more (unsorted) data into it, meaning that to really benefit from clustering a table you have to rebuild it regularly using a CREATE TABLE … AS SELECT.
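One way to do that rebuild, sketched below with the same placeholder names, is to create a freshly-sorted copy of the clustered table that you then swap in for the original.

-- Re-sorts the existing clustered table's data into a new clustered table
CREATE TABLE `mydataset.pageviews_clustered_rebuilt`
PARTITION BY DATE(pageview_ts)
CLUSTER BY device_type, country
AS
SELECT *
FROM `mydataset.pageviews_clustered`;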

Table clustering in BigQuery is nice to have but, similar to pre-sorting a direct path load into Oracle database tables, it’ll take a lot of rebuilding and careful querying to get the most benefit from, and with the size of most BigQuery tables I doubt that rebuilding will happen much in practice.

BI tools such as Looker can make use of table partitioning and clustering in queries straight away, as no changes are required in the query SQL you write; everything is handled in the table definition. Where you might want to set up partitioning yourself as a Looker developer is for the persistent derived tables (PDTs) that Looker can create to materialize the results of a view you define in your Looker model, for example to derive a dimension attribute using a subquery that calculates the sequence number of a user’s order for retention analysis on an eCommerce dashboard, as shown in the screenshot below.


Looker has for some time come with database-specific settings for particular source database types, such as the Redshift-specific settings used in the example above, and now supports date partitioning for persistent derived tables through a new partition_keys setting, as announced in this recent Looker forum post.
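Below is a minimal sketch of how that setting might be used in a BigQuery PDT definition; the view, table and column names and the persistence trigger are illustrative rather than taken from the example above.

view: user_order_sequence {
  derived_table: {
    # Hypothetical PDT: names and trigger are placeholders
    sql: SELECT user_id,
                DATE(order_ts) AS order_date,
                ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_ts) AS order_sequence
         FROM `mydataset.orders` ;;
    sql_trigger_value: SELECT CURRENT_DATE() ;;
    partition_keys: ["order_date"]
  }

  dimension: order_sequence {
    type: number
    sql: ${TABLE}.order_sequence ;;
  }
}

When Looker next rebuilds the PDT it should then create the backing table in BigQuery partitioned on that date column, so queries against the PDT that filter on it only scan the relevant partitions.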


Finally, if you’re interested in how partitioning is developing as a feature within BigQuery, and in some of the edge-cases and requests for additional functionality that users of partitioning are asking for, this feature request on the BigQuery issue tracker is useful to read through.

Extending support to more than 2,500 partitions seems to be the most popular request, along with allowing integer and string datatype columns to be used for the partition key, but also look out for issues around re-loading data into historic partitions and the cost and work involved in rebuilding large tables to re-cluster their data or apply partitioning for the first time.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.3

In this series of three blogs on Oracle Analytics Cloud Data Lake Edition I’ve set up an object store data lake in Oracle Cloud using Oracle Big Data Cloud and Oracle Storage Cloud, and ingested streams of real-time event data from IoT and social media sources into Oracle Cloud’s object storage service using Oracle Event Hub Cloud Service.

The event-stream data I staged into Storage Cloud was then copied into Parquet files on HDFS and presented out to BI and ETL tools through Big Data Cloud’s Thrift Server interface, so that now I’m ready, after a short diversion into defining the data engineer role that would typically work with this new product edition, to start exploring some of Oracle Analytics Cloud Data Lake Edition’s new data flow and predictive model preparation features.

The diagram below shows where OAC Data Lake Edition fits into my project architecture, performing the tasks of transforming and enriching the incoming dataset and then presenting my at-scale data out to end-users for analysis using OAC Data Lake Edition’s Data Visualization features.

Looking at the homepage within OAC Data Lake Edition I can see my two Hive tables listed within the dataset catalog, alongside other datasets I’d uploaded directly into OAC. This visual catalog of available datasets is also the new homepage interface that OBIEE12c 12.2.1.4.0 now adopts, with both cloud and on-premises versions of Oracle’s BI tools now relegating the old “Answers” homepage to something you have to dig around and specifically look for, in favour of this more self-service Data Visualization starting page.

I’ll have to write a blog post sometime about Answers, how powerful its interface is and the full dimensional model it exposes from the Oracle BI Repository, as it’s almost in danger of being forgotten about.

Moving on though, the first transformation I need to do on all the incoming datasets is to take the timestamp column in each table, convert its values into a format that OAC recognises as a valid TIMESTAMP, and then change those columns to the TIMESTAMP datatype so that DV can automatically enable time-series analysis by day, month, quarter, hour and so on. I do that using a feature that’s also present in OAC Standard Edition, the lightweight data preparation interface that’s presented to users when they first add a new data source into OAC’s dataset catalog, shown in the screenshots below.

Where OAC Data Lake Edition gets really interesting right now, both in terms of differences vs. the on-premises versions of OBIEE I used to use and in terms of its “data engineering” potential, is with a feature called Data Flows.


Most self-service BI tools now have a basic data loading and data preparation capability, with Tableau Data Prep being one of the latest examples. Designed to handle more complex data prep use-cases than basic datatype changes and field-splitting, they give end-users the ability to do this type of work themselves rather than trying to do it in Excel, or handing the work off to the IT department and having to wait days or weeks to get that data back.

Data Flows are a feature introduced since the original on-premises version of OBIEE12c that I last used when working out in consulting, and provide what’s effectively a lightweight, multi-step ETL tool that executes transformations using the BI Server’s Model Extension feature, introduced back when OBIEE12c first came out as the mechanism for enabling on-the-fly data mashups between server-side and user-uploaded datasets.

Looking at the transformation operators available in OAC Data Lake Edition v4, there are quite a few that apply to data lake and data engineering-type workloads, including running Python statistical analysis scripts and performing predictive model training and model build; there’s also an operator for creating an Essbase cube, with Essbase in this instance positioned as a fast ad-hoc analysis back-end for the data visualization part of OAC.

For now though there are two transformation tasks I want to do with my Hive datasets: first, enrich the incoming social media data by analyzing the sentiment in each tweet and writing the data, plus this sentiment tagging, back to the Oracle Big Data Cloud environment, so that I can then turn those sentiment tags into a score and create a data visualization showing who sends me the most tweets and how crazy they are overall.

The second data enrichment I wanted was on some Strava cycling workout data I’d uploaded directly into OAC using the CSV file upload facility; using the model train and build Data Flow operators I defined a model to predict how many “kudos”, the Strava equivalent to Facebook “likes”, I’d get for a given cycle workout with a number of different variables available to the model in order to make the prediction — for example, distance and elevation gain, map location, effort expended and so on.

Then, after running the model build step and looking at the predicted values and the actual ones for the remainder of the dataset not used for model training, you can see the predicted kudos values are fairly in-line with the ones I actually recorded for those rides.


Another feature that’s now in Oracle Analytics Cloud is automated data diagnostics, or Explain. Explain uses machine-learning libraries and that same model extension/XSA BI Server framework to help users quickly understand the value distribution and statistically correlated driving factors for a particular dataset, and learn which segments or cohorts have the highest predictive significance. Enabled by a bunch of extensions to BI Server logical SQL, I used the feature first on the sentiment scoring I’d performed earlier on, and then on the steps data I’d brought into Oracle Big Data Cloud from my Fitbit device, after converting the numeric step counts into a text attribute by bucketing its values into low, medium and extreme buckets.

This is pretty powerful stuff, with automated understanding and context-gaining about new datasets being one of the most user-enabling features I’ve seen arrive recently in BI tools; the best example of this is BeyondCore, now part of Salesforce Einstein. OAC lets the user pick the most useful of the Explain facts and driver insights and publish them to a Data Visualization dashboard like the one below, showing the most predictive and significant variables in my dataset that influence the steps I take each day.

Which leads neatly to the final “data at-scale” feature in OAC, the Data Visualization feature that in my case is querying the ingested, transformed and now enriched datasets I’ve got running on my Oracle Big Data Cloud instance alongside Oracle Event Hub Cloud and Oracle Analytics Cloud Data Lake Edition.


Thank you once again to the Oracle ACE Director program for providing access to Oracle Analytics Cloud Data Lake Edition, Oracle Big Data Cloud and Oracle Event Hub Cloud services over the past few weeks. If you’re looking to try these new services out there are free trials available for most of Oracle’s Cloud products, and many of the new features are also available in Oracle Data Visualization Desktop 12c and Oracle Business Intelligence 12c, both of which can be downloaded for training and evaluation under the OTN license scheme.


Wrapping-up this three part series on Oracle Analytics Cloud Data Lake Edition and Oracle Big Data Cloud I’d like to go back to the two (serious) questions I asked myself at the end of the previous post:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product, Oracle Big Data Discovery, officially still available and not quite dead yet but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask, and he’s actually gone to Cloudera?

So has Oracle Analytics Cloud Data Lake Edition got much to do with actual “data lakes”? Well … it integrates with Oracle Big Data Cloud and apparently comes with an option to run those data flow transformations in Big Data Cloud’s Apache Spark environment, though to be fully transparent I didn’t see that as an option when doing my evaluation so can’t comment on how well or not that works.

Like Oracle Big Data Discovery before it, OAC Data Lake Edition makes you structure your incoming event stream data with Hive table metadata before you can work with it, but that’s actually fairly standard practice with most data visualization tools that work with Hadoop and data lake environments.

Having Essbase in this product package, alongside the data lake functionality, did make me scratch my head a bit and wonder “why?”; data lakes and Essbase are about as opposite as you can get in terms of target users and use-cases, and I think this Data Lake Edition is as much about creating a product package and price point mid-way between the OAC Standard and Enterprise Editions as it is about data lakes.

But there is some logic to having Essbase in this edition; it provides a set of easy-to-use loading and preparation tools for Essbase making it easier for customers new to that product to start using it, and Essbase with its latest hybrid ASO/BSO storage format is surprisingly scalable and blindingly-fast to query, a great potential back-end for enabling data analysis “at-scale” using Oracle Analytics Cloud’s data visualization features.

I also get the feeling that this initial v4 version of OAC Data Lake Edition is more of a first-cut release to get something out to customers, establish the product package and validate the roadmap and market assumptions. Oracle Analytics Cloud v5 isn’t too far off and I’d expect incremental improvements and new features in areas such as natural language processing and machine learning built into the developer experience; I wouldn’t be surprised to see Oracle Big Data Preparation Cloud making its way into the product given its obvious fit and overlap with Data Lake Edition’s data prep features.

But where I really see an interesting future for OAC Data Lake Edition is when it starts to integrate product features and the development team from Oracle’s recent acquisition of Sparkline Data.

I came across SNAP, Sparkline’s platform for building OLAP-style dimensional models over data lakes and cloud object storage layers, about a year ago when researching analytics platforms at Qubit and, quite frankly, it’s as revolutionary in terms of today’s data lake analytics market as OBIEE (or the nQuire Server) was back in 1995 with its virtual data warehouse over application and data warehouse sources.

Take these two slides from the Sparkline website and imagine them as the future of Oracle Analytics Cloud analyzing event-streams and big data cloud-hosted datasets…

and you can see why I’m keen to see where Oracle Analytics Cloud Data Lake Edition goes over the next couple of years. I’m speaking on OAC Data Lake Edition at ODTUG KScope’18 in Orlando in just a couple of weeks’ time, so come along if you’re there; it should be an interesting talk.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.2

A couple of weeks ago I posted a blog on Oracle Analytics Cloud Data Lake Edition and the Oracle Cloud big data products you’d typically use to create a cloud-hosted, object storage data lake. In this and the final post in the series I’m going to look at some of the data transformation and predictive model building features that come with the Data Lake Edition of OAC and extend the base capabilities that OAC Standard Edition comes with, and at the data engineering role that this Oracle Analytics Cloud product packaging option is designed for.

As a reminder, OAC Standard Edition is more-or-less Oracle Data Visualization running as a single-tenant cloud-hosted VM (cluster) that’s either customer managed, or Oracle-managed if you go for the new Autonomous Analytics Cloud that interestingly is charged at a lower hourly rate than the customer-managed version; presumably it costs Oracle less to run OAC when it’s doing all the administration rather than letting customers control the settings, which makes sense.

It does make you wonder when all this VM nonsense will go away, though, along with the distinction between OAC and the DVCS/BICS managed-service cloud analytics products, with all of these services just becoming packaging options for a multitenant service partitioned by customer instance, as the likes of Looker and Google Cloud Platform do today.

But for now, Oracle Analytics Cloud Data Lake Edition is described on the product homepage as having the following high-level capabilities compared to OAC Standard Edition:

  • Accelerated analysis via Essbase
  • Support for creating scalable, shareable, high-performance data flows.

whilst the product documentation states that it includes all the OAC Standard Edition features, plus:

  • Explore and replicate data
  • Create scalable data flows including ingestion, transformation, and persistence
  • Execute data flows by using Apache Spark or Oracle Database Cloud

with two edition-specific scenarios listed:

  • Explore Data Lakes, described as “You can create data flows to visualize data in your data lakes”, and
  • “Replicate Data in Data Lakes”, explained as “You can replicate data from Oracle Service Cloud (RightNow) into Oracle Big Data Cloud Service, Oracle Database Cloud, and Oracle Database. For example, you can replicate data from customer service records in Oracle Service Cloud into your data lake on Oracle Big Data Cloud Service and then perform further data preparation and analysis with data flows and data visualization projects.”

I’m guessing most readers won’t have heard of Oracle Service Cloud but for Oracle Cloud customers it’s quite a useful initial integration, as Service Cloud is Oracle’s cross-channel digital customer call centre application that brings together customer service interactions across web, social media and offline service channels. More interestingly though, and you may have missed it back at the start, Data Lake Edition is the edition that ships Essbase, Oracle’s multi-dimensional OLAP server, in what almost seems like an afterthought right now but will become, I suspect, a more significant defining feature of this product package as time goes on.

For the purposes of this series of blog posts though, what I’m really interested in are the data engineering features in OAC Data Lake Edition — but before I start looking into what they might be, I guess I’d better explain what I mean by a data engineer.


I first came across the term data engineer in a blog written by Curt Monash — still the #1 person I’d love to have come on the Drill to Detail Podcast, if you happen to read this Curt — where he wrote:

“I learned some newish terms on my recent trip. They’re meant to solve the problem that “data scientists” used to be folks with remarkably broad skill sets, few of whom actually existed in ideal form. So instead now it is increasingly said that:

- “Data engineers” can code, run clusters, and so on, in support of what’s always been called “data science”. Their knowledge of the math of machine learning/predictive modeling and so on may, however, be limited.

- “Data scientists” can write and run scripts on single nodes; anything more on the engineering side might strain them. But they have no-apologies skills in the areas of modeling/machine learning.”

I think it’s probably true to say that the term really started to enter people’s consciousness, at least the sorts of people I hang around with, when Maxime Beauchemin from Airbnb wrote his seminal blog “The Rise of the Data Engineer” where he introduced this new discipline as:

“Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization.

Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.

In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.

In smaller companies — where no data infrastructure team has yet been formalized — the data engineering role may also cover the workload around setting up and operating the organization’s data infrastructure. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like.”

Maxime’s blog struck a chord with me as that’s pretty much how I’d describe what I do, or at least it’s the work I most enjoy doing: building big data systems to support BI and data warehousing workloads, where my knowledge of how Hadoop, cloud and distributed data platforms work complements the knowledge that other colleagues have around data science, machine learning and statistical analysis; importantly though, I know enough about those areas to be able to deploy the models they build (or as we say over in the UK at least, I know enough about them to be dangerous).

I interviewed Maxime about his blog and the topic of data engineering on the Drill to Detail Podcast last May and if you’ve not done so already I’d recommend you give the episode a listen; for me though, having just spent the last eighteen months working in the London tech startup scene, my definition of a data engineer would be someone who’s just out of university and has an ironic beard, thinks he’s made ETL tools obsolete but is slowly, painfully, recreating one and is now just on the brink of discovering code templating, deployment contexts, change control and data lineage…

… but earns twice what you earn and gets all the girls. Because, data engineers are even more essential than data scientists if you’re serious about building a data lake, and data lakes are the new data warehouses.


Which brings us back to Oracle Analytics Cloud Data Lake Edition and these three blog posts. The first part of this three-part series talked about Oracle Big Data Cloud and products such as Oracle Event Hub Cloud — Dedicated (who names these products at Oracle? Seriously?), and in this second post I seem to have got rather diverted by trying to define what Oracle Analytics Cloud Data Lake Edition actually is, and then even more so by defining the supporting role of a data engineer; in the final post tomorrow I’ll take a look at some of the data transformation, model definition and deployment and data engineering features in OAC Data Lake Edition and then, finally, conclude by asking myself two questions — OK, three:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product, Oracle Big Data Discovery, officially still available and not quite dead yet but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask, and he’s actually gone to Cloudera?
  3. Will the ironic beard and quiff be enough, or do I need to go full ginger if I’m going to really be taken seriously as a data engineer?

Final part in the series on Oracle Analytics Cloud Data Lake Edition comes later this week.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.1

Some time ago I posted a blog on what analytics and big data development looked like on Google Cloud Platform using Google BigQuery as my data store and Looker as the BI tool, with data sourced from social media, wearable and IoT data sources routed through a Fluentd server running on Google Compute Engine. Overall, the project architecture looked like the diagram below…

… and I’ve got more-or-less the same setup running right now, with an additional GCE VM running Confluent Open Source to feed a subset of the event streams into a Druid cluster that I’m using to test out Looker, Superset and Imply for sub-second ad-hoc query analysis use-cases. More on that soon.

If you’re a regular reader of this blog you might recall a week or so ago I posted a blog on the new releases of Oracle Business Intelligence Enterprise Edition (OBIEE), Oracle Analytics Cloud (OAC) and Oracle Data Visualization Desktop (DVD) and mentioned a new packaging option for their cloud analytics product, Oracle Analytics Cloud Data Lake Edition. I’ve got a particular interest in what this product might be as I used to use the product it replaces, Oracle Big Data Discovery (BDD), fairly extensively in Oracle big data analytics and data lake projects a few years ago.

And Oracle Big Data Discovery was — technically at least — a great product. It combined the search and analytics features of Endeca Information Discovery with the scale and data transformation abilities enabled by Apache Spark and Hadoop, but suffered perhaps by being a bit ahead of the market and by not having any obvious integration with the rest of Oracle’s analytics and data management tools. By contrast Oracle Analytics Cloud Data Lake Edition is one of three packaging options for Oracle Analytics Cloud and includes all of the functionality of OAC Standard Edition (Oracle Data Visualization together with basic data preparation tools) as well as itself being a subset of the wider set of analysis, dashboarding and enterprise data modeling features in OAC Enterprise Edition.

An equivalent product architecture for ingesting, transforming and analyzing my IoT, wearables and social media data in Oracle Cloud would look something like the diagram below, with the following Oracle Cloud Platform-as-a-Service (PaaS) products used for ingest, storage and analysis:

  • Oracle Event Hub Cloud Service: Apache Kafka running either customer or Oracle-managed with full access to Kafka’s REST Proxy and Kafka Connect
  • Oracle Big Data Cloud: Oracle’s new elastically-scalable Hadoop platform running Apache Spark, Ambari and other Hortonworks Data Platform components
  • Oracle Analytics Cloud Data Lake Edition: Oracle Data Visualization combined with more extensive lightweight ETL (“data flow”) components, text analytics and machine learning model training and build capabilities

In this example I’m using Apache Hive and Parquet storage as my column-oriented data platform, but of course I’ve now also got Oracle Autonomous Data Warehouse Cloud as an option; I’ll stick with Hive on Oracle Big Data Cloud for now though, as this gives me the option of using Apache Spark to transform and wrangle my data and build machine learning models using SparkML and, via pySpark, Python Pandas. In the first of two posts in this short series I’ll be looking at how the data pipeline is set up, and then in the second post I’ll look at Oracle Analytics Cloud Data Lake Edition in detail, focusing on the data transformation, data engineering and data science features it adds beyond OAC Standard Edition.


The development environment I put together for this scenario used the following Oracle Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) components:

  • Oracle Compute Classic and Storage Classic Services
  • Oracle Database Cloud Service, with the 11g database option
  • Oracle Event Hub Cloud Service Dedicated, with Kafka Connect and REST Proxy nodes
  • Oracle Big Data Cloud, single node with Hive, Spark 2.1, Tez, HDFS, Zookeeper, Zeppelin, Pig and Ambari
  • Oracle Analytics Cloud Data Lake Edition with Self Service Data Preparation, Visualisation and Smart Discovery (aka Oracle DV)

The screenshot below from the Compute Classic Service Console shows the various PaaS VMs running on the compute IaaS layer, with other tabs in this console showing network, storage and other infrastructure service usage.

The order in which you install the services is important if you want to associate the various products together correctly, and if like me you’re using a trial account you’ll need to plan carefully to ensure you keep within the various quota limits that Oracle Cloud imposes on trial identity domains; these stopped me, for example, allocating the usual two OCPUs to the main Oracle Event Hub Kafka server if I wanted to run the rest of the stack at the same time.

Associating two services together, for example Oracle Event Hub and Oracle Big Data Cloud, connects the two automatically in the identity domain network and makes using them together much simpler. However, unless Event Hub Cloud is provisioned and available when you come to install Big Data Cloud, you can’t go back and associate them afterwards. More annoyingly, if you decide you want to use Event Hub Cloud and associate it with Big Data Cloud, but in the meantime you’ve provisioned Oracle Analytics Cloud Data Lake Edition and associated Big Data Cloud with that, you have to unwind the whole provisioning process and start again with Event Hub Cloud if you want to connect them all together properly. And forget deleting that Database Cloud Service you associated with Oracle Big Data Cloud and then forgot about, as you can’t delete services that other services are associated with.

Provisioning each of the services involves giving the service instance a name, assigning storage buckets and OCPU counts to the various cluster nodes you request, and at key points selecting previously provisioned and now running services to associate with the one you’re provisioning. The screenshots below show the three-stage provisioning process for Event Hub Cloud Service — Dedicated:

Provisioning a new Event Cloud Service — Dedicated cluster

and the order in which I provisioned my data lake services, and the important options I chose to make it all work together and within quota, were as follows:

  1. First ensure you have access to Oracle Cloud, and then the various IaaS services: Oracle Cloud Infrastructure Compute Classic, Oracle Cloud Infrastructure Object Storage Classic and Oracle Identity Cloud Service
  2. Provision Oracle Database Cloud Service to store the various Fusion Middleware RCU schemas; in my case I chose Oracle Database 11gR2 as the database type as it avoids the complexity around CDBs and PDBs you get with the 12c database release
  3. Then provision Oracle Event Hub Cloud — Dedicated with one OCPU for the main Kafka VM, one for the REST Proxy VM and one more for the Kafka Connect VM. For a real deployment you’d want at least two OCPUs for the Kafka service VM, but using just the one kept me within my overall OCPU quota limit when installing the rest of the stack
  4. Next step is to provision Big Data Cloud with a single node with the minimum 2 OCPUs, the Full deployment profile and version 2.1 of Apache Spark, as that’s the version OAC Data Lake Edition insists on in the next step. When prompted, choose the option to associate Event Hub Cloud and Database Cloud with Big Data Cloud as you won’t get the option to do this again after the initial service provision; once provisioned, open up the TCP port for Ambari (8080) to the public internet so that OAC in the next step can associate with it — provisioning for OAC failed for me every time until I looked through the provisioning logs and spotted this as the issue
  5. Finally, provision Oracle Analytics Cloud and choose Data Lake Edition as the package option, again in my case assigning a single OCPU and selecting Data Lake Edition as the software version

At that point if you then bring up the Cloud Services dashboard and review the services together for the first time, it’ll look something like this:

Oracle Data Lake stack within Cloud Services Dashboard

Now it’s time to ingest some data and land it into Oracle Big Data Cloud.


The streaming IoT, wearables and social media comms data that I’ll be ingesting into Big Data Cloud will be coming in from the public internet over TCP, and I’ll also want to connect to Event Hub Cloud from my desktop using tools such as Kafka Tool, so an additional configuration step I’ll do before setting up anything else is to open up Event Hub Cloud’s Kafka broker endpoint to the public internet using the Access Rules menu item in the Event Hub Cloud console.

Now I can see the Kafka service and the default topics that Event Hub Service created for me in Kafka Tool.

I can then either use Kafka Tool to create a new topic to start receiving the first of my data streams, the IoT device event data coming out of Samsung SmartThings, or create the topic by defining a new Event Hub Cloud service from within the Event Hub Cloud Service — Dedicated console (confusing, but that’s how Kafka topics are named within Event Hub Cloud).

Then it’s just a case of directing the stream of IoT event data to the public Kafka broker endpoint exposed by Event Hub Cloud Service — Dedicated and then, after a short while, checking the metrics for the new Kafka topic that I set up to receive this incoming streaming data.

Getting the data off the Kafka topic and into a Hive table on the Big Data Cloud instance involved the following steps, using Oracle Cloud Infrastructure Object Storage Classic as the intermediate staging layer together with Event Hub Kafka Connect’s OCS Sink Connector:

  1. Configure Event Hub Kafka Connect OCS Sink Connector to push topic events to Oracle Cloud Infrastructure Object Storage Classic (OCS)
  2. Using the Zeppelin notebook provided by Big Data Cloud Console, create a CRON job that copies those events across to HDFS storage
  3. Create Hive external tables with location clauses that point to the directories I’ve copied the event files into from OCS
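For that last step, the Hive DDL would look something like the sketch below, assuming the events have been converted to Parquet on their way into HDFS; the table, column and HDFS path names here are illustrative.

-- Hypothetical Hive external table over the landed event files; names and location are placeholders
CREATE EXTERNAL TABLE IF NOT EXISTS iot_device_events (
  date_time  TIMESTAMP,
  device     STRING,
  metric     STRING,
  value      STRING,
  raw_value  STRING
)
STORED AS PARQUET
LOCATION 'hdfs:///user/bdcsce/landing/smartthings/';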

Then, when I go and log into OAC Data Lake Edition and connect to the Hive Thrift Server on the Big Data Cloud instance I can see the Hive tables I’ve just created, and the data that’s now streaming through from my two initial sources via the Kafka service and Kafka Connect running on Event Hub Cloud Service — Dedicated.

In the second half of this two-post series I’ll go deeper into OAC Data Lake Edition and see how its additional transformation and analysis capabilities stack up against OAC Standard Edition, and also see how it compares to the Oracle Big Data Discovery tool it’s looking to eventually replace.

Updates to Oracle Analytics Cloud, Oracle BIEE 12c and Oracle DV Desktop

Late last year I covered one of the first releases of Oracle Analytics Cloud (OAC), v3, on this blog; since then the product has had a major UI refresh with OAC v4, so I thought it’d be worth covering this new release along with updates to OBIEE12c and DV Desktop in a follow-up to last year’s post.

OAC’s original look and feel was based on OBIEE12c v1, which itself contained the first iteration of Oracle Data Visualisation (DV), and whilst the modern flat UI that release introduced was an improvement on the 11g release before it, Oracle have since iterated rapidly with Oracle DV Desktop and DV Cloud Service, and the UI from those products is now the primary user interface for Oracle Analytics Cloud v4.

If you’ve not used Oracle Analytics since OAC v3 and the initial OBIEE12c versions you’ll be surprised how much the user experience has changed; when OBIEE12c first came out most of us still considered Answers and Dashboards to be the primary reporting UI, with the BI Repository being the central, carefully governed and dimensionally-modelled source of all reporting data.

Since then the world has moved on and analytics is now all about empowering the end-user through self-service, easy-to-use tools that enable you to do most tasks without having to involve the IT department or do lots of up-front data modeling and data onboarding. Oracle DV Desktop started meeting this need by introducing basic data preparation features into the product so users could upload spreadsheets and other data and do some basic formatting and tidying before analyzing it on their desktop, and those data upload and preparation features are now available for OAC users in the v4 release. To upload a spreadsheet file into OAC you now just drop it onto the web page and you’re then given a preview and the option to upload its contents to the analytics instance.

After that you can trim, split, derive and change the datatypes of data you upload so that it works in the most optimal way with the analysis features of the product, for example by turning dates in string formats not recognized by Oracle DV into proper date datatypes that can then be used in time-series analysis.

For data transformation needs that go beyond basic formatting and datatype changes you can now build ETL-style routines that filter, aggregate, transform and add external datasets all using simple point-and-click user tools, and if you’ve licensed the new OAC Data Lake Edition these transformations then extend to include model training, forecasting and sentiment analysis. In the example below I’m using OAC Data Lake Edition to forecast weight forward from a set of smart scale readings that’s then executed at the BI Server level using Python maths and statistics libraries.

I’ll actually be talking about OAC Data Lake Edition in my presentation at ODTUG KScope’18 at Walt Disney World Florida this June, where I’ll cover some of these extended data flow features along with Oracle Big Data Cloud and Oracle Event Hub Cloud Service in my session “BI Developer to Data Engineer with Oracle Analytics Cloud Data Lake Edition”.


Cloud of course is great, but not every Oracle Analytics customer has made the move yet; or, like myself, you might have a cloud instance you can spin up as and when you need it but use regular on-premises OBIEE12c in a VM for your day-to-day learning and experimentation.

With Oracle’s cloud-first strategy and its focus on Oracle DV as the main innovation path for analytics features, my OBIEE12c instance running in a Google Compute Engine VM was starting to lag behind OAC v4 and DV Desktop in terms of look-and-feel and all of these new features, so I was pleased to note the other day that the 12.2.1.4.0 (aka v4) release of OBIEE12c is now available for download on OTN and includes many of the features currently available in OAC v4 and DV Desktop (but not the advanced data flow features in OAC Data Lake Edition, for example).

Oracle Data Visualization Desktop has also been updated recently to connect to Oracle Autonomous Data Warehouse Cloud and I’ll be covering that new capability in a blog post in the near future; but more importantly for me at least….

… it also now supports Mac OS X 10.13 High Sierra, great news for us Mac users who had to stop using DV Desktop on our laptops late last year when that most recent update to OS X came out but DV Desktop didn’t yet support it.

First Impressions of Oracle Autonomous Data Warehouse Cloud

Regular readers of this blog will know that my day-to-day work involves ingesting, storing and then analyzing petabytes of data using Google BigQuery, one of a new generation of fully-managed, cloud-hosted data warehouse platforms-as-a-service. Services like BigQuery take care of all the database infrastructure management and capacity planning leaving you just to define tables, load data into them and then run queries using SQL or your favourite BI tool.

Oracle recently announced Oracle Autonomous Data Warehouse Cloud (ADWC), which competes in this market, and given my background in Oracle technology I was interested to see how it stacked up compared to BigQuery and similar services from the likes of Amazon AWS and Snowflake. To do this I took advantage of Oracle’s free $300 service credit offer and signed up for an ADWC instance – note that at the time of writing the ADWC service is only available in Oracle Public Cloud’s North American datacenter region as opposed to the European ones, so make sure you choose the former when creating your Oracle Cloud account.

Setting up an Oracle on-premises database was something I’d once spent months, and thousands in course fees, studying for certifications to be able to do confidently, so it was nice to see how much simpler setting up an ADWC instance has become, the first step being to select ADWC from the Oracle Cloud Services menu, like this:

Then answer a few simple questions: the name to give the instance, number of CPUs and amount of storage to set aside initially, and the password for the administrator account.

Setup of an instance takes a few minutes, then it’s ready. Not bad.

The next big change you notice compared to Oracle’s customer-managed Database Cloud Service is the replacement of Enterprise Manager Database Express 12c and DBaaS Console with a single Service Console, accessed not through the usual SYS and SYSTEM superuser accounts but through a new superuser account called ADMIN, presumably adopting a naming standard more familiar to users of other SaaS and PaaS platforms.

The Service Console contains a couple of performance-tracking pages with the emphasis on read-only viewing of those stats vs. giving you controls to fine-tune database parameters or table structures, and a page where you can download a zip file containing a wallet directory with TNSNAMES.ORA and encrypted connection details to your ADWC instance.

I seem to remember at the time of Autonomous Data Warehouse Cloud’s launch that many in the Oracle community (myself included, most probably) said that all that the “autonomous” part of Autonomous Data Warehouse Cloud did was automate the Oracle Database maintenance tasks that more recent data warehousing platforms-as-a-service handled for you automatically. Another common critique at that time was that under the covers ADWC was more scripting and use of existing “advisors” than true autonomous artificial intelligence-style database management.

To be honest though, it doesn’t matter; compared to running a regular Oracle database you just fill in a short form and the platform runs without any further intervention from you. You don’t need to know how it works: job done, in terms of competing with BigQuery and Redshift for ease-of-use and maintenance.

Connecting to ADWC is a bit different to regular Oracle Database connections in that it’s easier — no need to specify host, port and SID/service name — but it involves downloading a ZIP file wallet and requires the tool you wish to connect with to support that wallet. Oracle SQL Developer in its most recent 18c version has that support, making it the obvious tool to define new users, create tables, upload data into them and then run simple SQL queries against your data.

Oracle ADWC is based on Oracle Exadata Database Server technology and supports PL/SQL, full INSERT/UPDATE/DELETE support in SQL as well as SELECT, parallel query (enabled by default) and hybrid columnar compression.

What you don’t get are indexes and materialized views, as these generally aren’t considered necessary when running on Exadata hardware that’s optimized for full table scans of data arranged in compressed columnar format, and as ADWC charges per CPU per hour there’s no penalty for scanning lots of detail-level data unnecessarily, as there is with BigQuery.

Also missing are features such as the Oracle OLAP Option, Oracle Spatial and Graph and Oracle Text that I’m guessing might get implemented at a later date, along with management features such as Oracle Tuning Pack that are no longer needed when the autonomous features of the platform do the database tuning for you.

Price-wise, although as I said a moment ago you do get charged by the amount of data you scan (and a bit for storage) with Google BigQuery, Oracle ADWC seems to be more expensive to run, at least at the data volumes I’d be using as a developer initially evaluating the product.

If you commit to monthly charging then ADWC costs around $1200/cpu/month, whereas Google charge $5/TB for queries with the first 1TB each month free, plus a token amount for storage and streaming inserts; my bill last month for BigQuery came to just $1.53, and my total bill for all Google Cloud Platform services, including multiple hosted VMs and extensive usage of their NLP and Geocoding APIs, came to just over $150. Clearly Google are aiming more at the startup market whereas Oracle are serving enterprise customers, but don’t make the assumption that just because ADWC is elastic it’s also super-cheap for hobbyist use, long-term.
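As a rough worked example on the BigQuery side, assuming the $5/TB on-demand rate above and ignoring storage, a month in which your queries scan around 1.3 TB of billable data works out at (1.3 TB − 1 TB free) × $5/TB ≈ $1.50 in query charges, which is roughly the level my bill came in at.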

So now it’s a case of loading some sample data into ADWC using SQL Developer and then running a couple of test queries to see it working in action with my data.

Tomorrow I’ll show how Looker can be connected using its newly announced support for Oracle ADWC, and later on in the week do the same for Oracle Analytics Cloud and DV Desktop.

Timeline Charts, Derived Tables and Analytic Functions in Looker 5

One of the data streams that comes into my personal Google BigQuery data warehouse is a record of all the Hue light switch settings and illuminance levels, Nest thermostat settings and motion sensor readings coming in via the Samsung SmartThings hub under the stairs back home. For the Hue lightbulbs I get a stream of 1s and 0s that initially get stored in a four-column table in BigQuery, like this:

I then create a SQL view over that table that joins to a lookup table containing the device type and room allocated to each of the devices in the data stream, and sets a metric type for each reading so that I can analyze all the thermostat readings together, bring together all the light switch settings and so on.
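As a rough sketch of what that view’s defining query might look like (the raw and lookup table names and the lookup columns here are illustrative, not the actual ones in my dataset):

-- Illustrative sketch: each reading is joined to its device's room, type and metric classification
SELECT
  r.date_time,
  r.device,
  d.device_type,
  d.room,
  d.metric_type AS metric,
  r.raw_value
FROM `aerial-vehicle-148023.personal_metrics.smartthings_raw` r
JOIN `aerial-vehicle-148023.personal_metrics.device_lookup` d
  ON r.device = d.device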

It’s then quite straightforward to bring this SQL view into Looker and add it to the rest of the data streams I’ve brought together there, so I can see, for example, whether the kitchen lights get turned off after everyone’s gone upstairs after breakfast, or whether I should try and automate those as well as my kettle.

Looker 5 introduced another way to visualize data with defined start and end events in the form of the new Timeline Chart, a feature I tweeted about the other day when I brought data on tasks from Asana into my BigQuery data warehouse via a Fivetran integration.

It’d be interesting to use this type of chart to show how one particular light switch changed from on to off and then on again over a period of time, or to take all the Hue lights for a given part of the house and see whether anybody ever turns the lights off … or whether they need to be linked to the motion sensors and automatically turned off after a period of inactivity in each room.

To use this type of chart I first need to create an “end date” field for each of my event records, so each timeline has a start and end date with other fields being optional for label and value. Looking at the list of events for the light switch I looked at earlier and ordering the events by time recorded you can see the information I need is there:

SELECT
  *
FROM
  `aerial-vehicle-148023.personal_metrics.smartthings_readings`
WHERE
  device = 'kitchen seating area spot 1'
  AND raw_value IN ('off','on')
ORDER BY
  date_time

What this calls for is the BigQuery Standard SQL LEAD analytic function, which returns a value from a subsequent row relative to the one we’re working with, and I can combine that with the TIMESTAMP_DIFF function to compute the time between the start and end event dates, as that would be useful to know too. The SQL expressions for the two new fields would therefore be:

LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) AS end_date_time,
TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, SECOND) AS value_duration_seconds

I could add these column expressions into the SQL view definition in BigQuery and order the rows by device, metric and timestamp, like this:

SELECT
  date_time AS date_time,
  device AS device,
  metric,
  value,
  raw_value,
  LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) AS end_date_time,
  TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, SECOND) AS value_duration_seconds
FROM `aerial-vehicle-148023.personal_metrics.smartthings_readings`
WHERE device = 'kitchen seating area spot 1'
  AND metric = 'switch'
  AND raw_value IN ('on','off')
ORDER BY 2, 3, 1

Executing that SQL shows the logic is working, but I’d then have to maintain that view within BigQuery and that might not be the most convenient place to add new code.

Instead, I could just go into Looker and create a new view there based on a derived table SQL expression and do the work there. Right now my LookML model looks like the excerpt below, where you can see the sensor readings view joined into the rest of the explore so all my data can be analyzed together.

connection: "rittman_bigquery"
include: "*.view.lkml"         # include all views in this project
include: "*.dashboard.lookml"  # include all dashboards in this project

explore: date_dim {
  case_sensitive: no
  label: "Data Warehouse"
  join: fluentd_transactions {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
    relationship: many_to_many
  }
  join: smartthings_readings {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
    relationship: one_to_many
  }
  join: fluentd_uber_rides {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
    relationship: many_to_many
  }
}

I now create a new LookML view that uses a derived table SQL query as the view definition rather than simply referencing an existing BigQuery table. Note how I’ve used the same view_label as the main LookML view containing my event data, so that the dimensions and measure I define here appear alongside all the other smart device fields in the same explore view.

view: device_event_end_and_timespan {
  view_label: "6 - Smart Devices"
  derived_table: {
    sql: SELECT
           date_time as date_time,
           device as device,
           metric as metric,
           LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time) as end_date_time,
           TIMESTAMP_DIFF(LEAD(date_time, 1) OVER (PARTITION BY device, metric ORDER BY date_time), date_time, second) as value_duration_seconds
         FROM `aerial-vehicle-148023.personal_metrics.smartthings_readings`
         ORDER BY 2, 3, 1 ;;
  }
  dimension: date_time {
    type: date_time
    hidden: yes
    sql: ${TABLE}.date_time ;;
  }
  dimension: device {
    type: string
    hidden: yes
    sql: ${TABLE}.device ;;
  }
  dimension: metric {
    type: string
    hidden: yes
    sql: ${TABLE}.metric ;;
  }
  dimension_group: end_date_time {
    group_label: "End Date"
    label: "End"
    type: time
    timeframes: [
      raw,
      time,
      hour,
      hour3,
      hour4,
      hour6,
      hour12,
      hour_of_day,
      time_of_day,
      minute,
      minute5,
      minute10,
      minute15,
      minute30,
      day_of_week,
      day_of_month,
      day_of_year,
      date,
      week,
      week_of_year,
      month,
      month_name,
      month_num,
      quarter,
      year
    ]
    sql: ${TABLE}.end_date_time ;;
  }
  dimension: pk {
    primary_key: yes
    hidden: yes
    type: string
    sql: concat(cast(${TABLE}.date_time as string), ${TABLE}.device) ;;
  }
  measure: value_duration_seconds {
    type: average
    sql: ${TABLE}.value_duration_seconds ;;
  }
}

Then I join this derived table view back into the explore within my LookML model, as below:

connection: "rittman_bigquery"
include: "*.view.lkml"         # include all views in this project
include: "*.dashboard.lookml"  # include all dashboards in this project

explore: date_dim {
  case_sensitive: no
  label: "Data Warehouse"
  join: fluentd_transactions {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_transactions.date_minute5} ;;
    relationship: many_to_many
  }
  join: smartthings_readings {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${smartthings_readings.date_minute5} ;;
    relationship: one_to_many
  }
  join: device_event_end_and_timespan {
    type: left_outer
    sql_on: ${smartthings_readings.date_time} = ${device_event_end_and_timespan.date_time} and
            ${smartthings_readings.device} = ${device_event_end_and_timespan.device} and
            ${smartthings_readings.metric} = ${device_event_end_and_timespan.metric} ;;
    relationship: one_to_one
  }
  join: fluentd_uber_rides {
    type: left_outer
    sql_on: ${date_dim.date_minute5} = ${fluentd_uber_rides.date_minute5} ;;
    relationship: many_to_many
  }
}

and now I can create timeline charts that show me which lights were on over a particular period of time, like this:

or more importantly, work out why the bathroom never seems to be free when you’ve got two teenage kids in the house.

Druid, Imply and Looker 5 bring OLAP Analysis to BigQuery’s Data Warehouse

Back in the good old days of on-premise data warehousing projects where data was structured, the software came on CDs and data engineering was called ETL development (and was considered the most boring job on a project, how times have changed) a typical technical architecture had sources on the left-hand side, a relational database hosting the data warehouse in the middle, and a multi-dimensional OLAP (or “MOLAP”) server on the right-hand side serving up data — fast — to specialist query and reporting tools.

Diagram courtesy of Readings in Database Systems, 3rd Edition, Stonebraker & Hellerstein, eds, 1996.

OLAP (Online Analytical Processing) servers were a great complement to the relational database servers that hosted data warehouses back in those days: they took a subset of the whole dataset, structured it into a dimensional model, stored it as indexed arrays of leaf-level and pre-computed aggregates, and served up results with split-second response times for any combination of dimension filters or level of aggregation requested … and users loved them.

The first ten years of my consulting career were spent working with OLAP servers, and if you’re interested in their history I’d recommend you check out Episode 37 of the Drill to Detail Podcast, where I talked about arguably the last pure MOLAP server still actively sold and implemented, Oracle’s Essbase Server, with Cameron Lackpour; Episode 34 with Donald Farmer, the original Product Manager behind Microsoft Analysis Services; and Episode 11 with Graham Spicer, my original mentor and the single person most responsible for the nearly twenty-year career I’ve had since then in consulting, data warehousing and now product management, on Oracle’s original Express Server and then OLAP Option technologies.

But just like mainframes that reached perfection just at the time when PCs and mini-computers made them obsolete, OLAP Servers fell out of favour as data volumes and data types exploded whilst time available for loading them via batch processes just disappeared. On-premise data warehouses eventually transformed into elastic, cloud-hosted data warehouse platforms provided as fully-managed services such as Google BigQuery, Snowflake DB and Oracle Autonomous Data Warehouse Cloud Service and were accompanied by a new generation of BI tools like Looker, aimed at data-driven tech startups needing to analyze and understand vast amounts of consumer activity and other event-level behavioural data, as I talked about in my session at the recent Looker Join 2017 conference in San Francisco on Qubit, BigQuery and Looker for petabyte-scale analytics.

But BigQuery is, at the end of the day, a query and compute engine optimized for data warehouse-style queries and workloads albeit at a scale unimaginable ten years ago; Druid, an open-source project first announced in a white paper back in 2014 and now arguably the standard for new-world distributed data stores optimized this time for sub-second response times, may be the OLAP Server to BigQuery’s data warehouse.


To be clear, BigQuery and other distributed query engines like it are fast, particularly when filtering, sorting and aggregating single wide tables of columnar-organized data as you can see in the video below where I query and aggregate around four-and-a-half million smart device events to find out the average monthly temperature in each of the rooms in my house.

[embed]https://youtu.be/kd2403Pe4f4[/embed]
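
For reference, the aggregation in that video is along the lines of the hand-written sketch below. Note that the value column holding each sensor reading is an assumption on my part (the full table definition isn’t shown in this post), and mapping devices to the rooms they sit in would need a small lookup table that’s also not shown here.

SELECT
TIMESTAMP_TRUNC(date_time, MONTH) AS reading_month,
device,
AVG(CAST(value AS FLOAT64)) AS avg_temperature -- "value" column name is assumed
FROM `aerial-vehicle-148023.personal_metrics.smartthings_readings`
WHERE metric = 'temperature'
GROUP BY 1, 2
ORDER BY 1, 2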

BigQuery supports joins between large tables, uses ANSI-standard SQL and has more recently benefited from a number of improvements to response times for small queries as well as large ones; but compared to OLAP servers, which typically pre-compute all the different aggregations in advance and store data indexed and organized by the dimensions users filter results by, it’s definitely a general-purpose database engine rather than a single-purpose OLAP server, and all query aggregations have to be computed on-the-fly.

Druid, originally authored by Eric Tschetter and Fangjin Yang at Metamarkets in 2011 and described in detail in this white paper from 2014, explicitly re-implements key features of old-school OLAP servers: it pre-aggregates incoming real-time streaming and batch data and stores it in a more compressed form, organizes that compressed data as time-based segments bitmap-indexed by dimensions, and then presents the data out as OLAP cubes to client applications.

Image courtesy of “What is Druid”, image downloaded in Oct. 2017

Druid has some significant limitations compared to more general-purpose analytic database engines such as BigQuery: it doesn’t support table joins right now (though it may do by the time you read this, as an open-source project it evolves rapidly), its primary client interface is JSON over HTTP, and most importantly for organizations that moved to BigQuery precisely because it’s a fully-managed service, with Druid you have to take care of server upgrades, capacity scaling and all the other infrastructure management tasks that we thought we’d said goodbye to with data warehouse-as-a-service platforms.

But companies offering services and management tools to run Druid as just another platform service are starting to emerge, and courtesy of the Apache Calcite project it’s now possible to query Druid using regular SQL, a capability Looker took advantage of to offer Druid connectivity as one of the new features in the recent Looker 5 release, as you can see me demonstrating in the video below.

[embed]https://www.youtube.com/watch?v=IQ1Ce65HaHI[/embed]
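
To give a flavour of what that Druid SQL capability looks like, here’s a hand-written sketch of the sort of query you could send to Druid’s SQL endpoint; the datasource name and the "value" column are assumptions based on the dataset I loaded rather than anything copied from my environment.

SELECT
FLOOR(__time TO MONTH) AS reading_month,
device,
AVG("value") AS avg_temperature -- assumes the reading was ingested as a numeric column
FROM "smartthings_readings"
WHERE metric = 'temperature'
GROUP BY FLOOR(__time TO MONTH), device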

But just as old-school OLAP servers worked best with query tools specifically designed to work with them, new open-source BI tools such as Superset (from the same engineering team at Airbnb that also brought us Airflow, my current ETL orchestration tool of choice) connect directly to Druid clusters and come close to their commercial rivals in terms of the reporting and dashboard features offered to end users; in the video below you can see me creating a percentage-change line graph showing how the amount of time I spend cycling each month changed over time, using the same Druid datasource as in the other videos.

[embed]https://youtu.be/3uU10q7mPBo[/embed]

Superset, Looker and other BI tools that now support Druid are of course great, but the one that’s really got my interest, and prompted me to look further into Druid and how it complements BigQuery and other cloud data warehouse platforms, is Imply, a startup launched by one of the original co-authors of Druid who, not unlike Looker reinventing the enterprise BI platform for the big data and startup world, is reintroducing that same world to OLAP analysis whilst making the Druid back-end much easier to manage.


Imply runs either on-premise, as open-source software you can download and then install on local or cloud-hosted VMs, or as a platform-as-a-service offering through a contract with Imply. Druid is one of the more complex and involved analytic databases to load, but Imply’s cloud service makes it simple to spin up and manage a Druid cluster and then ingest data into it, either from real-time streaming sources or via batch loads from Amazon S3 and other popular datasources.

Images courtesy of Imply Cloud Quickstart docs page

I’ve got Imply running on my own Google Compute Engine infrastructure-as-a-service platform, so I take care of server management and data ingestion manually, but for me the standout feature in Imply’s platform is Pivot, their open-source OLAP query tool. If you’re old enough to remember OLAP client tools such as Oracle Express Sales Analyzer and ProClarity you’ll recognize Pivot’s use of terms such as cubes, dimensions, aggregation types and measures, as shown in the screenshot of my setup below…

… but more importantly, you’ll recognise the structured query environment and the lightning-fast response to my queries against that same set of four-and-a-half million IoT and other events, extracted from my BigQuery environment and then loaded and stored in compressed column-store segments pre-aggregated and indexed by the same dimension fields I’m now analysing them by.

[embed]https://www.youtube.com/watch?v=Dj38w2nhNyI[/embed]

Well they say nothing’s new in fashion or music if you wait long enough, and sure enough as I said in my tweet a couple of years ago…

[embed]https://twitter.com/markrittman/status/638311769556025344[/embed]

… yes it does make me feel old, but it’s great to see such a powerful concept as multidimensional OLAP storage and dimensional models being rediscovered by the big data and startup worlds.

Google BigQuery, Large Table Joins and How Nested, Repeated Values and the Capacitor Storage Format…

I’m flying over to Hungary early next week to meet my good friend Bence Arato, and to speak at the Budapest Data Forum about analytics and big data on Google’s Cloud Platform using BI tools such as Looker, both of which I’ve written about on this blog over the past few months. BigQuery lets you store gigabytes through to petabytes of data in a distributed, cloud-based query and compute layer that automatically scales up as you load more data into it, and presents data out in familiar tables and columns, datasets (schemas) and projects (databases) that would be immediately recognisable to anyone coming into this new technology from the old world of relational databases and SQL queries.

But that familiarity can be deceiving sometimes, particularly when you start to really do things at scale. Take, for example, two of the sample datasets that Google make available on the Google BigQuery Public Datasets page: “Bay Area Bike Share Trips”, which contains records of all public bike hire trips in San Francisco over a number of years:

and another that contains records of all the San Francisco 311 (non-emergency) calls made since 2008 including reason for the call, time and caller location, both of which are ready for analysis and can be queried immediately using BigQuery’s ANSI SQL-standard query language.

The bike rides dataset contains just under a million records stored in a 120MB table and returns answers to queries in just a few seconds with none of the indexes, summary tables or in-memory caches that traditional databases and OLAP servers required you to populate and manage to keep query response times within an acceptable limit.
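
For example, a simple hand-written aggregation like the one below, using columns from that public table, comes back in a few seconds with no tuning at all on my part.

SELECT
start_station_name,
COUNT(trip_id) AS total_trips,
AVG(duration_sec) AS avg_duration_sec
FROM `bigquery-public-data.san_francisco.bikeshare_trips`
GROUP BY 1
ORDER BY total_trips DESC
LIMIT 10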

In fact, combining these two datasets together so I can see which bikeshare locations had the most traffic accidents associated with them would be an interesting exercise, and BigQuery supports joins between tables using the ANSI join syntax, like this:

SELECT
zip_code, count(trip_id) as total_trips, call_type, count(call_type) as total_calls
FROM
`bigquery-public-data.san_francisco.bikeshare_trips` b
LEFT JOIN `bigquery-public-data.san_francisco.sffd_service_calls` s
ON b.zip_code = s.zipcode_of_incident
WHERE
call_type = 'Traffic Collision'
GROUP BY
1,3

Joining large fact table-style tables together with smaller dimension-style tables works fine with BigQuery, but joining these two large denormalized fact tables together gives us a resources exceeded error after the query runs for over fifteen minutes.

What we’re seeing here is the trade-off we get when running data warehouse workloads on a distributed query platform like BigQuery. The size of the datasets we can work with becomes huge, but joins between those datasets become relatively expensive due to the need to sort, co-ordinate and send intermediate results between thousands of server nodes in the cluster.

The usual answer is to denormalize even further, creating one single large table containing all the columns from both input tables and multiple rows for each base fact if you’ve got a one-to-many (or many-to-many) relationship between the joined tables. Storage is cheap with cloud platforms from Google and Amazon, but BigQuery offers another solution that relies on Colossus, the distributed file system Google replaced the Google File System (the inspiration for HDFS) with, and in particular on the Capacitor columnar storage format that BigQuery leverages for its fast, column-store table storage.


Google BigQuery supports several input formats for data you load into tables — CSV files, JSON files, Avro files and Datastore backups — but under the covers BigQuery uses a columnar storage format developed by Google called Capacitor (originally called ColumnIO) that’s used by Google’s replacement for GFS/HDFS, the Colossus distributed filesystem. Like Parquet and other columnar storage formats that were inspired by ColumnIO, it stores data in compression-friendly columns of nested, repeating values that suit BI-style applications that filter and aggregate big, wide tables of data rather than process transactions using whole rows of data.

What this means for BigQuery and our two tables of data is that we’d be better off storing the joined table’s values as nested repeating field types in the first table, avoiding the join altogether and storing our data in the way best suited to BigQuery’s storage format. To take an example, if we took the aggregated bike trips values and wrote them out to a JSON format input file like this:

{"zipcode":"94537",
"trips":[{"start_station_name":"Embarcadero at Vallejo",
"avg_duration_sec":8787,
"trip_count":"2"},
{"start_station_name":"Townsend at 7th"
,"avg_duration_sec":807,
"trip_count":"1"}]

that input file could then be used to populate a table that stored each of the trips for a given zipcode as nested repeated values in a single column rather than one row per trip, and with BigQuery’s columnar storage those trips would only be retrieved from storage if that particular column was included in the query, like this:

SELECT zipcode, trips.trip_id, trips.duration_sec
FROM personal_metrics.bike_trips_nested,
UNNEST (trips) as trips
LIMIT 10

If I’ve not got input files handy in this nested repeated JSON format, I could use BigQuery Standard SQL to output nested fields using the ARRAY_AGG and STRUCT functions, like this:

SELECT zip_code as zipcode, 
ARRAY_AGG(STRUCT(trip_id, duration_sec, start_date,
start_station_name, start_station_id, end_date,
end_station_id, end_station_name, bike_number)) AS trips
FROM `aerial-vehicle-148023.personal_metrics.bikeshare_trips`
GROUP BY 1

To store the data in this nested format make sure you store the query results in a table, check the Allow Large Results checkbox and deselect the Flatten Results checkbox when using the Web UI.
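
As an alternative to the Web UI route, and assuming Standard SQL DDL (CREATE TABLE … AS SELECT) is enabled for your project, a sketch like the one below would write the nested output straight into the bike_trips_nested table queried a moment ago.

CREATE TABLE personal_metrics.bike_trips_nested AS
SELECT zip_code as zipcode,
ARRAY_AGG(STRUCT(trip_id, duration_sec, start_date,
start_station_name, start_station_id, end_date,
end_station_id, end_station_name, bike_number)) AS trips
FROM `aerial-vehicle-148023.personal_metrics.bikeshare_trips`
GROUP BY 1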

You can then either export that table’s contents into a JSON format output file with nested repeated JSON elements, or query the table in place for more efficient lookups of the repeated dimension values. Taking this to its logical conclusion you could do this for both of the tables we’re looking to join together, as they both have these nested records — bike trips per zipcode for one, police incidents per zipcode in the other — and create one big SELECT statement that nests both of these and outputs one big denormalized table with nested repeated column values for each zipcode’s bike trips and incidents records, like this:

WITH
bike_trips AS (
SELECT
zip_code AS zipcode,
start_station_name,
AVG(duration_sec) AS avg_duration_sec,
COUNT(trip_id) AS trip_count
FROM personal_metrics.bikeshare_trips b
GROUP BY 1,2),
incidents_temp AS (
SELECT zipcode_of_incident AS zipcode,
call_type,
COUNT(call_type) AS call_type_count
FROM personal_metrics.sffd_service_calls
GROUP BY 1,2)
SELECT
bt.zipcode,
ARRAY_AGG(STRUCT(bt.start_station_name,
bt.avg_duration_sec,
bt.trip_count)) trips,
ARRAY_AGG(STRUCT(ic.call_type,
ic.call_type_count)) incidents
FROM bike_trips bt
LEFT JOIN incidents_temp ic
ON bt.zipcode = ic.zipcode
GROUP BY 1

and then the equivalent to the two-table join query right at the start would look like this:

SELECT
zipcode, count(zipcode_trips.trip_count) as total_trips, zipcode_incidents.call_type, count(zipcode_incidents.call_type) as total_calls
FROM
`aerial-vehicle-148023.personal_metrics.sf_biketrips_incidents_nested`
LEFT JOIN UNNEST (trips) as zipcode_trips
LEFT JOIN UNNEST (incidents) AS zipcode_incidents
WHERE
zipcode_incidents.call_type = 'Traffic Collision'
GROUP BY
1,3

and then, most importantly, this new query where we’ve eliminated the large table-to-large table join and nested the dimension values in a format that aligns with BigQuery’s storage format, returns results in … 7 seconds.


Not all ODBC drivers and BI tools support BigQuery’s nested repeated column format, but Looker, the BI tool I’m working with day-to-day on BigQuery right now, does, supporting nested joins in LookML like this:

explore: sf_biketrips_incidents_nested {
join: sf_biketrips_incidents_nested__incidents {
view_label: "Incidents"
sql: LEFT JOIN UNNEST(${sf_biketrips_incidents_nested.incidents}) as sf_biketrips_incidents_nested__incidents ;;
relationship: many_to_one
}
join: sf_biketrips_incidents_nested__trips {
view_label: "Trips"
sql: LEFT JOIN UNNEST(${sf_biketrips_incidents_nested.trips}) as sf_biketrips_incidents_nested__trips ;;
relationship: many_to_one
}
}
view: sf_biketrips_incidents_nested {
sql_table_name: personal_metrics.sf_biketrips_incidents_nested ;;
dimension: incidents {
hidden: yes
sql: ${TABLE}.incidents ;;
}
dimension: trips {
hidden: yes
sql: ${TABLE}.trips ;;
}
dimension: zipcode {
type: zipcode
sql: ${TABLE}.zipcode ;;
}
measure: count {
type: count
drill_fields: []
}
}
view: sf_biketrips_incidents_nested__trips {
dimension: id {
primary_key: yes
sql: CONCAT(sf_biketrips_incidents_nested.zipcode,${TABLE}.start_station_name) ;;
}
measure: avg_duration_sec {
type: average
sql: ${TABLE}.avg_duration_sec ;;
}
dimension: start_station_name {
type: string
sql: ${TABLE}.start_station_name ;;
}
measure: trip_count {
type: sum
sql: ${TABLE}.trip_count ;;
}
}
view: sf_biketrips_incidents_nested__incidents {
dimension: id {
primary_key: yes
sql: CONCAT(sf_biketrips_incidents_nested.zipcode, ${TABLE}.call_type) ;;
}
dimension: call_type {
type: string
sql: ${TABLE}.call_type ;;
}
measure: call_type_count {
type: sum
sql: ${TABLE}.call_type_count ;;
}
}

Looker then displays the two nested columns as regular Looker views in the dashboard, so users don’t even need to be aware that you’ve denormalized and optimized the storage of those two tables, and that details of bike trips and police incidents are now stored in nested repeated columns.

Like anything in database design, or IT in general, nested columns aren’t a silver bullet that solves every large-table join or query optimization problem, and they will hit the same sort of resource limit issues if you try to store thousands or millions of dimension values in this nested repeating format.

But for most star-schema style queries across sets of denormalized fact and dimension tables, storing data in this format works better, and using the ARRAY_AGG and STRUCT functions you can even generate test data in this more optimized format without having to write your own JSON data exporter to nest the dimension values properly.

What BI Development Looks like with BigQuery, Google Cloud APIs, Looker and Fluentd (courtesy of…

For the last six months I’ve been working in London helping build out an analytics service built on Google’s Cloud Platform, BigQuery and the Looker BI tool. It’s been a very interesting and rewarding experience working within product management in an engineering team using agile development practices, developing with node.js and clojure and shipping releases every few months.

However, working in product management rather than in development or consulting means I’m more likely to be creating product roadmaps and working on partner enablement than developing with BigQuery and Looker, so I try and use the journey time up to London each day to do some development myself, working out which development requests and user stories would be easy or hard for our engineering team to deliver … and because, as a techie and career-long BI and data warehousing developer, this stuff is just too cool and too interesting not to get developing with.


So over the past few months I’ve taken the IoT and wearables data analysis system I put together using Hadoop running on servers back home and ported it to Google BigQuery and, more recently, Looker running in the cloud, using Fluentd as the log aggregation engine running on a Google Compute Engine-hosted VM to collect, process and ship to BigQuery all the data from wearable devices, social media and other services, along with the IoT event data from SmartThings that I previously fed into Logstash and Hadoop. Adding Google Cloud Storage for staging incoming data extracts from services such as Moves and Google Takeout, and running it all on Google Cloud Platform, this updated version of my analytics architecture looks like the diagram below.

The typical cost of a VM hosting a simple Fluentd server and running all month is around $80. BigQuery is more or less free for developer use, with charging based largely on how much you query plus some costs for streaming inserts and data storage; with the first 1TB of queries free each month and minimal storage needs, Google’s cloud data warehouse platform has pretty much taken over from Oracle as my day-to-day development platform. Like Oracle in the days of OTN downloads of on-premise developer tools and databases, with permissive licensing for personal training and developing prototype applications, you can learn the technology essentially for free and really get to know and understand the tools without worrying about short trial license periods or having to pay thousands of dollars for full commercial licenses.


I’ve written and presented on Google BigQuery a few times since being introduced to it late last year, so I won’t go into how this distributed, column-store query engine based on Google’s Dremel engine works. But there’s another technology I used as part of this round of development: an open-source tool called Fluentd that I used instead of Flume and Logstash for the log aggregation part of this project. I’ve found it particularly useful given its range of input and output data source plugins, and its support for Pub/Sub, BigQuery and other parts of the Google Cloud Platform makes it easy to accept data from IFTTT and SmartThings and then stream it, lightly transformed and consistently timestamped, into a set of BigQuery tables.

For example, creating a data feed that brings in metrics on car journeys via the Dash iOS app and a small bluetooth device that plugs into your car’s diagnostics port involves first registering an IFTTT applet to trigger when that app registers a journey-complete event, with the applet payload then being an IFTTT Maker webhook that sends the journey metrics to Fluentd in JSON format as an HTTP POST request.

Fluentd then receives the JSON document using its HTTP input plugin, processes, transforms and timestamps the record and then sends it as a streaming insert into one of the tables I’ve set up in Google BigQuery, over which I’ve then created a SQL view that transforms all of IFTTT’s odd “3h 5m” and “August 14, 2004 at 10:15PM” date and time formats into additional timestamps that Looker and other BI tools can automatically parse into date and time elements.

select date_time, 
timestamp_trunc(date_time,DAY) as date_day,
PARSE_TIMESTAMP('%B %d %Y %I:%M%p',REPLACE(REPLACE(StartedAt,', ',' '),' at','')) as journey_start_date_time,
PARSE_TIMESTAMP('%B %d %Y %I:%M%p',REPLACE(REPLACE(EndedAt,', ',' '),' at','')) as journey_end_date_time,
TIMESTAMP_DIFF(PARSE_TIMESTAMP('%B %d %Y %I:%M%p',REPLACE(REPLACE(EndedAt,', ',' '),' at','')),
PARSE_TIMESTAMP('%B %d %Y %I:%M%p',REPLACE(REPLACE(StartedAt,', ',' '),' at','')), MINUTE) AS journey_mins,
StartAddress,
StartWeatherConditions,
StartLatitude,
StartLongitude,
DrivingScore,
cast(replace(DistanceDrivenWithLabel,' Km','') as FLOAT64) as distance_km,
EngineLightAlerts,
HardBrakeAlerts,
HardAccelerationAlerts,
SpeedAlerts,
cast(replace(FuelConsumedWithLabel,' l','') as FLOAT64) as fuel_l,
cast(replace(AvgDistancePerHourWithLabel,' kph','') as FLOAT64) as speed_kph,
cast(replace(AvgFuelConsumptionWithLabel,' kpl','') as FLOAT64) as fuel_econ_kpl,
EndLatitude,
EndLongitude,
EndAddressMapURL
from `xxxxx.personal_metrics.fluentd_dash_journeys`

One of the nice things you can do with Fluentd is call arbitrary Ruby scripts to enrich incoming data via the Fluentd script plugin, something I’ve used to analyze all incoming tweets, emails, article saves and other social media and communication events for sentiment and key entities using Google’s new Natural Language Processing API. To do this I pass all incoming records of these types through a filter definition in the Fluentd config file, like this…

<filter reformed.socialmedia>
type script
path /home/mark/gcp_nlp/nlp_process.rb
</filter>

…with that script then using the Ruby Cloud Natural Language Client to calculate the sentiment score and magnitude of that sentiment, extract the entities from the incoming text body and then pass these as new Fluentd record keys back to the output plugin to write as additional columns into the BigQuery table I used to store all these communication events.

def configure(conf)
super
require "google/cloud/language"
end
def start
super
end
def shutdown
super
end
# filter is called by Fluentd for each incoming record; we enrich the record
# with sentiment and entity fields before it's passed on to the output plugin
def filter (tag, time, record)
project_id = "xxxxxxxx"
language = Google::Cloud::Language.new project: project_id
text = record["post_body"]
document = language.document text
sentiment = document.sentiment
entities = document.entities
record["sentimentscore"] = "#{sentiment.score}"
record["sentimentmagnitude"] = "#{sentiment.magnitude}"
# concatenate the detected entity names into a single space-separated string
entitylist = ""
entities.each do |entity|
entitylist = entitylist + "#{entity.name} "
end
record["entityname"] = "#{entitylist.strip}"
record
end

Fluentd can also collect the device and sensor events from my SmartThings hub, using a Groovy “SmartApp” running in Samsung’s SmartThings backend cloud service. All that’s left to do is join the incoming hub events to a lookup table that classifies the IoT events as sensors, switches, colour settings and other event types and groups them by room, and you’ve got a useful set of home environment and activity data to join up with the other dataset.
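
The classification join itself is straightforward; the sketch below is illustrative only, as the device_lookup table name and its room and device_type columns are placeholders for the real lookup table rather than anything that appears elsewhere in this post.

SELECT
l.room,
l.device_type,
r.device,
r.metric,
COUNT(*) AS event_count
FROM `aerial-vehicle-148023.personal_metrics.smartthings_readings` r
LEFT JOIN `aerial-vehicle-148023.personal_metrics.device_lookup` l -- placeholder lookup table
ON r.device = l.device
GROUP BY 1, 2, 3, 4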

Other services such as Moves have APIs that I connect to every few hours using clients such as this one, also on GitHub, that run on the same Google Compute Engine VM as the Fluentd log collector, with new movement data copied across to a Google Cloud Storage bucket and inserted every few hours into another BigQuery table ready for analysis.

I’ve also been collecting weight and sleep readings, but these can be sporadic: weight is collected via a Withings WiFi scale but often only every few days, whilst sleep time in minutes should be recorded every day but sometimes I forget to wear my health band. So I used BigQuery SQL’s analytic windowing functions to calculate my average weight over seven days (a good approach anyway, as your weight can fluctuate day-to-day), and my average sleep over three days.

SELECT
date_time,
AVG(weightKg) OVER(ORDER BY ts
RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW) AS weightKg_avg_7_days,
AVG(FatMassKg) OVER(ORDER BY ts
RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW) AS FatMassKg_avg_7_days
FROM (
SELECT TIMESTAMP_TRUNC(d.date_time,DAY) as date_time, avg(w.weightKg) as weightKg, avg(w.FatMassKg) as FatMassKg, UNIX_SECONDS(TIMESTAMP_TRUNC(d.date_time,DAY)) AS ts
FROM `aerial-vehicle-148023.personal_metrics.date_dim` d
LEFT OUTER JOIN `xxxx.personal_metrics.fluentd_weighings` w
ON d.date_time = TIMESTAMP_TRUNC(w.date_time,DAY)
GROUP BY d.date_time, ts
)
order by date_time asc
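
The three-day rolling average for sleep follows exactly the same pattern, just with a 259200-second (three-day) window; the fluentd_sleep table name and its sleep_mins column below are placeholders, as I haven’t shown that table’s definition in this post.

SELECT
date_time,
AVG(sleep_mins) OVER(ORDER BY ts
RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW) AS sleep_mins_avg_3_days
FROM (
SELECT TIMESTAMP_TRUNC(d.date_time,DAY) as date_time, avg(s.sleep_mins) as sleep_mins, UNIX_SECONDS(TIMESTAMP_TRUNC(d.date_time,DAY)) AS ts
FROM `aerial-vehicle-148023.personal_metrics.date_dim` d
LEFT OUTER JOIN `xxxx.personal_metrics.fluentd_sleep` s
ON d.date_time = TIMESTAMP_TRUNC(s.date_time,DAY)
GROUP BY d.date_time, ts
)
order by date_time asc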

These two views are then left-outer joined to the main health daily view so that I’ve got weight and sleep readings every day alongside step count, calories burned and other metrics from my Jawbone UP health band.
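
A rough sketch of that join looks like the query below; the view names are placeholders (only v_health_stats appears later in the LookML), but the rolling-average column names match the ones defined above.

SELECT
h.date_time,
h.steps,
h.active_mins,
h.distance_km,
w.weightKg_avg_7_days,
s.sleep_mins_avg_3_days
FROM personal_metrics.v_health_daily h -- placeholder view names
LEFT OUTER JOIN personal_metrics.v_weight_7_day_avg w
ON h.date_time = w.date_time
LEFT OUTER JOIN personal_metrics.v_sleep_3_day_avg s
ON h.date_time = s.date_time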

So where this leaves me now is at the point where I’ve got a pretty interesting dataset running in a Google BigQuery cloud-hosted column-store data warehouse, with the data updated in real-time and enriched and transformed ready for analysis.

For now though the thing that’s made this all really interesting is joining it all up and then analyzing it as one big dataset using Looker, another tool I talked about recently on this blog and one which re-introduces an old concept familiar to BI veterans, the “semantic model”, to this new world of distributed, cloud-based big data analytics.


Looker the company and Looker the BI tool are both attracting a lot of attention right now after Google’s recent investment and the general buzz around big data analytics running on Google and Amazon’s cloud platforms. I’ve been working with Looker for around six months now as the preferred BI tool for the platform I’m working on, and Looker’s philosophy towards analytics, together with the parallels I can see between the enterprise semantic models I used to create with tools such as Oracle BI and Looker’s updated take on that idea, led me to sign up as a developer partner with Looker in order to understand their platform and technology even more closely. It’s the same thought process I went through just over ten years ago when Oracle acquired Siebel and introduced me to Siebel Analytics, which became the focus of my career and the subject of two books as what in time became Oracle Business Intelligence.

To be clear, and with the benefit of having worked with BI tools for almost 20 years, Looker the BI tool is still very primitive in many ways — in terms of data visualization options and general UI and usability it’s, frankly, pretty basic; its use of terminology such as “looks” (for reports), “explores” (for subject areas) and “views” (for tables) is quirky; and many activities we’re used to being simple and intuitive in existing BI tools, such as creating and saving catalogs of reports, are non-obvious to end-users or just plain missing … but crucially, all of these missing features and UI quirks can be iterated on in a SaaS product, and instead Looker have got two things right:

  • Their SQL query generation engine was designed from day one to work with distributed query engines like BigQuery
  • They’ve taken the concepts of a semantic model, data model abstraction and combining enterprise datasets, but crucially done it in a way that software engineers and data analysts working in their target market can relate to

Looker was designed from the start to query the underlying data source directly rather than create a mid-tier cache and query that instead of the underlying data source; tools like Tableau and Qlikview did well for many years running their user queries off a local cache on the user’s desktop, giving those users much faster response times than they were used to with old enterprise BI tools that tried to speed up queries by creating layers of aggregate tables, but this approach just isn’t practical with massive, constantly-updated big data sets.

Looker also generates SQL queries aware of the way platforms like BigQuery charge for their use — by the amount of data queried rather than by server processors or named users. BigQuery is a column-store database that brings back data only for the columns you ask for, but tools like Tableau routinely request all columns for a table, making them impractical to use with these types of data stores; Looker also aggressively eliminates joins from queries and supports the nested column approach preferable for distributed data stores like BigQuery, where joins are much more expensive to process than we’re used to with relational databases.

Finally, Looker has introduced a concept very familiar to traditional BI developers but new to most developers working on big data analytics projects: the semantic model, metadata layers and data abstraction; importantly though, the approach they’ve taken in LookML is designed to appeal to the actual developers building out these types of systems today — software engineers working on web and big data projects who write code, use modern development lifecycles and version everything in a distributed version control system such as Git.

So, to take the dataset I put together in BigQuery, the first step in the development process with Looker is to create what are termed “views” for each of the underlying BigQuery table or view sources; in the example below you can see an (abbreviated; there are more measure definitions in the full view definition file) definition of the “health stats” LookML view. Things to note in this modelling approach are that (1) it’s all code, there’s no GUI for developing this graphically, which suits software engineers just fine as it’s easier from day one to code, version and diff (compare) against other code releases; (2) the dimension_group we define at the start takes care of extracting all the different time periods out of a BigQuery timestamp datatype; and (3) the measure type definitions of sum, average and so on give Looker more flexibility in aggregating (and symmetrically aggregating, what we used to call fan-trap avoidance in the old days) measure values.

view: v_health_stats {
sql_table_name: rittman_dw.v_health_stats ;;
dimension_group: date_day {
type: time
hidden: yes
timeframes: [
raw,
time,
date,
week,
month,
quarter,
year
]
sql: ${TABLE}.date_day ;;
}
dimension: pk {
type: string
hidden: yes
primary_key: yes
sql: ${TABLE}.pk ;;
}
measure: total_active_mins {
type: sum
sql: ${TABLE}.active_mins ;;
value_format_name: decimal_2
}
measure: avg_sleep_mins_avg_3_days {
type: average
label: "Avg Sleep in Mins"
sql: ${TABLE}.sleep_mins_avg_3_days ;;
value_format_name: decimal_2
}
measure: avg_weightKg_avg_7_days {
type: average
label: "Avg WeightKg"
sql: ${TABLE}.weightKg_avg_7_days ;;
value_format_name: decimal_2
}
measure: total_distance_km {
type: sum
sql: ${TABLE}.distance_km ;;
value_format_name: decimal_2
}
}

Once you’ve defined all your views, the next step is to define your model, made up of what are termed “explores” — think of models as the equivalent of a universe in Business Objects or a repository in Oracle Business Intelligence, and explores as subject areas made up of sets of views linked together on common columns.

connection: "rittman_bigquery"
# include all the views
include: "*.view"
# include all the dashboards
include: "*.dashboard"
label: "Rittman DW"
explore: fluentd_log {
label: "Rittman BigQuery DW"
view_label: "1 - Home Metrics"
join: v_communications {
type: left_outer
view_label: "4 - Comms & Social Media Metrics"
sql_on: ${fluentd_log.date_date} = ${v_communications.date_date} ;;
relationship: many_to_one
fields: [v_communications.post_author,v_communications.post_body...]
}
join: v_health_stats {
type: left_outer
view_label: "2 - Health Metrics"
sql_on: ${fluentd_log.date_date} = ${v_health_stats.date_date} ;;
relationship: many_to_one
fields: [v_health_stats.total_steps, v_health_stats.total_distance_km...]
}
join: v_car_trips {
type: left_outer
view_label: "6 - Car Journey Metrics"
sql_on: ${v_car_trips.date_date} = ${fluentd_log.date_date} ;;
relationship: many_to_one
fields: [v_car_trips.fuel_econ_kpl...n]
}
}

Truth be told, the net effect of all this isn’t conceptually much different to the enterprise semantic layers I’ve built for Oracle Business Intelligence and other similar systems over the years, and LookML is in many ways really just today’s implementation of what nQuire did back in the late ’90s with Logical SQL and their logical and physical abstraction layers over physical data sources; but again, crucially, Looker’s LookML metadata layer was designed primarily for use through scripting and modern software development practices…

and aligns perfectly with the development lifecycle and tooling used by the types of millennial developers who wouldn’t be seen dead working with the sort of development lifecycle I used to present about a few years ago for an earlier generation of BI tools.


What this gives me once all the development work is complete and data is ready to be analyzed in looks and dashboards is something that looks like the screenshot below; note the semantic model on the left-hand side with subject areas corresponding to the BigQuery views but with further grouping, friendly labelling and organizing into measures and dimension hierarchies and attributes.

As Looker queries the underlying BigQuery data store directly rather than an aggregated subset of that data, I can query right down to the lowest-level detail events in the BigQuery dataset, with new events streaming in all the time. Using Looker’s set of tabular, chart and mapping visualization options I can map out where I’ve been over a period of time and what I did, and correlate that with other data in the dataset, with each view within the Looker explore joined together by date so it becomes one big dataset for analysis, as I tweeted about one evening last week when it all came together.

[embed]https://twitter.com/markrittman/status/857377608148897792[/embed]

Now it’s all running there’s more I’d like to do with the various Google Cloud Platform APIs, including using their new Cloud Vision API to classify and make searchable all the family photos I’ve collected over the years, and there are lots of Looker features and capabilities I’ve not yet explored in detail, including Looker Actions, a concept very familiar from my days working with Oracle BI and its Action Framework. For now though, I thought it would be interesting to share my thoughts on this new platform and the development process from the perspective of a long-term traditional BI and data warehousing developer and architect; feel free to add comments or ask questions if you’re thinking of starting out with BigQuery, Looker or Fluentd in the future.