Druid, Imply and Looker 5 bring OLAP Analysis to BigQuery’s Data Warehouse

Back in the good old days of on-premise data warehousing projects where data was structured, the software came on CDs and data engineering was called ETL development (and was considered the most boring job on a project, how times have changed) a typical technical architecture had sources on the left-hand side, a relational database hosting the data warehouse in the middle, and a multi-dimensional OLAP (or “MOLAP”) server on the right-hand side serving up data — fast — to specialist query and reporting tools.

Diagram courtesy of Readings in Database Systems, 3rd Edition, Stonebraker & Hellerstein, eds, 1996.

OLAP (Online Analytical Processing) Servers were a great complement to the relational database servers that hosted data warehouses back in those days by taking a subset of the whole dataset, structuring it into a dimensional model then storing it as a indexed arrays of leaf-level and pre-computed aggregates, served up results with split-second response times for any combination of dimension filters or level of aggregation requested … and users loved them.

The first ten years of my consulting career were spent working with OLAP Servers and if you’re interested in their history I’d recommend you check out Episode 37 of the Drill to Detail Podcast where I talked about arguably the last pure MOLAP Server still actively sold and implemented, Oracle’s Essbase Server, with Cameron Lackpour; Episode 34 of the Drill to Detail Podcast with Donald Farmer, the original Product Manager behind Microsoft Analysis Services and Episode 11 of the Drill to Detail Podcast with Graham Spicer, my original mentor and the single person most responsible for the nearly twenty-year career I’ve had since then in consulting, data warehousing and now product management, on Oracle’s original Express Server and then OLAP Option technologies.

But just like mainframes that reached perfection just at the time when PCs and mini-computers made them obsolete, OLAP Servers fell out of favour as data volumes and data types exploded whilst time available for loading them via batch processes just disappeared. On-premise data warehouses eventually transformed into elastic, cloud-hosted data warehouse platforms provided as fully-managed services such as Google BigQuery, Snowflake DB and Oracle Autonomous Data Warehouse Cloud Service and were accompanied by a new generation of BI tools like Looker, aimed at data-driven tech startups needing to analyze and understand vast amounts of consumer activity and other event-level behavioural data, as I talked about in my session at the recent Looker Join 2017 conference in San Francisco on Qubit, BigQuery and Looker for petabyte-scale analytics.

But BigQuery is, at the end of the day, a query and compute engine optimized for data warehouse-style queries and workloads albeit at a scale unimaginable ten years ago; Druid, an open-source project first announced in a white paper back in 2014 and now arguably the standard for new-world distributed data stores optimized this time for sub-second response times, may be the OLAP Server to BigQuery’s data warehouse.


To be clear, BigQuery and other distributed query engines like it are fast, particularly when filtering, sorting and aggregating single wide tables of columnar-organized data as you can see in the video below where I query and aggregate around four-and-a-half million smart device events to find out the average monthly temperature in each of the rooms in my house.

[embed]https://youtu.be/kd2403Pe4f4[/embed]

BigQuery supports joins between large tables, uses ANSI-standard SQL and more recently has benefited from a number of improvements to improve the response time for small queries as well as large ones, but compared to OLAP servers that typically pre-compute in-advance all the different aggregations and store data indexed and organized by the dimensions that users filter results by, it’s definitely a general-purpose database engine rather than a single-purpose OLAP server, and all query aggregations have to be computed on-the-fly.

Druid, originally authored by Eric Tschetter and Fangjin Yang at Metamarkets in 2011 and described in-detail in this white paper from 2014 explicitly re-implements key features of old-school OLAP servers by pre-aggregating incoming real-time streaming and batch data and storing it in a more compressed form, organizes that compressed data as time-based segments bitmap-indexed by dimensions and then presents data out as OLAP cubes to client applications.

Image courtesy of “What is Druid”, image downloaded in Oct. 2017

Druid has some significant limitations compared to more general-purpose analytic database engines such as BigQuery; it doesn’t support table joins right now (though it may do at the time you read this, as an open-source project it evolves rapidly), its primary client interface is JSON over HTTP, and most importantly for organizations that moved to BigQuery because it runs as infrastructure-as-a-service you have to take care of server upgrades, capacity scaling and all the other infrastructure management tasks that we thought we’d said goodbye to with data warehouse-as-a-service platforms.

But companies offering services and management tools to manage Druid as just another platform service are starting to emerge and courtesy of the Apache Calcite project it’s now possible to query Druid using regular SQL queries, a capability Looker recently took advantage of to offer Druid connectivity as one of the new features in their recent Looker 5 release, as you can see me demonstrating in the video below.

[embed]https://www.youtube.com/watch?v=IQ1Ce65HaHI[/embed]

But just as old-school OLAP servers worked best with query tools specifically designed to work with them, new open-source BI tools such as Superset (from the same engineering team at Airbnb that also brought us Airflow, my current ETL orchestration tool of choice) connect directly to Druid clusters and come close to their commercial rivals in terms of reporting and dashboard features offered to end users; in the video below you can see me creating a percentage-changed line-graph showing how the amount of time I spend cycling each month changed over time, using the same Druid datasource as in the other videos.

[embed]https://youtu.be/3uU10q7mPBo[/embed]

Superset, Looker and other BI tools that now support Druid are of course great but the one that’s really got my interest, and prompted me to look further into Druid and how it complements BigQuery and other data warehouse cloud platform cloud services is Imply, a startup launched by one of the original co-authors of Druid who, not unlike Looker who reinvented the enterprise BI platform for the big data and startup world, are reintroducing that same world to OLAP analysis whilst making the Druid back-end much easier to manage.


Imply runs either on-premise as open-source software you can download and then install on local or cloud-hosted VMs, or run as platform-as-a-service through a contract with Imply. Druid is one of the more complex and involved analytic database types to load but Imply’s cloud service makes it simple to spin-up, manage and then ingest data into your Druid cluster either as real-time streaming sources, or via batch loads from Amazon S3 or other popular datasources.

Images courtesy of Imply Cloud Quickstart docs page

I’ve got Imply running on my own Google Compute Engine infrastructure-as-a-service platform so take care of server management and data ingestion manually, but for me the standout feature in Imply’s platform is Pivot, their open-source OLAP query tool. If any reader is old enough to remember OLAP client tools such as Oracle Express Sales Analyzer and ProClarity you’ll recognize Pivot’s use of terms such as cubes, dimensions, aggregation types and measures as shown in the screenshot of my setup below…

… but more importantly, you’ll recognise the structured query environment and lightning-fast response to my queries against that same set of four-and-a-half million IoT and other events that I extracted from my BigQuery environment and then loaded and stored in compressed column-stored segments pre-aggregated and indexed by the same dimension fields I’m now analysing it by.

[embed]https://www.youtube.com/watch?v=Dj38w2nhNyI[/embed]

Well they say nothing’s new in fashion or music if you wait long enough, and sure enough as I said in my tweet a couple of years ago…

[embed]https://twitter.com/markrittman/status/638311769556025344[/embed]

… yes it does make me feel old, but it’s great to see such a powerful concept as multidimensional OLAP storage and dimensional models being rediscovered by the big data and startup worlds.

Previous
Previous

Google Cloud Dataprep: Spreadsheet-Style Data Wrangling Powered by Google Cloud Dataflow

Next
Next

Using Google BigQuery, Google Cloud Natural Language API and Looker to Work-Out Exactly How Much…