Posts tagged Olap
Druid, Imply and Looker 5 bring OLAP Analysis to BigQuery’s Data Warehouse

Back in the good old days of on-premise data warehousing projects, when data was structured, the software came on CDs and data engineering was called ETL development (and was considered the most boring job on a project; how times have changed), a typical technical architecture had sources on the left-hand side, a relational database hosting the data warehouse in the middle, and a multi-dimensional OLAP (or “MOLAP”) server on the right-hand side serving up data — fast — to specialist query and reporting tools.

Diagram courtesy of Readings in Database Systems, 3rd Edition, Stonebraker & Hellerstein, eds, 1996.

OLAP (Online Analytical Processing) Servers were a great complement to the relational database servers that hosted data warehouses back in those days: they took a subset of the whole dataset, structured it into a dimensional model and stored it as indexed arrays of leaf-level values and pre-computed aggregates, then served up results with split-second response times for any combination of dimension filters or level of aggregation requested … and users loved them.

The first ten years of my consulting career were spent working with OLAP Servers, and if you’re interested in their history I’d recommend you check out Episode 37 of the Drill to Detail Podcast, where I talked with Cameron Lackpour about arguably the last pure MOLAP Server still actively sold and implemented, Oracle’s Essbase Server; Episode 34 of the Drill to Detail Podcast with Donald Farmer, the original Product Manager behind Microsoft Analysis Services; and Episode 11 of the Drill to Detail Podcast, on Oracle’s original Express Server and then OLAP Option technologies, with Graham Spicer, my original mentor and the single person most responsible for the nearly twenty-year career I’ve had since then in consulting, data warehousing and now product management.

But just like mainframes, which reached perfection at exactly the point when PCs and mini-computers made them obsolete, OLAP Servers fell out of favour as data volumes and data types exploded and the time available for loading them via batch processes disappeared. On-premise data warehouses eventually transformed into elastic, cloud-hosted data warehouse platforms provided as fully-managed services such as Google BigQuery, Snowflake DB and Oracle Autonomous Data Warehouse Cloud Service, accompanied by a new generation of BI tools like Looker aimed at data-driven tech startups needing to analyze and understand vast amounts of consumer activity and other event-level behavioural data, as I talked about in my session on Qubit, BigQuery and Looker for petabyte-scale analytics at the recent Looker Join 2017 conference in San Francisco.

But BigQuery is, at the end of the day, a query and compute engine optimized for data warehouse-style queries and workloads, albeit at a scale unimaginable ten years ago; Druid, an open-source project first described in detail in a white paper back in 2014 and now arguably the standard among new-world distributed data stores optimized, this time, for sub-second response times, may be the OLAP Server to BigQuery’s data warehouse.


To be clear, BigQuery and other distributed query engines like it are fast, particularly when filtering, sorting and aggregating single wide tables of columnar-organized data as you can see in the video below where I query and aggregate around four-and-a-half million smart device events to find out the average monthly temperature in each of the rooms in my house.

[embed]https://youtu.be/kd2403Pe4f4[/embed]
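That sort of query boils down to a single scan-and-aggregate over one wide, denormalized table. Here’s a minimal sketch of it in BigQuery Standard SQL, with hypothetical project, dataset and column names standing in for my actual schema:

-- Average monthly temperature per room from a single wide events table;
-- all names here are hypothetical stand-ins, not the schema from the video
SELECT room,
       FORMAT_TIMESTAMP('%Y-%m', reading_time) AS month,
       AVG(reading_value) AS avg_temperature
FROM   `my-project.iot.device_events`
WHERE  metric = 'temperature'
GROUP BY room, month
ORDER BY month, room;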

BigQuery supports joins between large tables, uses ANSI-standard SQL and has more recently benefited from a number of improvements to response times for small queries as well as large ones. But compared to OLAP servers, which typically pre-compute all the different aggregations in advance and store data indexed and organized by the dimensions users filter on, it’s definitely a general-purpose database engine rather than a single-purpose OLAP server, and all query aggregations have to be computed on the fly.

Druid, originally authored by Eric Tschetter and Fangjin Yang at Metamarkets in 2011 and described in detail in this white paper from 2014, explicitly re-implements key features of old-school OLAP servers: it pre-aggregates incoming real-time streaming and batch data and stores it in a more compressed form, organizes that compressed data as time-based segments bitmap-indexed by dimension values, and then presents the data to client applications as OLAP cubes.

Image courtesy of “What is Druid”, downloaded Oct. 2017

Druid has some significant limitations compared to more general-purpose analytic database engines such as BigQuery: it doesn’t support table joins right now (though it may do by the time you read this; as an open-source project it evolves rapidly), its primary client interface is JSON over HTTP and, most importantly for organizations that moved to BigQuery precisely because it’s a fully-managed service, you have to take care of server upgrades, capacity scaling and all the other infrastructure management tasks we thought we’d said goodbye to with data warehouse-as-a-service platforms.

But companies offering services and management tools that run Druid as just another platform service are starting to emerge, and courtesy of the Apache Calcite project it’s now possible to query Druid using regular SQL, a capability Looker recently took advantage of to offer Druid connectivity as one of the new features in their Looker 5 release, as you can see me demonstrating in the video below.

[embed]https://www.youtube.com/watch?v=IQ1Ce65HaHI[/embed]
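Under the covers that connectivity goes through Druid’s Calcite-based SQL layer, which translates SQL into Druid’s native JSON queries. A minimal hand-written sketch of the same kind of query as before, assuming a hypothetical iot_device_events datasource (the datasource and column names are mine, not from the video):

-- Druid SQL via the Apache Calcite adapter; datasource and column names
-- are hypothetical. __time is Druid's built-in timestamp column
SELECT FLOOR(__time TO MONTH) AS month_start,
       room,
       AVG("value") AS avg_temperature
FROM   iot_device_events
WHERE  metric = 'temperature'
GROUP BY FLOOR(__time TO MONTH), room
ORDER BY month_start, room;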

But just as old-school OLAP servers worked best with query tools specifically designed for them, new open-source BI tools such as Superset (from the same engineering team at Airbnb that also brought us Airflow, my current ETL orchestration tool of choice) connect directly to Druid clusters and come close to their commercial rivals in terms of the reporting and dashboard features they offer end users; in the video below you can see me creating a percentage-change line graph showing how the amount of time I spend cycling each month has changed over time, using the same Druid datasource as in the other videos.

[embed]https://youtu.be/3uU10q7mPBo[/embed]

Superset, Looker and the other BI tools that now support Druid are of course great, but the one that’s really got my interest, and prompted me to look further into Druid and how it complements BigQuery and other cloud data warehouse platforms, is Imply, a startup launched by one of the original co-authors of Druid which, not unlike Looker reinventing the enterprise BI platform for the big data and startup world, is reintroducing that same world to OLAP analysis whilst making the Druid back-end much easier to manage.


Imply runs either on-premise, as open-source software you can download and install on local or cloud-hosted VMs, or as a platform-as-a-service through a contract with Imply. Druid is one of the more complex and involved analytic databases to load, but Imply’s cloud service makes it simple to spin up and manage a Druid cluster and then ingest data into it, either from real-time streaming sources or via batch loads from Amazon S3 and other popular datasources.

Images courtesy of Imply Cloud Quickstart docs page

I’ve got Imply running on my own Google Compute Engine infrastructure-as-a-service platform, so I take care of server management and data ingestion manually, but for me the standout feature in Imply’s platform is Pivot, their open-source OLAP query tool. If you’re old enough to remember OLAP client tools such as Oracle Express Sales Analyzer and ProClarity, you’ll recognise Pivot’s use of terms such as cubes, dimensions, aggregation types and measures, as shown in the screenshot of my setup below…

… but more importantly, you’ll recognise the structured query environment and the lightning-fast response to my queries against that same set of four-and-a-half million IoT and other events, extracted from my BigQuery environment and then loaded into compressed, column-stored Druid segments pre-aggregated and indexed by the same dimension fields I’m now analysing them by.

[embed]https://www.youtube.com/watch?v=Dj38w2nhNyI[/embed]

Well, they say nothing’s new in fashion or music if you wait long enough, and sure enough, as I said in my tweet a couple of years ago…

[embed]https://twitter.com/markrittman/status/638311769556025344[/embed]

… yes it does make me feel old, but it’s great to see such a powerful concept as multidimensional OLAP storage and dimensional models being rediscovered by the big data and startup worlds.

Analytic Views, Oracle Database 12.2

An interesting new feature called Analytic Views came out with the 12.2 release of Oracle Database Enterprise Edition that looks, at first sight, to be an extension of the existing dimensions and materialized views features the Oracle database has had since the 8i release. For anyone who remembers those features being added to the Oracle Database back in the late 90s (and if you don’t, here’s a youthful-looking blogger writing about that topic on a soon-to-be-rather-controversial DBA website), the CREATE DIMENSION statement allowed you to define an aggregation rule for use by the database’s query optimizer, like this:

CREATE DIMENSION products
  LEVEL product IS (products.prod_id, products.prod_desc, products.prod_list_price,
                    products.prod_min_price, products.prod_name, products.prod_pack_size,
                    products.prod_status, products.supplier_id, products.prod_unit_of_measure,
                    products.prod_weight_class)
  LEVEL subcategory IS (products.prod_subcategory, products.prod_subcat_desc)
  LEVEL category IS (products.prod_category, products.prod_cat_desc)
  HIERARCHY prod_hier (product CHILD OF subcategory CHILD OF category)
  ATTRIBUTE product DETERMINES (products.prod_category, products.prod_id)
  ATTRIBUTE subcategory DETERMINES (products.prod_category, products.prod_subcategory)
  ATTRIBUTE category DETERMINES products.prod_category;

These aggregation rules could thereafter be used by the optimizer to work out whether to rewrite a user’s query to answer it from a pre-computed aggregate, instead of rolling up from the detail level specified in the query.
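To make that rewrite concrete, here’s a minimal sketch using Oracle’s SH sample schema names (the materialized view itself is hypothetical). With the products dimension declared above, the optimizer knows subcategory determines category, so it can answer the category-level query by rolling up the subcategory-level materialized view:

-- A hypothetical materialized view pre-aggregated to subcategory level
CREATE MATERIALIZED VIEW sales_by_subcat_mv
  ENABLE QUERY REWRITE AS
SELECT p.prod_subcategory, SUM(s.amount_sold) AS amount_sold
FROM   sales s JOIN products p ON s.prod_id = p.prod_id
GROUP BY p.prod_subcategory;

-- The optimizer can rewrite this category-level query to roll up the MV,
-- joining back to products to map subcategories to their categories
SELECT p.prod_category, SUM(s.amount_sold)
FROM   sales s JOIN products p ON s.prod_id = p.prod_id
GROUP BY p.prod_category;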

Since then, subsequent releases of the Oracle Database introduced a separate but related feature, the OLAP Option, which used technology originating from the old Express Server to create what were called Analytic Workspaces: fully-featured multi-dimensional OLAP caches, together with their own programming language and a suite of applications for sales analysis and budgeting/planning. Over the years Oracle introduced features to integrate analytic workspaces with the SQL query language used by the core database, so that developers no longer had to learn and use OLAP DML, Oracle OLAP’s proprietary programming language. And the dimensional model provided by the OLAP Option and all OLAP servers is just as relevant today as it was back in the 1970s when that technology first became available; until recently, the other innovation provided by dedicated multidimensional OLAP servers, ultra-fast queries made possible by a separate pre-computed and highly-indexed mid-tier data cache, was so valuable to users that the overhead of maintaining those caches, and the query latency introduced by the time it took to refresh and recompute the aggregates, made sense.

Until recently, that is, when the hardware hosting database servers suddenly became able to hold the entire database in memory so that disk I/O was no longer a problem, and fast CPUs with multiple cores and large amounts of on-board cache RAM meant pre-computing calculated measures was no longer needed. So along came Oracle Database 12c and the In-Memory Option which, until the incident with the WiFi kettle, was my claim to fame as the person Larry Ellison quoted on stage at Oracle OpenWorld when I said it was so easy to use it was “almost boring”.

Increasingly, it no longer made sense to define and then maintain a separate multi-dimensional OLAP cache in a sort-of-embedded server like Oracle OLAP when the so-easy-to-use-it’s-boring Database In-Memory Option, along with materialized views and bitmap indexes, was just as fast at returning query results to users … especially when the In-Memory Option also came with another feature, vector transformation plans, which enabled a special type of in-memory aggregation using the vector processing features of modern CPUs and had been developed, rather interestingly, by the OLAP Option development team headed up by none other than … Bud Endress, the long-term development lead for Oracle OLAP and, before that, Express Server.
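You can see when the optimizer chooses one of these plans by looking for VECTOR GROUP BY and KEY VECTOR operations in the execution plan. A minimal sketch, assuming the SH sample schema with its tables populated into the In-Memory column store:

-- Ask the optimizer for its plan for a typical star-style aggregation
EXPLAIN PLAN FOR
SELECT p.prod_category, SUM(s.amount_sold)
FROM   sales s JOIN products p ON s.prod_id = p.prod_id
GROUP BY p.prod_category;

-- If in-memory aggregation kicks in, the plan output should include
-- KEY VECTOR CREATE/USE and VECTOR GROUP BY steps
SELECT * FROM TABLE(dbms_xplan.display);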

So what’s going on here then? Well, the addition of special aggregation capabilities to the in-memory database feature by the old OLAP Server team clearly signalled that this was where Oracle intended to take the technology legacy of their standalone, and then partly-integrated, multidimensional OLAP server cache. But then the second shoe dropped back at Oracle OpenWorld two years ago, when Analytic Views were first launched in a presentation by … Bud Endress.

[embed]https://twitter.com/markrittman/status/659776557523869696[/embed]

Analytic Views then came along with the 12.2 release of the Oracle Database and were fairly obviously the other half of the puzzle of where Oracle planned to take the people, ideas and inspiration that came out of Oracle OLAP, along with its twin core user benefits: a dimensional model, and fast response to user queries involving aggregation and slice-and-dice of structured datasets. Not only that, they also included a new take on the simplified SQL query format that Oracle OLAP introduced a few years ago, gave users the ability to build out a complete Oracle BI-style business model right over the data in the database … and, as if that wasn’t enough surprises, added support (via third-party Simba ODBC drivers) for MDX.


I took the IoT, wearables and social media data I’ve been collecting and storing up until now in Google BigQuery and transferred it across to an Oracle Database 12.2 database with Analytic Views and the In-Memory Option enabled. Using Oracle SQL Developer 4.2, with its new Analytic Views functionality now included, you can see there’s a new dedicated section for Analytic Views in the main interface, not unlike the old Analytic Workspace Manager standalone tool from the Oracle 10g/11g days.

Creating a new analytic view takes one of two routes in SQL Developer 4.2 (or you could of course script it all using DDL statements). You can create a quick-and-dirty analytic view over a single table using the Quick Analytic View menu option, as per the screenshot below, which introspects your data to determine candidate measure and dimension columns; this doesn’t seem to handle, though, the sorts of wide, denormalized tables with multiple levels per dimension that are common in distributed query engines like Google BigQuery, which prefer those over joins between lots of separate tables.

Or, as I did for my testing, you can normalize your fact and dimension data into separate tables joined via foreign key constraints, and then create in turn your attribute dimensions, hierarchies and analytic views containing measures and aggregation rules, as shown in the screenshots below.

So far this is very much in the same vein as creating relational dimensions and OLAP Option cube structures in previous database versions and developer tools, and you can view the commands SQL Developer will send to the database using one of the dialog tabs, giving you a set of DDL statements that look like this:

CREATE ATTRIBUTE DIMENSION DEVICE_DIMENSION
  DIMENSION TYPE STANDARD
  CAPTION 'DEVICE DIMENSION'
  DESCRIPTION 'DEVICE DIMENSION'
  USING DEVICE_DIM
  ATTRIBUTES (
    DEVICE AS DEVICE
      CAPTION 'DEVICE'
      DESCRIPTION 'DEVICE',
    DEVICETYPE AS DEVICETYPE
      CAPTION 'DEVICETYPE'
      DESCRIPTION 'DEVICETYPE',
    ROOM AS ROOM
      CAPTION 'ROOM'
      DESCRIPTION 'ROOM'
  )
  LEVEL DEVICE_LEVEL
    LEVEL TYPE STANDARD
    CAPTION 'DEVICE LEVEL'
    DESCRIPTION 'DEVICE LEVEL'
    KEY ("DEVICE")
    MEMBER NAME DEVICE
    MEMBER CAPTION DEVICE
    MEMBER DESCRIPTION DEVICE
    ORDER BY MIN DEVICE ASC NULLS FIRST
    DETERMINES ("DEVICE")
  LEVEL ROOM_LEVEL
    LEVEL TYPE STANDARD
    CAPTION 'ROOM LEVEL'
    DESCRIPTION 'ROOM LEVEL'
    KEY ("ROOM")
    MEMBER NAME ROOM
    MEMBER CAPTION ROOM
    MEMBER DESCRIPTION ROOM
    ORDER BY MIN ROOM ASC NULLS FIRST
    DETERMINES ("ROOM")
  LEVEL DEVICETYPE
    LEVEL TYPE STANDARD
    CAPTION 'Devicetype'
    DESCRIPTION 'Devicetype'
    KEY ("DEVICETYPE")
    MEMBER NAME TO_CHAR("DEVICETYPE")
    ORDER BY MIN DEVICETYPE ASC NULLS LAST
    DETERMINES ("DEVICETYPE")
  ALL
    MEMBER NAME 'ALL'
    MEMBER CAPTION 'ALL'
    MEMBER DESCRIPTION 'ALL';

CREATE ATTRIBUTE DIMENSION DATES_DIMENSION
  CAPTION 'DATES DIMENSION'
  DESCRIPTION 'DATES DIMENSION'
  USING DATES_DIM
  ATTRIBUTES (
    DATE_DATE AS DATE_DATE
      CAPTION 'DATE_DATE'
      DESCRIPTION 'DATE_DATE',
    DAY_NAME AS DAY_NAME
      CAPTION 'DAY_NAME'
      DESCRIPTION 'DAY_NAME',
    DAY_NUMBER AS DAY_NUMBER
      CAPTION 'DAY_NUMBER'
      DESCRIPTION 'DAY_NUMBER',
    YEAR AS YEAR
      CAPTION 'YEAR'
      DESCRIPTION 'YEAR',
    YEAR_DAY_OF_YEAR AS YEAR_DAY_OF_YEAR
      CAPTION 'YEAR_DAY_OF_YEAR'
      DESCRIPTION 'YEAR_DAY_OF_YEAR',
    YEAR_MM AS YEAR_MM
      CAPTION 'YEAR_MM'
      DESCRIPTION 'YEAR_MM'
  )
  LEVEL DATE_DATE
    CAPTION 'Date Date'
    DESCRIPTION 'Date Date'
    KEY ("DATE_DATE")
    MEMBER NAME TO_CHAR("DATE_DATE")
    MEMBER CAPTION TO_CHAR("DATE_DATE")
    MEMBER DESCRIPTION TO_CHAR("DATE_DATE")
    ORDER BY MIN DATE_DATE ASC NULLS FIRST
    DETERMINES ("DAY_NAME", "DAY_NUMBER", "YEAR_DAY_OF_YEAR")
  LEVEL YEAR
    CAPTION 'Year'
    DESCRIPTION 'Year'
    KEY ("YEAR")
    ALTERNATE KEY YEAR
    MEMBER NAME YEAR
    MEMBER CAPTION YEAR
    MEMBER DESCRIPTION YEAR
    ORDER BY MIN YEAR ASC NULLS FIRST
  LEVEL YEAR_MM
    CAPTION 'Year Mm'
    DESCRIPTION 'Year Mm'
    KEY ("YEAR_MM")
    MEMBER NAME YEAR_MM
    MEMBER CAPTION YEAR_MM
    MEMBER DESCRIPTION YEAR_MM
    ORDER BY MIN YEAR_MM ASC NULLS FIRST
  ALL
    MEMBER NAME 'ALL'
    MEMBER CAPTION 'ALL'
    MEMBER DESCRIPTION 'ALL';

CREATE HIERARCHY DATES_HIERARCHY
  CAPTION 'DATES_HIERARCHY'
  DESCRIPTION 'DATES_HIERARCHY'
  USING DATES_DIMENSION
  ("DATE_DATE" CHILD OF "YEAR_MM" CHILD OF "YEAR");

CREATE HIERARCHY DEVICE_TYPE_HIERARCHY
  CAPTION 'DEVICE_TYPE_HIERARCHY'
  DESCRIPTION 'DEVICE_TYPE_HIERARCHY'
  USING DEVICE_DIMENSION
  ("DEVICE_LEVEL" CHILD OF "DEVICETYPE");

CREATE HIERARCHY ROOM_HIERARCHY
  CAPTION 'ROOM_HIERARCHY'
  DESCRIPTION 'ROOM_HIERARCHY'
  USING DEVICE_DIMENSION
  ("DEVICE_LEVEL" CHILD OF "ROOM_LEVEL");

CREATE ANALYTIC VIEW IOT_DEVICES_AV
  CAPTION 'IOT_DEVICE_READINGS'
  DESCRIPTION 'IOT_DEVICE_READINGS'
  USING IOT_DEVICES
  DIMENSION BY (
    DEVICE_DIMENSION AS DEVICE_DIMENSION KEY ("DEVICE") REFERENCES ("DEVICE")
      HIERARCHIES (
        DEVICE_TYPE_HIERARCHY AS DEVICE_TYPE_HIERARCHY DEFAULT,
        ROOM_HIERARCHY AS ROOM_HIERARCHY
      ),
    DATES_DIMENSION AS DATES_DIMENSION KEY ("DATE_TIME") REFERENCES ("DATE_DATE")
      HIERARCHIES (
        DATES_HIERARCHY AS DATES_HIERARCHY DEFAULT
      )
  )
  MEASURES (
    VALUE FACT VALUE
      CAPTION 'VALUE'
      DESCRIPTION 'VALUE'
  )
  DEFAULT MEASURE VALUE
  DEFAULT AGGREGATE BY AVG;
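With those DDL statements run, querying the analytic view needs no joins or GROUP BY; the HIERARCHIES clause and level-name filters do the work instead. A minimal sketch against the IOT_DEVICES_AV view defined above, using the hierarchy and level names from that DDL, returns average readings by month and room:

-- VALUE aggregates using AVG, per the DEFAULT AGGREGATE BY clause above
SELECT dates_hierarchy.member_name AS month,
       room_hierarchy.member_name  AS room,
       value
FROM   iot_devices_av HIERARCHIES (dates_hierarchy, room_hierarchy)
WHERE  dates_hierarchy.level_name = 'YEAR_MM'
AND    room_hierarchy.level_name  = 'ROOM_LEVEL'
ORDER BY dates_hierarchy.hier_order, room_hierarchy.hier_order;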

But the bit that, to my mind, is really interesting in the Analytic View creation dialog is the Cache Measure Groups option in the left-hand menu of the SQL Developer screen.

This, and the vector transformation feature introduced with Oracle Database 12.1, are the key to how Analytic Views will return aggregated query results as fast as, if not faster than, a dedicated multidimensional OLAP cache layer. Optimisations in the SQL parser for analytic views on the one hand enable simplified SQL queries without the need for GROUP BY or join clauses, like this:

SELECT time_hier.member_name AS time,
       sales,
       sales_prior_period
FROM   sales_av HIERARCHIES (time_hier)
WHERE  time_hier.level_name IN ('YEAR', 'QUARTER')
ORDER BY time_hier.hier_order;

and on the other use those cache measure group settings to hard-wire the analytic view into any pre-created materialized views containing aggregations of your measures, side-stepping the usual query rewrite mechanism used by the database query optimizer and ensuring, along with vector transformations, that users get split-second response times whether querying at detail level or aggregated across one or more of the analytic view hierarchies. You can try this feature out for yourself using Oracle’s Live SQL demo environment here, and it’s important to understand what this feature, together with the integration of analytic views with the in-memory database option and its vector transformation feature, really means: OLAP aggregation rethought for today’s in-memory database servers.

So this, in my opinion, is where Oracle are headed in terms of meeting the needs of database customers who want fast, easy-to-maintain OLAP query functionality that provides the response times of a dedicated OLAP server, but does it by leveraging the in-memory capabilities of Oracle Database 12c, the people and ideas from the OLAP Option product, and the huge amounts of memory and compute capacity you get in modern servers. All very interesting, and interesting as well to see where the high-end traditional relational database vendors are taking OLAP analysis now that old-school OLAP servers have had their day but dimensional analysis and dimensional business models are still just as relevant.