… or is it more relevant than ever in the world of Bimodal IT, and Mode 1 + Mode 2 analytics?
You won’t be surprised to find that I think it’s the latter. As Gartner set out in their press releases and articles around modern BI platforms and Mode 1/Mode 2 analytics, the key now is for IT departments to enable users to perform their own analytics and explore data on their own: not by doing it for them, but by creating a platform, both on-premise and in the cloud, that makes this possible.
OBIEE12c, with its strong enterprise background and its support for data discovery and external data mashups, is the ideal platform for delivering Gartner’s vision of a bimodal modern BI platform. If you’re not at the SIG meeting tomorrow, my slides are on Slideshare here:
Over the past few months I’ve been speaking at a number of user group events in Europe and the Middle East on the future of BI, analytics and data integration on the Hadoop and NoSQL platform. The original idea behind the presentation was to show people how BI on Hadoop isn’t just slow Apache Hive queries, offloaded ETL work or doing the same old things but with less mature tools. I’ve updated the slides after each presentation and delivered it for the final time today in Budapest, at the Budapest Data Forum organised by my good friend Bence Arato.
The final version of the slides is on Slideshare, and you can view and download it using the widget below (bear in mind that I wrote most of the slides back in February 2016, when using Donald Trump to introduce Apache Spark made the audience laugh, rather than look at each other worried as it did today).
By the time I delivered this final iteration of the session there were so many “next-gen” SQL-on-Hadoop and BI-related technologies in the talk — Tez, Kudu, Spark, Apache Arrow, Apache Lens, Impala and RecordService, to pick just a few — that I thought it would be worth calling out what I think are the “Top 5” upcoming BI- and SQL-related technologies and projects, starting with my #5…
5. Cloudera Impala and Parquet Storage — Fast BI Queries for Hadoop
Most people using Hadoop for analytic workloads will be aware of Impala, but its importance can’t be overstated: it’s the easiest way to improve the response times of BI-type queries when a customer is currently using just Hive. Because Impala uses a daemon-style architecture similar to the MPP/shared-nothing analytic RDBMSs, moving to it from Hive for analytic workloads is easy, and most BI tool vendors support Impala as a data source…
…and coupled with the column-oriented Parquet file storage format, you get query performance close to, or better than, that of your old legacy data warehouse platform.
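To give a feel for how simple the switch is, here’s a hedged sketch in Impala SQL (the table and column names are hypothetical, not from any real project): an existing Hive-managed table is copied into Parquet format with a CTAS statement, and the BI-style aggregate then runs against the Parquet copy.

```sql
-- Hypothetical example: create a Parquet-backed copy of an existing table
-- and run a BI-style aggregation against it through Impala.
CREATE TABLE sales_parquet
STORED AS PARQUET
AS SELECT * FROM sales;

-- Make sure Impala's view of the metastore is current before querying.
INVALIDATE METADATA sales_parquet;

SELECT product_category,
       SUM(sale_amount) AS total_sales
FROM   sales_parquet
GROUP  BY product_category
ORDER  BY total_sales DESC;
```

Because Impala speaks HiveQL-compatible SQL over the same metastore, the BI tool on top usually only needs its connection repointed from the Hive to the Impala daemon port.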
4. Apache Lens — BI Middleware for ROLAP + Business Metadata
When you move from a desktop BI tool such as Tableau or Qlikview to an enterprise BI platform such as Oracle BI Enterprise Edition or SAP Business Objects, one of the main benefits to users is the shared business metadata model they provide, which abstracts the various different data sources into a more understandable conformed dimensional model.
Some provide additional middleware engines that turn this into a fully-fledged ROLAP (Relational OLAP) cube with a dimensionally-orientated query language that the middleware layer then translates into the specific SQL and MDX languages used by the data sources that map into it. Apache Lens, just out of incubator status and now a top-level Apache project, aims to do this for Hadoop.
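Lens exposes those conformed cubes through its own OLAP query language, which it then rewrites against whichever underlying engine can answer the query. As a rough, hedged sketch (the cube, measure and dimension names are hypothetical, and the exact syntax varies between Lens releases), a Lens cube query looks something like this:

```sql
-- Hypothetical Lens cube query: the user queries the logical cube, and Lens
-- rewrites it against whichever underlying Hive/HBase/JDBC storage fits best.
-- Measures are aggregated according to the cube definition, not hand-written SQL.
cube select store_region, sale_amount
from   sales_cube
where  time_range_in(order_time, '2016-06-01', '2016-06-07');
```

The point is that the user names a cube, measures and a time range; which physical table, partition or engine actually serves the query is the middleware’s decision, just as in an enterprise ROLAP layer.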
And again paralleling the product architectures of Oracle Business Intelligence, IBM Cognos and SAP Business Objects, Hadoop now has its own MOLAP (Multidimensional OLAP) server — Apache Kylin — and one of the now en-vogue “notebook”-style query tools, Apache Zeppelin, making the next-generation Hadoop BI product architecture look something like the diagram below.
3. Apache Arrow — Unified In-Memory Storage Format for Hadoop
Whereas Hadoop used to be synonymous with MapReduce and writing everything to disk, Hadoop 2.0, through YARN and in-memory processing frameworks such as Apache Spark, now does much of its work in-memory, joining languages and toolkits such as R and Python’s Pandas. But things often grind to a halt when you try to move data from one in-memory technology to another, because Hadoop serialises and then deserialises the data to convert it between each tool’s in-memory storage format.
Apache Arrow, more of a standard and set of back-end APIs than an end-user BI tool, aims to standardise these in-memory storage formats and take away this data exchange overhead — leading to far greater interoperability between Hadoop’s in-memory tools and languages.
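Arrow itself is a set of libraries and APIs rather than something you’d call from a BI tool, but the overhead it removes can be illustrated with a toy, stdlib-only Python sketch (this is deliberately not Arrow’s actual API):

```python
import array
import pickle

# A columnar batch of one million values, as one engine might hold it in memory.
column = array.array("d", range(1_000_000))

# Without a shared format, handing this column to another tool typically means
# serialising it to bytes and deserialising it again on the other side:
blob = pickle.dumps(column)       # copy 1: encode the data to a byte stream
received = pickle.loads(blob)     # copy 2: decode it back into new memory

# With a shared in-memory format (Arrow's goal), the consumer reads the
# producer's buffer directly: a zero-copy view, with no encode/decode step.
view = memoryview(column)         # no data is copied here
first_values = view[0:3].tolist()

print(first_values)               # [0.0, 1.0, 2.0]
```

Every tool that adopts the Arrow columnar format can, in effect, take the `memoryview` path instead of the pickle path when exchanging data with another Arrow-aware tool.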
2. Cloudera Kudu — Fast, Real-Time Storage Engine for BI on Hadoop
Up until now, if you needed to update, delete or even just insert single rows of data into an Apache Hive table, your best bet was to switch the table’s storage from HDFS files to HBase, a type of NoSQL database that supports random access to individual storage cells. Combined with Hive via the HBase Storage Handler, this gives developers a roundabout way to perform these actions in a Hadoop environment, which is useful when you need to continue loading SCD2-type dimension tables as part of a data warehouse offload project. But HBase isn’t designed for performing the sort of large-scale aggregations and data selections common in data warehouse projects … and Parquet storage, which is optimised for exactly those queries, doesn’t handle trickle-feed data updates because of its columnar format and the way it compresses data to better optimise it for queries.
Apache Kudu, coming out of Cloudera but now an incubating Apache project, aims to combine HBase’s insert, update and delete capabilities with the fast columnar storage format of Parquet. Combined with Impala, and with support for other query tools and processing frameworks coming, Kudu is the most ambitious attempt yet to go beyond simple file-based storage in Hadoop and replace it with an optimised storage layer designed specifically for analytic workloads, all of it still open-source and free to deploy on as many servers as you need, in contrast to the $1m or so that traditional RDBMS vendors charge for their high-end database servers.
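As a hedged sketch of what this enables through Impala (the table and column names are hypothetical, and the exact DDL has changed across Kudu’s early releases), a Kudu-backed dimension table supports the single-row DML that plain HDFS-backed Hive tables can’t easily do:

```sql
-- Hypothetical Impala-on-Kudu example: Parquet-style scan speed
-- plus single-row inserts, updates and deletes.
CREATE TABLE customer_dim (
  customer_id   BIGINT PRIMARY KEY,
  customer_name STRING,
  current_flag  BOOLEAN
)
PARTITION BY HASH (customer_id) PARTITIONS 4
STORED AS KUDU;

-- The row-level operations an SCD2-style dimension load needs:
INSERT INTO customer_dim VALUES (1001, 'New Customer Ltd', true);
UPDATE customer_dim SET current_flag = false WHERE customer_id = 1001;
DELETE FROM customer_dim WHERE customer_id = 1001;
```

No HBase storage handler, no file rewrites: the Kudu storage layer handles the row-level changes while still serving fast columnar scans to Impala.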
1. Apache Drill — SQL over Self-Describing Data
Whilst Hive, Impala and all the other SQL-on-Hadoop engines that build on the metastore data dictionary framework Hive provides are all great ways to run SQL processing on the Hadoop platform, you’re still essentially following the same workflow as you did on traditional RDBMS platforms: you need to formally define your table structures before you run your SQL queries, a complicated and often manual process that adds significant time to data provisioning for flexible-schema “data lakes” … even though many of the storage formats we use in Hadoop environments, such as Parquet, JSON documents or even CSV files, have metadata stored within them. Apache Drill, closely associated with MapR but also now an Apache project, lets you query these “self-describing” data sources directly, without requiring you to formally set up their metadata in the Hive metastore … and combine the results with regular Hive tables, data from HBase, or even data from traditional database engines such as Oracle, Teradata or Microsoft SQL Server.
But it gets better … Drill has the ad-hoc query performance of Impala, Presto and other SQL engines optimised for BI-style workloads, making it an interesting future companion to Apache Hive, and supporting data-discovery-style applications that until now were only possible through proprietary vendor tools like Oracle’s Endeca Information Discovery.
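To make the contrast concrete, here’s a hedged sketch of the sort of Drill queries this enables (`dfs` and `hive` are Drill’s default storage plugin names; the file paths, tables and columns are hypothetical):

```sql
-- Query a raw JSON file in place: no CREATE TABLE, no metastore registration.
SELECT activity_type, duration_secs
FROM   dfs.`/data/landing/workouts.json`;

-- ...and join the self-describing file straight to an existing Hive table.
SELECT w.activity_type, d.activity_category
FROM   dfs.`/data/landing/workouts.json` w
JOIN   hive.activity_dim d
ON     w.activity_type = d.activity_type;
```

The schema of the JSON file is discovered at query time from the file itself, which is exactly the step that normally stalls data provisioning in a metastore-first workflow.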
Of course there are many more new Hadoop projects coming down the line that are concerned with BI, along with hot new startups such as Gluent that aim to bridge the gap between “old world” RDBMSs and “new world” Hadoop and NoSQL storage and compute frameworks, a company and development team I suspect we’ll hear a lot more about in the future. For now though, those are the top five current and future BI technologies I’ve been talking about on my travels over the past six months … now on to something even more interesting.
Slides (and some embedded videos) from my session at the inaugural Danish BI Meetup in Copenhagen, “From lots of reports (with some data Analysis) to Massive Data Analysis (With some Reporting)” — on the future of BI, how Hadoop is the new analytic platform, and how the BI industry is changing with the advent of self-service data discovery tools.
Earlier today I was the guest of Christian Antognini at the Trivadis Tech Event 2016 in Zurich, Switzerland, presenting on the use of Oracle Big Data Discovery as the “data scientists’ toolkit”, using data from my fitness devices, smart home appliances and social media activity as the dataset. The slides are here:
Whilst the main focus of the talk was on the use of BDD for data tidying, data visualisation and machine learning, another part of the talk that people found interesting, but that I couldn’t spend much time on, was how I got the data into the Hadoop data lake I put together to hold it. Whilst the main experiment I focused on in the talk just used a simple data export from the Jawbone UP website (https://jawbone.com) to get my workout, sleep and weight data, I’ve since gone on to bring a much wider set of personal data into the data lake via two main routes:
1. IFTTT (“If this, then that”), to capture activity from UP by Jawbone, Strava, Instagram, Twitter and other smartphone apps, which it then sends in the form of JSON documents via HTTP GET web requests to a Logstash ingestion server running back home, for eventual loading into HDFS and then Hive via Cloudera’s JSONSerDe
2. A Samsung SmartThings “smartapp” that subscribes to smart device and sensor activity in my home, then sends those events, again as JSON documents via HTTP GET web requests, to Logstash, again for eventual storage into HDFS and Hive and hourly uploads into Big Data Discovery’s DGraph NoSQL engine
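The Logstash end of both routes can be sketched as a single pipeline configuration. This is an illustrative sketch only (the port, HDFS host, user and paths are assumptions, not my actual setup), using the standard Logstash `http` input and `webhdfs` output plugins:

```conf
# Illustrative Logstash pipeline: webhook JSON in, HDFS out.
# Port, host, user and paths are hypothetical placeholders.
input {
  http {
    port  => 8080              # IFTTT / SmartThings send their JSON events here
    codec => "json"
  }
}
output {
  webhdfs {
    host  => "cdh-nn1"         # WebHDFS-enabled namenode
    port  => 50070
    user  => "hdfs"
    path  => "/data/landing/events/dt=%{+YYYY-MM-dd}/events.json"
    codec => "json_lines"
  }
}
```

Writing one JSON document per line into date-partitioned HDFS directories is what makes the downstream Hive-over-JSONSerDe tables and the hourly DGraph uploads straightforward.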
This all works well and means that my dashboards and analysis in Big Data Discovery are at most one hour behind the actual activities that take place in my smartphone apps or in my home, amongst the various connected heating thermostats, presence sensors and other devices eventually feeding their activity through my Samsung SmartThings hub and the smartapp that I’ve got running in Samsung’s SmartThings cloud service.
This additional data loading and storage route now gives me metrics in actual real time (not the one-hour delay caused by BDD’s DGraph hourly batch load), and gives me the total and complete set of events rather than the representative sample that Big Data Discovery normally brings in via its DGraph data-processing CLI load.
Eventually the aim is to connect it all up via Kafka, but for now it’s a lovely sunny day in Zurich and I’m meeting Christian Berg for a beer and lunch in 30 minutes, so I think it’s time for a well-deserved afternoon off, and then prep for my final speaking commitment before Oracle OpenWorld next week … Dublin on Monday to talk about Oracle Big Data.