Last-Stop Budapest … And Five New BI and Analytics Technologies Coming Soon for Hadoop
Over the past few months I’ve been speaking at a number of user group events in Europe and the Middle East on the future of BI, analytics and data integration on Hadoop and NoSQL platforms. The original idea behind the presentation was to show people that BI on Hadoop isn’t just slow Apache Hive queries, offloaded ETL work or doing the same old things but with less mature tools. I’ve updated the slides after each presentation and delivered the session for the final time today in Budapest, at the Budapest Data Forum organised by my good friend Bence Arato.
The final version of the slides is on SlideShare and you can view and download it using the widget below (and bear in mind I wrote most of the slides back in February 2016, when using Donald Trump to introduce Apache Spark made the audience laugh, not look at each other worried as it did today).
[embed]https://www.slideshare.net/rittmanmead/sqlonhadoop-for-analytics-bi-what-are-my-options-whats-the-future[/embed]
By the time I delivered this final iteration of the session there were so many “next-gen” SQL-on-Hadoop and BI-related technologies in the talk — Tez, Kudu, Spark, Apache Arrow, Apache Lens, Impala and RecordService, to pick just a few — that I thought it would be worth calling out what I think are the “Top 5” upcoming BI and SQL-related technologies and projects, starting with my #5…
5. Cloudera Impala and Parquet Storage — Fast BI Queries for Hadoop
Most people using Hadoop for analytic workloads will be aware of Impala, but its importance can’t be overstated: it’s the easiest way to improve the response times of BI-type queries when a customer is currently using just Hive. Impala uses a daemon-style architecture similar to that of the MPP/shared-nothing analytic RDBMSs, moving analytic workloads from Hive to Impala is easy, and most BI tool vendors support Impala as a data source…
…and coupled with the column-oriented Parquet file storage format, you get query performance close to, or better than, that of your old legacy data warehouse platform.
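As a minimal sketch of how simple the switch can be (the table and column names here are illustrative, not from any particular project), you can create a Parquet-backed copy of an existing Hive table and then query it through Impala:

```sql
-- Create a Parquet-backed copy of an existing Hive-managed table
-- (table and column names are illustrative)
CREATE TABLE sales_parquet
STORED AS PARQUET
AS SELECT * FROM sales;

-- Tell Impala to pick up the new table, then query it as normal
-- from impala-shell or any BI tool connected through ODBC/JDBC
INVALIDATE METADATA sales_parquet;

SELECT   prod_category,
         SUM(sale_amount) AS total_sales
FROM     sales_parquet
GROUP BY prod_category;
```

Because the table definition lives in the shared Hive metastore, the same table stays queryable from Hive for batch work while Impala’s daemons serve the interactive BI queries.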
4. Apache Lens — BI Middleware for ROLAP + Business Metadata
When you move from a desktop BI tool such as Tableau or QlikView to an enterprise BI platform such as Oracle BI Enterprise Edition or SAP Business Objects, one of the main benefits to users is the shared business metadata model these platforms provide, abstracting the various data sources into a more understandable, conformed dimensional model.
Some go further and provide middleware engines that turn this into a fully-fledged ROLAP (Relational OLAP) cube with a dimensionally-orientated query language, which the middleware layer then translates into the specific SQL and MDX dialects used by the data sources that map into it. Apache Lens, just out of incubator status and now a top-level Apache project, aims to do this for Hadoop.
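To give a flavour of what this looks like, Lens holds definitions of cubes, dimensions and measures in its own metastore and accepts queries in a HiveQL-like cube query language, which it rewrites against whichever underlying engine can best serve them. The sketch below is loosely based on the examples in the Lens documentation; the cube, measure and column names are mine and purely illustrative:

```sql
-- Illustrative Lens cube query: 'sales_cube', 'total_sales' and
-- 'store_region' are made-up names; measures such as total_sales
-- are pre-defined aggregations in the cube definition
cube select store_region, total_sales
from sales_cube
where time_range_in(order_time, '2016-01-01', '2016-06-01')
```

Lens then picks a candidate fact table and storage for the requested time range and generates the physical HiveQL (or JDBC SQL) itself, so the BI tool never needs to know which tables actually answered the query.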
And again paralleling the product architectures of Oracle Business Intelligence, IBM Cognos and SAP Business Objects, Hadoop now has its own MOLAP (Multidimensional OLAP) server — Apache Kylin — and one of the now en-vogue “notebook”-style query tools, Apache Zeppelin, making the next-generation Hadoop BI product architecture look something like the diagram below.
3. Apache Arrow — Unified In-Memory Storage Format for Hadoop
Whereas Hadoop used to be synonymous with MapReduce and writing everything to disk, Hadoop 2.0, through YARN and processing frameworks such as Apache Spark, now does much of its work in-memory, joining languages and toolkits such as R and Python’s Pandas in that respect … but things often grind to a halt when you try to move data from one in-memory technology to another, as Hadoop serialises and then deserialises the data to convert it from one tool’s in-memory storage format to another’s.
Apache Arrow, more of a standard and set of back-end APIs than an end-user BI tool, aims to standardise these in-memory storage formats and take away this data exchange overhead — leading to far greater interoperability between Hadoop’s in-memory tools and languages.
2. Cloudera Kudu — Fast, Real-Time Storage Engine for BI on Hadoop
Up until now, if you needed to update, delete or even just insert single rows of data into an Apache Hive table, your best bet was to switch the table’s storage from HDFS files to HBase, a type of NoSQL database that supports random access to individual storage cells. Combined with Hive via the HBase storage handler, this gives developers a roundabout way to perform these actions in a Hadoop environment — useful when you need to continue loading SCD2-type dimension tables as part of a data warehouse offload project. But HBase isn’t designed for the sort of large-scale aggregations and data selections common in data warehouse projects … and Parquet storage, which is optimised for exactly those types of queries, doesn’t handle trickle-feed data updates because of the columnar format it uses and the way it compresses data to better optimise it for queries.
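For reference, the HBase workaround looks something like the sketch below (table, column-family and column names are illustrative): the Hive table is declared with the HBase storage handler and a row-key/column mapping, after which inserts become HBase puts that can overwrite existing rows rather than appending to immutable HDFS files.

```sql
-- Hive table backed by HBase via the HBase storage handler
-- (table name, columns and column-family mapping are illustrative)
CREATE TABLE customer_dim (
  customer_key   STRING,
  customer_name  STRING,
  effective_from STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,cf:name,cf:eff_from'
)
TBLPROPERTIES ('hbase.table.name' = 'customer_dim');

-- "Updating" a dimension row is then just re-inserting it: the
-- HBase put with the same row key replaces the previous version
INSERT INTO TABLE customer_dim
SELECT customer_key, customer_name, effective_from
FROM   customer_dim_staging;
```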
Apache Kudu, coming out of Cloudera but now an incubating Apache project, aims to combine HBase’s insert, update and delete capabilities with the fast columnar storage format of Parquet. Combined with Impala, and with support for other query tools and processing frameworks coming, Kudu is the most ambitious attempt yet to go beyond simple file-based storage in Hadoop and replace it with an optimised storage layer designed specifically for analytic workloads — and all of it still open-source and free to deploy on as many servers as you need, in contrast to the $1m or so that traditional RDBMS vendors charge for their high-end database servers.
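Through Impala, a Kudu-backed table then behaves much like a regular database table, including row-level DML. A hedged sketch with illustrative names, noting that the exact CREATE TABLE syntax has changed between Kudu/Impala releases:

```sql
-- Kudu-backed table created and modified through Impala
-- (names are illustrative; DDL syntax varies with Impala/Kudu version)
CREATE TABLE product_dim (
  product_key  BIGINT,
  product_name STRING,
  current_flag STRING,
  PRIMARY KEY (product_key)
)
PARTITION BY HASH (product_key) PARTITIONS 4
STORED AS KUDU;

-- Row-level operations that plain HDFS-backed tables can't do
INSERT INTO product_dim VALUES (1001, 'Widget', 'Y');
UPDATE product_dim SET current_flag = 'N' WHERE product_key = 1001;
DELETE FROM product_dim WHERE product_key = 1001;
```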
1. Apache Drill — SQL over Self-Describing Data
Whilst Hive, Impala and all the other SQL-on-Hadoop engines that build on the metastore data dictionary framework Hive provides are great ways to run SQL processing on the Hadoop platform, you’re still essentially following the same workflow as you did on traditional RDBMS platforms — you need to formally define your table structures before you run your SQL queries, a complicated and often manual process that adds significant time to data provisioning for flexible-schema “data lakes” … even though many of the file formats we use in Hadoop environments, such as Parquet and JSON documents, carry their schema with them, and even CSV files have enough structure for it to be inferred. Apache Drill, closely associated with MapR but also now an Apache project, lets you query these “self-describing” data sources directly, without requiring you to formally set up their metadata in the Hive metastore … and then combine the results with regular Hive tables, data from HBase or even data from traditional database engines such as Oracle, Teradata or Microsoft SQL Server.
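In practice that means you can point Drill’s SQL straight at files. A sketch, assuming a configured `dfs` storage plugin plus a `hive` plugin for the metastore join; the file path and field names are illustrative:

```sql
-- Query a JSON file in place: no CREATE TABLE, no metastore entry
-- (file path and field names are illustrative)
SELECT   o.customer_id,
         SUM(CAST(o.order_total AS DOUBLE)) AS lifetime_value
FROM     dfs.`/data/landing/orders.json` o
GROUP BY o.customer_id;

-- ...and join the same file to an existing Hive table
SELECT c.customer_name,
       o.order_total
FROM   dfs.`/data/landing/orders.json` o
JOIN   hive.customers c
ON     o.customer_id = c.customer_id;
```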
But it gets better … Drill has the ad-hoc query performance of Impala, Presto and other SQL engines optimised for BI-style workloads, making it an interesting future companion to Apache Hive and a way to support data-discovery-style applications that up until now were only possible through proprietary vendor tools like Oracle’s Endeca Information Discovery.
Of course there are many more new Hadoop projects coming down the line that are concerned with BI, along with hot new Hadoop startups such as Gluent that aim to bridge the gap between “old world” RDBMS and “new world” Hadoop and NoSQL storage and compute frameworks — a company and development team I suspect we’ll hear a lot more about in the future. For now though, those are the top 5 current and future BI technologies I’ve been talking about on my travels over the past six months … now on to something even more interesting.