Posts tagged Hadoop
BigQuery, Looker and Big Data’s Rediscovery of Data Warehousing and Semantic Models … at Google…

Hadoop solved two problems data warehouse owners faced in the mid-2000s: how to handle the explosion in data volumes brought on by the digitisation of business, and how to bring agility and flexibility into the way data was landed, processed and made available to users.

Traditional data warehousing database server technology got more expensive and less reliable the more nodes you added to a cluster, whereas Hadoop became more resilient, and as open-source technology it was free to download and run on existing commodity servers. NoSQL databases emphasised speed over immediate consistency and offered storage options other than just tables and columns, for applications that needed to load data in real-time and whose datatypes were increasingly unstructured and non-relational.

The Hadoop and NoSQL ecosystem grew over the next ten years to include technologies such as Hive, Kafka and Sqoop that took concepts well-established in the database world and rethought them for the distributed, polyglot datasource and flexible-processing world of big data.

Most, like Hive, had areas of functionality clearly still years behind their commercial equivalents, but focused instead on new possibilities opened up by Hadoop’s flexible architecture and the ability of its open-source developer community to rapidly iterate and fill in missing features as needs became apparent.

Step-forward to two-or-three years ago and most large commercial organizations had big data initiatives of one form or another hosted in their data centres, using Hadoop distributions from the likes of Cloudera and Hortonworks on white-box hardware or dedicated server appliances from Oracle and IBM. In-house teams managed the often hundreds or thousands of nodes in a cluster, and over time reference architectures emerged that added governance and workload-separation concepts demanded by enterprise customers, along with design patterns that bridged the gap between fast-moving streaming data sources and the slower, batch-oriented world of traditional on-premise data-processing.

And so Hadoop and NoSQL technologies increasingly became the default choice organizations made when storing and organizing data for query and analysis, as they were more flexible in terms of how and when you processed data and were orders of magnitude cheaper to license and provision than the high-end relational database servers they’d used up until then. I’m giving a talk on this topic later this week at the UK Oracle User Group’s Database SIG in London, and the problems each Hadoop component solved, and the workloads that still make sense to run on traditional relational database technologies, are key concepts DBAs and developers need to be aware of when thinking about these new technologies.

But as with any new technology that makes an older one obsolete, Hadoop and NoSQL introduced their own new problems. As we discussed in the Drill to Detail Podcast episode featuring colleague Alex Olivier, in this case they were around the complexity and effort required to manage clusters containing thousands of nodes running dozens of services, and to do so economically and securely. In addition, the cognitive overhead of Hadoop and NoSQL’s schema-on-read storage pushed the burden of structuring and interpreting data onto users who didn’t have the skills or time to do this effectively.

The first generation of cloud platforms for running big data workloads typically took what was done on-premise and lifted-and-shifted that workload into the cloud, replacing physical servers with virtual ones, giving system owners more capacity and flexibility around server provisioning but still requiring the customer to own and manage the environment themselves. The first cloud database server platforms were most likely to be straight ports of that vendor’s on-premise product hosted in VMs running on that cloud vendor’s Infrastructure-as-a-Service (IaaS) platform, but a new generation of elastic, cloud-native data warehouse Platform-as-a-Service products such as Google BigQuery and Amazon Athena bring together the benefits of cloud and big data scale, come with their own similarly elastically-provisioned data processing and data ingestion services while re-introducing concepts such as tabular storage and SQL access and

On-premise technologies such as Apache Kudu started this move towards structuring the storage big data systems used for analytic workloads as tables and columns rather than schema-less filesystems, but BigQuery took this mainstream, to the point where organizations such as Qubit, the London startup I featured on the Drill to Detail episode and where I’m currently responsible for analytics product strategy, use it to ingest and process hundreds of terabytes of customer event data each day and store it neatly organized into tables, columns, projects and industry-specific business views.

If you logged into BigQuery’s SQL web interface and didn’t know about the massively-parallel big data architecture BigQuery took from Dremel you’d assume it was just another database server, and more importantly you can get on with your analysis rather than struggling with JSON SerDes and Hive EXPLODE and LATERAL VIEW clauses to decode your schema-on-read dataset. In today’s service-based PaaS cloud analytics platforms the two previously distinct relational data warehousing and Hadoop distributed processing technologies are converging with cloud storage and elastically-provisioned compute and storage to give us something new and potentially very interesting … just as new BI tool startup Looker coming out of the big data and e-commerce space has begun introducing a new generation of software developers and data architects to the benefits of BI data modeling, semantic models and data-layer abstraction.
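To make the schema-on-read burden a little more concrete, here is a minimal Python sketch of the flattening work that Hive’s EXPLODE and LATERAL VIEW clauses do on a nested array, and that a tabular store saves the analyst from doing at query time. The event structure, field names and values are invented for illustration and are not taken from any real dataset.

```python
import json

# Hypothetical raw events as they might land in a schema-on-read store:
# one JSON document per line, each carrying a nested array of items.
RAW_EVENTS = [
    '{"visitor": "v1", "items": [{"sku": "A", "price": 10.0}, {"sku": "B", "price": 5.5}]}',
    '{"visitor": "v2", "items": [{"sku": "A", "price": 10.0}]}',
    '{"visitor": "v3", "items": []}',
]


def explode_items(raw_lines):
    """Yield one flat row per nested item, parsing the JSON at read time."""
    for line in raw_lines:
        event = json.loads(line)
        # An event with an empty array produces no rows at all, much like
        # LATERAL VIEW without the OUTER keyword.
        for item in event.get("items", []):
            yield {
                "visitor": event["visitor"],
                "sku": item["sku"],
                "price": item["price"],
            }


rows = list(explode_items(RAW_EVENTS))

# Only once the data is flat can we do the aggregation we actually wanted.
revenue_by_sku = {}
for row in rows:
    revenue_by_sku[row["sku"]] = revenue_by_sku.get(row["sku"], 0.0) + row["price"]

print(rows)
print(revenue_by_sku)
```

Every query against the raw events has to repeat this parse-and-flatten step, whereas with tables and columns it is done once, at load time.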

The trouble with using tools such as Tableau and Oracle Data Visualization Desktop against even tabular-structured data sources like Google BigQuery for business-user analysis is that they treat each report as its own silo of data; users can’t easily navigate from one area of data to another as reports are based on single data extracts of metrics and attributes from one subject area designed to make data visualizations and data discovery simple for departmental users.

Traditional enterprise BI tools like Business Objects and Cognos do give users the ability to create rich, integrated semantic business models over platforms such as Google BigQuery using point-and-click graphical tools, but the kinds of startups and engineering-led early-adopter companies implementing BigQuery, and wanting to structure and join together reporting data to create enterprise semantic models, are more interested in working with markup languages and Git repositories and doing it all in code.

LookML, a language developed by Looker for describing dimensions, aggregates, calculations and data relationships in a SQL database, is used to define semantic models in Looker’s BI tool and is aimed squarely at the software engineers typically working on the e-commerce and big data platforms that use BigQuery to host and analyze big data customer activity datasets. The code snippet below shows some simple LookML used to define an orders view in a Looker semantic model, and developers are encouraged to share code examples on Looker’s community Looker Blocks Directory.
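As a loose illustration of what a code-defined semantic model does, the sketch below maps business-friendly names for dimensions and measures onto SQL expressions and generates the query from them. It is written in Python rather than LookML, it is not how Looker itself works internally, and the `ORDERS_VIEW` structure, the `build_query` helper and the sample `orders` table are all hypothetical, with SQLite standing in for the warehouse.

```python
import sqlite3

# Hypothetical semantic model: business names mapped to SQL expressions.
ORDERS_VIEW = {
    "table": "orders",
    "dimensions": {"Order Status": "status", "Customer": "customer_id"},
    "measures": {"Order Count": "COUNT(*)", "Total Revenue": "SUM(amount)"},
}


def build_query(view, dimensions, measures):
    """Turn a request for named dimensions and measures into a SQL string."""
    select_parts = [f'{view["dimensions"][d]} AS "{d}"' for d in dimensions]
    select_parts += [f'{view["measures"][m]} AS "{m}"' for m in measures]
    sql = f'SELECT {", ".join(select_parts)} FROM {view["table"]}'
    if dimensions:
        group_cols = ", ".join(view["dimensions"][d] for d in dimensions)
        sql += f" GROUP BY {group_cols} ORDER BY {group_cols}"
    return sql


# A tiny in-memory table standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, customer_id TEXT, status TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "c1", "complete", 20.0), (2, "c2", "complete", 15.0), (3, "c1", "pending", 5.0)],
)

sql = build_query(ORDERS_VIEW, ["Order Status"], ["Order Count", "Total Revenue"])
result = conn.execute(sql).fetchall()
print(sql)
print(result)
```

The point of the design is that the user only ever picks "Order Status" and "Total Revenue"; the column names, aggregate functions and GROUP BY live in one version-controlled definition.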

We’re using Looker at Qubit to create semantic models for use with Live Tap, an analytics service that runs on top of Google BigQuery that marketers use to analyze and explore the event-level customer activity datasets we build for them. Although BigQuery provides SQL access and a web UI for running queries we knew marketers would prefer to use a graphical BI tool, and Looker and the sample code we’re planning on sharing on GitHub for customers to use and extend in their own Looker environments mean we can present all of their data joined-up, with meaningful business names for metrics and dimension attributes and organized into subject areas that hide the complexity of the underlying BigQuery table structures.

And so the story ends up coming full circle; big data solved the scale and agility issues that were holding back data warehouses but added complexity that in time was addressed by running that infrastructure as a service in the cloud, and now in turn it adopts some of the characteristics of the platforms it usurped to make developers more productive and help end-users understand and analyze all their corporate data as one big joined-up dataset.

Google Cloud’s Next developer event is running in San Francisco in a couple of weeks’ time and it’ll be interesting to see what’s coming next for BigQuery, and in another case of big data adopting traditional database capabilities I’ll definitely be interested in hearing about Google Cloud Spanner, and how it could do for OLTP what MapReduce and Dremel did for our data warehousing relational databases.

Drill to Detail Podcast : Looking Back at 2016, and What’s New and Planned for 2017

I started the Drill to Detail Podcast series back in October this year with an inaugural episode featuring long-time friend and colleague Stewart Bryson talking about changes in the BI industry, and since then we’ve gone on to publish one episode a week featuring guests such as Oracle’s Paul Sonderegger on data capital, Cloudera’s Mike Percy on Apache Kudu followed shortly after by Mark Grover on Apache Spark, Dan McClary from Google talking about SQL-on-Hadoop and Neeraja Rentachintala from MapR telling us about Apache Drill and Apache Arrow, Tanel Poder from Gluent on the New World of Hybrid Data …

… along with many other guests including Jen Underwood, Kent Graziano, Pat Patterson from StreamSets and old friends and colleagues Andrew Bond, Graham Spicer and most recently for the Christmas and New Year special, Robin Moffatt.

In fact I’d only ever planned on publishing new episodes of Drill to Detail once every two weeks along the lines of the two podcasts that inspired Drill to Detail (the Apple-focused The Talk Show by John Gruber, and Marco Arment, Casey Liss, and John Siracusa’s Accidental Tech Podcast), but what with a number of episodes recorded over the summer waiting for the October launch and so many great guests coming on over the next few months we ended up publishing new episodes every week.

So at the end of 2016 and with fourteen episodes published on this website and on the iTunes directory I’d like to take this opportunity to thank all the guests that came on the show along with friends in the industry such as Confluent’s Gwen Shapira who helped get the word out and make introductions, and of course most importantly I’d like to thank everyone who’s downloaded episodes of the show, mentioned it on Twitter and other social networks and increasingly, subscribed to the show on the iTunes store to the point where we’re typically hitting a thousand or more subscribers each week based on Squarespace’s estimate of overall RSS subscriber numbers including those coming in from iTunes and other feed aggregators.

And if you’re wondering which show had the highest audience numbers it was November’s Episode 7 with Cloudera’s Mark Grover on Apache Spark and Hadoop Application Architectures, closely followed by October’s episode with Oracle’s Big Data Strategist Paul Sonderegger on data capital, both of which were great examples of what ended-up being the recurring theme and area of discussion with every one of the guests and shows we recorded … the business, strategy, rationale and opportunities for competitive advantage coming out of innovations in the big data and analytics industry.

And now we’re going into 2017 and the second year of Drill to Detail, we’re going to double-down on this area of focus by updating the Drill to Detail website with a new look and launching the new Drill to Detail Blog to accompany the podcast series, each week posting a long-form blog post looking at the business and strategy behind what’s happening in the big data, analytics and cloud-based data processing industry.

We’ll still be continuing with the podcast series exactly as they are now with guests including Elastic’s Mark Walkom and Cindi Howson from Gartner due on the show in January, but these longer-form blog posts give us a chance to explore and explain in a more structured way the topics and questions raised by what’s been discussed on the podcast, analyzing and exploring the implications from trends and directions coming out of the industry.

Finally, going back to my original inspiration for the podcast that started all of this, a big part of the idea to focus on this particular theme came from what’s now become my new favourite blog and podcast series, Ben Thompson’s Stratechery website and the Exponent podcast that he co-hosts with James Allworth, and if I manage to get even some way towards the insights and understanding he brings to the wider IT landscape and apply that to the part of the industry we work in … well, that’ll be my evenings, commute time and weekend time well spent this coming year.

Data Capital, Competitive Strategy and the Economics of Big Data — Drill to Detail Podcast Ep.6

Every guest on the Drill to Detail podcast has been a pleasure to interview, from Stewart Bryson on the inaugural episode through Dan McClary, Mike Percy, Kent Graziano, Andrew Bond and later this week Cloudera and Apache Spark’s Mark Grover, but one recording I was particularly looking forward to was last week’s guest Paul Sonderegger, ex-Endeca and currently Oracle’s Big Data Strategist talking to their customers about a concept he’s termed “Data Capital” … and what this new form of capital means for competitive strategy and company valuations.


If you (like me, secretly) thought Oracle’s previous “Digitisation and Datification” slidedeck was a bit … handwavy and corporate marketing b*llocks, well this is where it all comes together and makes sense. If you work in consulting or are looking for some sort of economic rationale and underpinning for all this investment in big data technology, and sometimes wonder why Netflix and Google are valued higher than CBS and your local newspaper, here’s your answer. A great episode exploring the business value of big data, not just the technical benefits.

And coming soon on another future episode … MapR. Watch this space.

Drill to Detail Ep.3 with Mike Percy on Apache Kudu

Episode 3 of the Drill to Detail podcast is now live on the podcast website and available for download on iTunes, and this week I’m very pleased to be joined by Cloudera’s Mike Percy, software engineer and lead evangelist within Cloudera for Apache Kudu, the new Cloudera-sponsored column-store data layer that takes the best features from HBase and Parquet and creates a storage layer specifically optimized for analytics.

The problem that Kudu solves is something that becomes apparent to most Hadoop developers creating analytic applications that need to support BI-type query workloads against data arriving in real-time from streaming sources. Whilst column-orientated file formats like Apache Parquet are great for supporting BI-style workloads they’re not that good for handling streaming data, and while HBase adds support for single-row inserts, updates and deletes to Hive, queries that require aggregation up from cell level don’t perform all that well, such that most projects I’ve worked on copy data from HBase into a format such as Parquet before presenting that data out to users for query.
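The trade-off between the two layouts can be sketched with plain Python data structures. This is a conceptual toy only, not how HBase, Parquet or Kudu are actually implemented, and the sensor names and readings are made up: a row-oriented layout keeps each record together, which makes single-row changes trivial, while a column-oriented layout keeps each column together, which lets an aggregate read just the one column it needs.

```python
# Row-oriented layout: each record is stored together, keyed by row id.
row_store = {
    "s1": {"sensor": "kitchen", "temp": 21.5, "humidity": 40},
    "s2": {"sensor": "lounge", "temp": 19.0, "humidity": 45},
    "s3": {"sensor": "kitchen", "temp": 22.5, "humidity": 42},
}

# Column-oriented layout: each column is stored together as one list.
column_store = {
    "key": ["s1", "s2", "s3"],
    "sensor": ["kitchen", "lounge", "kitchen"],
    "temp": [21.5, 19.0, 22.5],
    "humidity": [40, 45, 42],
}


def update_row(store, key, **changes):
    """Single-row update in the row layout: one lookup, one record touched."""
    store[key].update(changes)


def update_columns(store, key, **changes):
    """Same update in the column layout: find the position, then patch each column."""
    position = store["key"].index(key)
    for column, value in changes.items():
        store[column][position] = value


def average_temp_from_rows(store):
    """Aggregate in the row layout: every whole record has to be visited."""
    return sum(record["temp"] for record in store.values()) / len(store)


def average_temp_from_columns(store):
    """Aggregate in the column layout: only the 'temp' column is read."""
    temps = store["temp"]
    return sum(temps) / len(temps)


update_row(row_store, "s2", temp=20.5)
update_columns(column_store, "s2", temp=20.5)

print(average_temp_from_rows(row_store))
print(average_temp_from_columns(column_store))
```

At three rows the difference is invisible, but with billions of rows, compression and files on disk, the write path favours the first layout and the scan path the second, which is the gap described above.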

Apache Kudu, as Mike Percy explains in this video of one of his presentations on Kudu back in 2015, takes the “fast data” part of HBase and adds the “fast query” capability you get with column-store formats like Parquet, and for Hadoop platforms that need to support this type of workload and dataset the aim is that it replaces HDFS as a more optimized form of storage.


In practice you tend to use Kudu as the storage format for Cloudera Impala queries, with Impala then gaining INSERT, UPDATE and DELETE capabilities, or you can do what I’ve been doing recently and use a tool such as StreamSets to load data into Kudu as just another destination type, as I’m doing in the screenshot below where home IoT sensor data lands in real-time into Kudu via StreamSets, and can be queried immediately using Impala SQL and a tool such as Hue or Oracle Data Visualization Desktop.

So thanks to Mike Percy and Cloudera for coming on this latest edition of the show, and you can read more about Kudu and the milestone 1.0 release on the Cloudera Vision blog.

Drill to Detail Ep.2. “The Future of SQL-on-Hadoop” with Special Guest Dan McClary

This is a good one.

Most of you will know Dan McClary as the product manager at Oracle for Big Data SQL, and more recently he’s moved to Google to work on their storage and big data projects. If you’ve met Dan or heard him speak you’ll know he’s not only super-smart and very knowledgeable about Hadoop, but he’s great to get into a conversation with … which is why I was particularly pleased to have him on as the special guest on Episode 2 of my new podcast series, Drill to Detail.

In this new episode Dan and I discuss the state of the SQL-on-Hadoop market and where he’s seeing the innovation; how the mega-vendors are contributing to, extending and competing with the Hadoop ecosystem; and what he’s seeing coming out of the likes of Google, Yahoo and other Hadoop innovators that may well make its way into the next Apache Hadoop project. You can download the podcast recording from the Drill to Detail website, and it’s just about to go live on iTunes where you can subscribe and automatically receive future episodes — and you certainly won’t want to miss the next one, believe me.

Presenting Second in the Gluent New World Webinar Series, on SQL-on-Hadoop Concepts and…

I’m very pleased to be delivering the second in the Gluent New World webinar series, a program of webinars organised by Oracle ACE Director and Gluent CEO Tanel Poder. Tanel opened the series with a talk about in-memory processing for databases, and I’ll be continuing the series with a session on the rationale and core technical concepts behind my current area of focus — SQL-on-Hadoop.

I was initially sceptical when I heard about Apache Hive and vendor-specific implementations but set-based processing, aggregation and querying is a core requirement for virtually every data processing platform — and over the years the core Hive and HCatalog foundation has evolved into technologies such as Spark SQL for large-scale processing, Cloudera Impala and Apache Drill for more interactive workloads, and specialised storage layers and file formats to support the different types of SQL workloads.

I’ll also take a look at what new technologies and capabilities are “coming around the corner”, and preview some of the ideas and demos I’ll be showing at the upcoming Enkitec E4 Conference in Barcelona later this month where I’ll be speaking on the same topic. Registration is free and you can sign up here:

Update : The slides from the event are below:

Last-Stop Budapest … And Five New BI and Analytics Technologies Coming Soon for Hadoop

Over the past few months I’ve been speaking at a number of user group events in Europe and the Middle East on the future of BI, analytics and data integration on the Hadoop and NoSQL platform. The original idea behind the presentation was to show people how BI on Hadoop isn’t just slow Apache Hive queries, offloaded ETL work or doing the same old things but with less mature tools. I’ve updated the slides after each presentation and delivered it for the final time today in Budapest, at the Budapest Data Forum organised by my good friend Bence Arato.

The final version of the slides is on Slideshare and you can view and download them using the widget below (and bear in mind I wrote most of the slides back in February 2016, when using Donald Trump to introduce Apache Spark made the audience laugh, not look at each other worried as it did today).


By the time I delivered this final iteration of the session there were so many “next-gen” SQL-on-Hadoop and BI-related technologies in the talk (Tez, Kudu, Spark, Apache Arrow, Apache Lens, Impala and RecordService, to pick just a few) that I thought it would be worth calling out what I think are the “Top 5” upcoming BI and SQL-related technologies and projects in my talk, starting with my #5…

5. Cloudera Impala, and Parquet Storage — Fast BI Queries for Hadoop

Most people using Hadoop for analytic workloads will be aware of Impala, but its importance can’t be overstated as the easiest way to improve the response times of BI-type queries when a customer is currently using just Hive. Using a daemon-style architecture similar to the MPP/shared-nothing analytic RDBMSs, moving to Impala from Hive for analytic workloads is easy and most BI tool vendors support Impala as a data source…

…and coupled with the column-oriented Parquet file storage format, you’re getting query performance close to, or better than, that of your old legacy data warehouse platform.

4. Apache Lens — BI Middleware for ROLAP + Business Metadata

When you move from a desktop BI tool such as Tableau or Qlikview to an enterprise BI platform such as Oracle BI Enterprise Edition or SAP Business Objects, one of the main benefits to users is the shared business metadata model they provide that abstracts away the various different data sources into a more understandable conformed dimensional model.

Some provide additional middleware engines that turn this into a fully-fledged ROLAP (Relational OLAP) cube with a dimensionally-orientated query language that the middleware layer then translates into the specific SQL and MDX languages used by the data sources that map into it. Apache Lens, just out of incubator status and now a top-level Apache project, aims to do this for Hadoop.

And again paralleling the product architectures of Oracle Business Intelligence, IBM Cognos and SAP Business Objects, Hadoop now has its own MOLAP (Multidimensional OLAP) server — Apache Kylin — and one of the now en-vogue “notebook”-style query tools, Apache Zeppelin, making the next-generation Hadoop BI product architecture look something like the diagram below.

3. Apache Arrow — Unified In-Memory Storage Format for Hadoop

Whereas Hadoop used to be synonymous with MapReduce and writing everything to disk, Hadoop 2.0 through YARN and in-memory processing frameworks such as Apache Spark is now all about in-memory — and joins languages and toolkits such as R and Python’s Pandas in doing much of their work in-memory … but things often grind to a halt when you try to move data from one in-memory technology to another as Hadoop serialises and then deserialises the data to move it from one in-memory storage format to another.

Apache Arrow, more of a standard and set of back-end APIs than an end-user BI tool, aims to standardise these in-memory storage formats and take away this data exchange overhead — leading to far greater interoperability between Hadoop’s in-memory tools and languages.
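The overhead being described can be illustrated with nothing more than the Python standard library. This sketch does not use Arrow or any of its APIs; `pickle` stands in for the serialise-and-deserialise hand-off and a `memoryview` over an `array` stands in for two tools agreeing on one in-memory layout, and the readings are invented.

```python
import pickle
from array import array

# A small column of double-precision values held by the "producer" tool.
readings = array("d", [1.5, 2.5, 3.0, 4.0])

# Hand-off 1: serialise, ship the bytes, deserialise. The consumer ends up
# with an independent copy, and both steps cost CPU time and memory.
payload = pickle.dumps(readings)
copied = pickle.loads(payload)

# Hand-off 2: both sides agree on the layout, so the consumer simply gets
# a view onto the producer's existing buffer. Nothing is copied.
shared = memoryview(readings)

# Change the producer's data to show which consumer is looking at the
# original memory and which is holding a stale copy.
readings[0] = 100.0

print(copied[0])      # the copy still holds the value it was serialised with
print(shared[0])      # the view reflects the change immediately
print(shared.nbytes)  # size of the shared buffer in bytes
```

The second hand-off only works because producer and consumer agree on exactly how the values are laid out in memory, which is the kind of agreement a common in-memory format is meant to provide across tools and languages.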

2. Cloudera Kudu — Fast, Real-Time Storage Engine for BI on Hadoop

Up until now, if you needed to update, delete or even just insert single rows of data into an Apache Hive table, your best bet was to switch the table’s storage format from HDFS to HBase, a type of NoSQL database that supports random access to individual storage cells and, when combined with Hive using the HBase Storage Handler, gives developers a roundabout way to perform these actions in a Hadoop environment, which is useful when you need to continue to load SCD2-type dimension tables as part of a data warehouse offload project. But HBase isn’t designed for performing the sort of large-scale aggregations and data selections common in data warehouse projects … and Parquet storage, which is optimised for these types of queries, doesn’t handle trickle-feed data updates because of the columnar format it uses and the way it compresses data to better optimise it for queries.

Apache Kudu, coming out of Cloudera but now an incubating Apache project, aims to combine HBase’s insert, update and delete capabilities with the fast columnar storage format of Parquet. Combined with Impala and with support for other query tools and processing frameworks coming, Kudu is the most ambitious attempt to go beyond simple file-based storage in Hadoop and replace it with an optimised storage layer designed specifically for analytic workloads, and all of it still open-source and free to deploy on as many servers as you need, in contrast to the $1m or so that traditional RDBMS vendors charge for their high-end database servers.

1. Apache Drill — SQL over Self-Describing Data

Whilst Hive, and Impala, and all the other SQL-on-Hadoop engines that build on the metastore data dictionary framework that Hive provides are all great ways to run SQL processing on the Hadoop platform, you’re still essentially following the same workflow as you did when working on traditional RDBMS platforms — you need to formally define your table structures before you run your SQL queries, a complicated and often manual process that adds significant time to the data provisioning process for flexible-schema “data lakes” … even though many of the new storage formats we use in Hadoop environments such as Parquet, JSON documents or even CSV files have metadata stored within them. Apache Drill, closely associated with MapR but also now an Apache project, lets you query these “self-describing” data sources directly without requiring you to formally set up their metadata in the Hive metastore … and combine the results with regular Hive tables, data from HBase or even data from traditional database engines such as Oracle, Teradata or Microsoft SQL Server.
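The idea of querying self-describing data can be sketched in a few lines of Python. This is not Drill and makes no claim about how Drill works internally; SQLite stands in for the SQL engine, and the `load_self_describing` helper, the `readings` table name and the CSV content are hypothetical. The only point is that the column names come from the file itself rather than from a table definition written in advance.

```python
import csv
import io
import sqlite3

# Hypothetical self-describing source: the first line of the CSV names its columns.
CSV_DATA = "sensor,temp\nkitchen,21.5\nlounge,19.0\nkitchen,22.5\n"


def load_self_describing(conn, table, text):
    """Create and fill a table using only the metadata found in the data itself."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the schema is discovered here, not declared up front
    columns = ", ".join(f'"{name}"' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return header


conn = sqlite3.connect(":memory:")
discovered = load_self_describing(conn, "readings", CSV_DATA)

# Values arrive as text, so the type is applied at query time with a CAST.
query = (
    "SELECT sensor, AVG(CAST(temp AS REAL)) FROM readings "
    "GROUP BY sensor ORDER BY sensor"
)
result = conn.execute(query).fetchall()

print(discovered)
print(result)
```

Nobody had to write a CREATE TABLE statement with named, typed columns before asking the question, which is the step that adds so much time to provisioning data in a metastore-based workflow.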

But it gets better … Drill has the ad-hoc query performance of Impala, Presto and other SQL engines optimised for BI-style workloads, making it an interesting future companion to Apache Hive and supporting data-discovery style applications up until now only possible through proprietary vendor tools like Oracle’s Endeca Information Discovery.

Of course there are many more new Hadoop projects coming down the line that are concerned with BI, along with hot new Hadoop startups such as Gluent that aim to bridge the gap between “old world” RDBMS and “new world” Hadoop and NoSQL storage and compute frameworks, a company and development team I suspect we’ll hear a lot more about in the future. For now though, those are the top 5 current and future BI technologies I’ve been talking about on my travels over the past six months … now on to something even more interesting.