The news broke late on Friday that Thomas Kurian is joining Google to become their new CEO for Google Cloud Platform. Five thoughts on Kurian’s move:
It’s a smart move made by an exceptionally smart guy. Brave, too, given the recent history between Oracle and Google, but not surprising given his drive and, presumably, a point to prove. I met him a few times as part of Oracle’s ACE Director program and he’s the only software exec I know who can talk long-term strategy and vision one minute and then explain to you in detail how it all works the next, all with great humility and courtesy.
But Oracle will survive; this has happened many times before. Ray Lane, Charles Phillips, Tom Siebel and Marc Benioff all left, and in many cases founded massively successful and competitive businesses; client-server gave way to internet architecture, and internet gave way to cloud. It’s all part of how businesses renew and people move on and up, there are plenty more smart (and most likely, smarter) people left in Oracle, and Larry Ellison is still just as driven, competitive and on top of his game.
Look out for a very interesting (as in Brexit: interesting to watch but not to be part of) culture clash at GCP, with TK about the most top-down leader of a product development team I’ve ever seen and Google famously engineering-focused, beanbag-friendly and bottom-up. Add to that the feelings most Googlers have towards Oracle, and TK will have his work cut out early on to win the various teams over. My guess is that his work ethic, technical chops and sheer ability will do just that, and if this role is a success, Sergey and Larry may well have found their new Eric Schmidt, this time with technical skills to match theirs; but there’s always the chance that culture will prevail and he’ll be the next Marissa Mayer instead. No pressure there, then.
Expect to see GCP moving increasingly into SaaS B2B areas such as ERP, CRM, Financials and industry-vertical applications to complement their commoditised IaaS and PaaS cloud business, leveraging their massive G Suite and, increasingly, ChromeOS install base. Just think what they could do if they had access to all the world’s structured business transactional data as well as the unstructured search, email and spreadsheet data they have now: even more comprehensive and market-leading search, the world’s biggest and most valuable ML training set, and a business model that could provide these applications for free in exchange for exclusive access to that data and Google being your default search engine. That’s the real existential threat to Oracle: spending all their time trying to win an unwinnable cloud infrastructure war, and then GCP coming along and making ERP, CRM and business applications essentially free.
To illustrate this, let’s run a query that requests all columns (“SELECT * FROM …”) from a BigQuery table; as you can see from the screenshot below, it reads through all 1.47 GB of table data to return that full set of columns to the user.
If the user’s query only really needed two of those columns, requesting just those brings the amount of data read down to just under 10 MB, as most of that table’s data is stored in columns that aren’t needed for the query.
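A minimal sketch of the difference, using hypothetical table and column names (BigQuery charges by columns scanned, not rows returned):

```sql
-- Full scan: BigQuery reads the storage for every column in the table
SELECT *
FROM `mydataset.pageviews`;

-- Column pruning: only the storage for these two columns is read,
-- a small fraction of the table's total size
SELECT session_id, event_timestamp
FROM `mydataset.pageviews`;
```

Because BigQuery stores data column-by-column, the second query never touches the storage for the columns it doesn’t name, which is where the cost saving comes from.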
BigQuery has historically supported table partitioning based on the date you loaded data into the table, which sounds great until you realise that it’s the transaction date, not the ingest date, that most of your users’ queries filter against.
You could also use table decorators in Legacy SQL to point to the particular day partition your data was stored within, but this only went back a maximum of seven days and required your query tool to support this non-standard feature of SQL. Earlier this year, though, Google introduced a more flexible form of date partitioning as a beta feature that lets you choose the date column your table is partitioned by, and more recently introduced a feature called table clustering that stores data within a table sorted by the columns you most commonly filter on when querying it.
To show how date partitioning and table clustering work, I’ll start by running a query to return just a month’s data from the five years of data held within my table; as you can see in the screenshot below, BigQuery performs a full table scan and reads through all 1.37 GB of data in the table to return just the month of data my query requested.
Standard SQL now supports DDL commands such as CREATE TABLE and CREATE TABLE … AS SELECT, along with a PARTITION BY clause that lets you specify a timestamp or date column to partition the table by. I’ll use these new features to create a copy of that same table, this time partitioned by the timestamp column I’m filtering on in my query predicate …
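A sketch of the sort of DDL statement I mean; the table and column names here are illustrative, not the ones from my actual project:

```sql
-- Copy an existing table into a new one, partitioned by day
-- on the timestamp column the queries filter against
CREATE TABLE `mydataset.pageviews_partitioned`
PARTITION BY DATE(event_timestamp)
AS
SELECT *
FROM `mydataset.pageviews`;
```

The `PARTITION BY DATE(…)` clause turns each distinct day in that timestamp column into its own partition, so queries filtering on it only read the partitions they need.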
… and the DDL statement fails. What’s happened there then?
Turns out that BigQuery tables are limited to 2,500 partitions for the moment, with any single load operation limited to 2,000 (hence the error). With partitioning limited to date and timestamp columns and partitions a day in length, any one table can only hold around five years’ worth of data; beyond that you’ll need to create multiple date-partitioned tables and UNION them together through a SQL view.
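Working around that limit might look something like this, again with hypothetical table names: one date-partitioned table per five-year range, stitched together with a view so queries can still address the full history as one logical table.

```sql
-- Each underlying table stays within the partition limit;
-- the view presents them as a single queryable object
CREATE VIEW `mydataset.pageviews_all` AS
SELECT * FROM `mydataset.pageviews_2008_2012`
UNION ALL
SELECT * FROM `mydataset.pageviews_2013_2017`;
```

Queries that filter on the partition column still benefit from partition pruning within each underlying table.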
For now, though, I load my table with just five years of data and then re-run the query that requests a single day from that overall five years; now BigQuery reads and processes only 57 MB of data, and it’d be a fraction of that if I requested only the two columns I needed rather than all columns from the table.
But what about queries that filter against the other columns in this table? We can’t set up table partitioning on STRING, INTEGER or any other column datatype, so my original query, if re-run against the date-partitioned table, reads just as much data as it did before.
What we could do is re-create the table with its data pre-sorted by those two particular columns using another new feature called table clustering, so that queries that filter against those columns find the blocks of data they’re looking for faster and can skip entirely the ones that don’t contain it.
Table clustering in BigQuery is more analogous to loading regular Oracle tables using data from a pre-sorted file and comes with the same benefits and limitations; in BigQuery’s case it takes care of the pre-sorting and table configuration for you but the same limitations still apply around how you filter the table and what happens when you load more data afterwards.
Let’s set up a clustered table now that stores its data ordered by the two columns used in the query I ran a few moments ago.
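In DDL terms that looks something like the sketch below (names again hypothetical); note that BigQuery clustering is defined on top of a partitioned table, with `CLUSTER BY` following the `PARTITION BY` clause:

```sql
-- Partitioned by day, and within each partition the data is
-- stored sorted by the two columns most queries filter on
CREATE TABLE `mydataset.pageviews_clustered`
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_type, user_id
AS
SELECT *
FROM `mydataset.pageviews`;
```

BigQuery handles the sorting for you at load time; the column order in `CLUSTER BY` matters, as filtering on the first clustering column gives the most pruning benefit.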
Now when I run a query filtering on those columns against this partitioned, clustered table the amount of data read goes down compared to before, and results are returned a bit faster; if I included the partition key column in the query as well, returning just a few days’ data, it’d be faster still.
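The kind of query that benefits most combines a filter on the clustering columns with a filter on the partition key, so both pruning mechanisms kick in; a sketch, with illustrative names:

```sql
-- Clustering skips blocks that don't match the event_type filter;
-- the date predicate limits the scan to a handful of partitions
SELECT user_id, COUNT(*) AS events
FROM `mydataset.pageviews_clustered`
WHERE event_type = 'purchase'
  AND DATE(event_timestamp) BETWEEN '2017-01-01' AND '2017-01-07'
GROUP BY user_id;
```

Leave out the date predicate and the query still benefits from clustering, but has to look at every partition.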
Table clustering in BigQuery is nice to have, but much like pre-sorting a direct path load into Oracle database tables, it takes a lot of rebuilding and careful querying to get the most benefit from, and given the size of most BigQuery tables, I doubt that rebuilding will happen much in practice.
BI tools such as Looker can make use of table partitioning and clustering in queries straight away, as no changes are required in the query SQL you write; everything is handled in the table definition. Where you might want to set up partitioning yourself as a Looker developer is for the persistent derived tables (PDTs) that Looker can create to materialize the results of a view you define in your Looker model, for example to derive a dimension attribute using a subquery that calculates the sequence number of a user’s order for retention analysis on an eCommerce dashboard, as shown in the screenshot below.
Looker has for some time come with database-specific settings for particular source database types, such as those used for Redshift in the example above, and now supports date partitioning for persistent derived tables through a new partition_keys setting, as announced in this recent Looker forum post.
Finally, if you’re interested in how partitioning is developing as a feature within BigQuery, and some of the edge cases and requests for additional functionality that users of partitioning are asking for, this feature request on the BigQuery issue tracker is useful to read through.
Extending support beyond 2,500 partitions seems to be the most popular request, along with allowing integer and string datatype columns to be used for the partition key; but also look out for issues around re-loading data into historic partitions, and the cost and work involved in rebuilding large tables to re-cluster their data or apply partitioning for the first time.
I started the Drill to Detail Podcast series back in 2016 in order to talk about some of the innovations coming out of the big data analytics space with the product teams, industry analysts and consultants working together to deliver new-world, cloud-hosted big data analytics platforms.
A year and a half later we’re at Episode 50, and to mark the occasion Stewart Bryson, our first guest on the show back at Episode 1 and a regular since, comes back on, joined by Alex Gorbachev, to talk about Google Cloud Platform, Looker, Fivetran and SnowflakeDB, with the episode now available for download along with all 49 other episodes on the Drill to Detail Podcast website …
… and as we’re now at the 50th episode of the show, here’s the Top 10 episodes by download statistics as of January 2018 — thanks to everyone who’s come on the show and everyone who’s listened to the show via iTunes and other podcast providers, and congratulations to Maxime Beauchemin:
I’ve been meaning to mention this new product on the blog for some time now, as it’s one of the tools I use almost every day at work and back home on my personal data projects. Google Cloud Dataprep is Google’s new serverless data preparation tool, a new category of ETL tool aimed at analysts and business users looking to load and prepare their own datasets for analysis, rather than developers looking to create industrial-strength ETL routines into corporate data warehouses.
In my case I’ve been using Cloud Dataprep to take the raw event data I’ve been landing into BigQuery from my various IoT, social media and wearable devices, and using it to add descriptors, standardise and then join the various feeds together so it’s easier to query using tools such as Google Data Studio and Looker. Based on technology originally from Trifacta and then extended by Google to add BigQuery and GCS as source and target options, it uses Google Cloud Dataflow as the underlying transformation engine and bases its pricing on a multiple of the Cloud Dataflow jobs Dataprep runs on (currently 1.16x, though that could change once it comes out of beta), making the cost of a typical data prep job just a few pence, at least for my data volumes.
To take an example, I have a set of Google BigQuery tables that receive data via streaming BigQuery inserts sent over by a FluentD server running on a Google Compute Engine VM. The screenshot below shows one of the tables with data that’s come in from the Fitbit health tracker I use, which sends over daily summary numbers each morning for metrics such as active and inactive minutes, calories burnt and steps recorded.
Google Cloud Dataprep presents the data from each table or file source using a spreadsheet-like interface, allowing you to visually build up a set of sequential steps like the ones below that adjust a time recorded in AM/PM format to a 24-hour clock, and forward-fill missing values for weight readings that were skipped on certain days.
Something I’d like to have seen but isn’t in the product yet is any support for Google’s Natural Language Processing and other Cloud APIs, key value-adds within Google’s GCP platform as I talked about in an earlier blog post, but presumably they’ll begin to get added into the product as Google extend the core Trifacta codebase to leverage more Google Cloud-specific features. That said, what was inherited from Trifacta is a pretty comprehensive set of regular, windowing and data wrangling transformations pulled together into a script, or “recipe”, as shown in the screenshot below.
I can also use Cloud Dataprep to pivot and aggregate data into summary tables, in the example below taking another BigQuery data source containing individual tweets and other social media interactions stored in the same table column and then pivoting it to give me one column per interaction type and counts of those interactions per day.
Then, you can string together preparation steps into a sequence to produce your final output BigQuery table or file, as I’ve done in the screenshot below to produce my final prepared and combined health data table.
Then I finally go back to the BigQuery Web UI and check out my freshly-prepared table, which I can thereafter keep up to date with new data by scheduling that same Google Cloud Dataprep recipe to run every night, appending newly-arrived data to the same BigQuery table.
Google Cloud Dataprep is currently in open beta and if you want to give it a try too, it’s accessible from the Google Cloud Platform console within the Big Data group of products.
The idea for the session came from a request from one of the SIG members to come in and talk about why these “new-world” Hadoop and NoSQL technologies were introduced over the years, what problem they solved at the time and how they solved it. The presentation goes through the various evolutions of data warehousing and big data technologies since 2003, and finishes off with a look at the Google Cloud Platform and Oracle Elastic Compute Big Data platforms I’m using in the work I’m doing today.
Thanks again to the UKOUG for hosting the event in London, and to Patrick Hurley and the SIG committees for inviting me to speak.
Qubit’s innovation in the space was to move beyond simple A:B testing and cookie-based personalization to create an event-level, petabyte-scale customer activity data lake running in Google Cloud that enables retailers and other organizations to deliver personalized offers and site features based on a much more granular understanding of customer behavior and preferences, the same vision in fact that I kept putting forward in my webinars and presentations last year around data reservoirs and customer 360-degree analysis.
And of course, this centralized, event-level store of customer activity and purchase preferences creates a fantastic platform on which to build predictive models, real-time next-best-offer decision engines…
… and enable real-time big data analytics — with Qubit’s product in this area, Live Tap, being what I’ve been working on since last year’s OpenWorld, advising their Product Management team and working under Paul Rodwick, who some of you might know from his time as head of Oracle’s BI Product Development team. I’ll write at a later date about Live Tap and the work I’m doing there, my experiences creating an analytics product on top of BigQuery, and our use of Looker to create a semantic model over Qubit’s event-level data lake.
The podcast episode with Alex Olivier talks about Qubit’s journey from initially using Amazon AWS to land and process data using S3 buckets and MapReduce, then moving it all onsite to a cluster of thousands of HBase region servers storing data ingested and processed using Storm with latency down to four hours, to their current setup using Google BigQuery, Google PubSub and Google Cloud Dataflow processing 100,000 events per second and making it all available to customers with latency around 5 seconds — seriously impressive and a great case study around the use of cloud-hosted, elastically-provisioned big data analytics platforms.