Google Cloud Dataprep: Spreadsheet-Style Data Wrangling Powered by Google Cloud Dataflow

Mark Rittman

I’ve been meaning to mention this new product on the blog for some time now, as it’s one of the tools I use almost every day at work and back home on my personal data projects. Google Cloud Dataprep is Google’s new serverless data preparation tool, a new category of ETL tool aimed at analysts and business users looking to load and prepare their own datasets for analysis, rather than developers looking to create industrial-strength ETL routines into corporate data warehouses.

In my case I’ve been using Cloud Dataprep to take the raw event data I’ve been landing into BigQuery from my various IoT, social media and wearable devices and using it to add descriptors, standardise and then join various feeds together so its easier to query using tools such as Google Data Studio and Looker. Based on technology originally from Trifacta and then extended by Google to add BigQuery and GCS as source and target options, it uses Google Cloud Dataflow as the underlying transformation engine and bases pricing on a multiple of the Cloud Dataflow jobs Dataprep ran on (currently 1.16x, though that could change once it comes out of beta) making the cost of typical data prep job just a few pence, at least for my data volumes.

To take an example, I have a set of Google BigQuery tables that receive data via streaming BigQuery inserts sent over by a FluentD server running on a Google Compute Engine VM. The screenshot below shows one of the tables with data that’s come in from the Fitbit health tracker I use, which sends over daily summary numbers each morning for metrics such as active and inactive minutes, calories burnt and steps recorded.

Google Cloud Dataprep presents the data from each table or file source using a spreadsheet-like interface, allowing you to visually built-up a set of sequential steps like the ones below that adjust a time recording in AM/PM format to 24hr clock, and forward-fill missing values for weight readings that were skipped between certain days.

Something I’d like to have seen but isn’t in the product yet is any support for Google’s Natural Language Processing and other Cloud APIs, key value-adds within Google’s GCP platform as I talked about in an earlier blog post, but presumably they’ll begin to get added into the product as Google extend the core Trifacta codebase to leverage more Google Cloud-specific features. That said, what was inherited from Trifacta is a pretty comprehensive set of regular, windowing and data wrangling transformations pulled together into a script, or “recipe”, as shown in the screenshot below.

I can also use Cloud Dataprep to pivot and aggregate data into summary tables, in the example below taking another BigQuery data source containing individual tweets and other social media interactions stored in the same table column and then pivoting it to give me one column per interaction type and counts of those interactions per day.

Then, you can string together preparation steps into a sequence to produce your final output BigQuery table or file, as I’ve done in the screenshot below to produce my final prepared and combined health data table.

Then I finally go back to the BigQuery Web UI and check-out my freshly-prepared table that I can thereafter keep up-to-date with new data by scheduling that same Google Cloud Dataprep recipe to run every night, appending newly-arrived data to that same BigQuery table.

Google Cloud Dataprep is currently in open beta and if you want to give it a try too, it’s accessible from the Google Cloud Platform console within the Big Data group of products.