Posts tagged Cloud Computing
Using Google BigQuery, Google Cloud Natural Language API and Looker to Work-Out Exactly How Much…

Back when the incident with my WiFi kettle went viral on Twitter and ended up in the national newspapers, on Radio 4’s The Today Programme and as one of the questions on Have I Got News For You, I scraped the reader comments from the Guardian article that first broke the story along with the Mail Online follow-up article that put a more negative spin on it (“Couldn’t he just flip the switch”, etc).

I thought it’d be interesting to prep and then load all the comments in Google BigQuery using Google’s new Cloud Dataprep tool, then analyze it all using Google’s Natural Language API and some geocoding libraries to find out exactly how much, and at what intensity, Daily Mail readers hated me and what parts of the country I ought to steer clear of for a while, or at least be prepared to make a cup of tea using a kettle that I have to walk over to and switch on.

Visualizing the data that came out of the exercise using Looker, the BI tool I use day-to-day in my role as Product Manager for an analytics service running on Google Cloud Platform, the first thing I spotted was the significant difference in sentiment between Guardian readers (green in the chart below) and Mail Online (red), where you can see the Guardian picking up the story first and comments being typically neutral or positive, whereas the Mail’s readers hated the story almost right from the start and hated it even more the next morning.

As the Mail Online lists the reader’s location alongside each comment I was able to generate latitude and longitude map references for most of them, and mapping them onto a Looker map visualization shows exactly which parts of the UK hated me the most that day.

Clearly The Mail Online’s Scottish readers typically weren’t impressed with this English innovation, and nor were the Mail’s ex-pat readers down-under who seemingly felt the same way.

Once the Mail Online readers’ woke-up to the story in the United States they had a whole new level of disgust with that geek and his rubbish WiFi kettle.

I was about to blame Piers Morgan for this obvious change of attitude towards us Brits but then spotted, reassuringly, that they seemed to hate the Mail Online even more — all is apparently not lost with our American cousins.

So back to the UK. At postcode area level, then, which parts of the UK are a no-go area for a “Geek [who] buys himself a WiFi kettle then spends 11 HOURS trying to make it boil … to work alongside other devices in his £550,000 home.” — nice touch from the Mail journalist there, my wife was actually more annoyed they quoted the price we paid for it several years ago whereas I was hoping they’d bring in that other Daily Mail obsession about whether something does, or does not give you cancer through in this case the kettle’s WiFi base-station. Anyway, Hampshire’s a no-go area for me now, along with parts of Wales and the West Midlands although I’m safe in North Norfolk and Humberside at least for now.

If it all gets too much for me in the UK I can at least take refuge in Hong Kong, a place where they obviously appreciate innovative kitchen appliances integrated with other appliances through python scripts that work until the first device gets a new IP address when you unplug and then plug it in again to move it across the room.

Though I need to get there via some means other than stowing away on a container ship bound for Shanghai from the ports at Gateshead and Southampton, though they’re still a better option than hitching a ride on a trawler from Blackpool, the official centre-of-hate for this story in the UK.

Mail Online readers from Blackpool’s dislike for the story was surpassed only by the intensity of that sentiment, another metric that I could get out of the Google Cloud Platform NLP API that helps you understand the strength of the feeling someone had when expressing an opinion, not just the words they used.

So how did this all look when I swapped out Mail Online readers for the five thousand or so Twitter users that retweeted, commented on or otherwise helped spread the story around the Internet those two days back in October 2016? The story started off from the UK, obviously, but with a negative spin that over time, as it travelled around the world became more positive…

… although the Australians clearly still weren’t impressed.

If you’re in London this Tuesday evening and interested to know more about the work I do with BigQuery and Looker in my Product Management role at Qubit, come along to the Conversion Rate, Analytics and Product Unconference Meetup where there’s still a few spaces free at the time of writing and I’m one of the three speakers that evening.

What BI Development Looks like with BigQuery, Google Cloud APIs, Looker and Fluentd (courtesy of…

For the last six months I’ve been working in London helping build out an analytics service built on Google’s Cloud Platform, BigQuery and the Looker BI tool. It’s been a very interesting and rewarding experience working within product management in an engineering team using agile development practices, developing with node.js and clojure and shipping releases every few months.

However, working in product management rather than in development or consulting means I’m more likely to be creating product roadmaps and working on partner enablement than developing using BigQuery and Looker, so I try and use the journey time up to London each day to do some development myself, work out what development requests and users stories would be easy or hard to have our engineering team deliver … and because as a techie and career-long BI and data warehousing developer this stuff is just too cool and too interesting to not get developing with.

So over the past few months I’ve been taking the IoT and wearables data analysis system I put together using Hadoop running on servers back home and ported it to Google BigQuery and more recently Looker running in the cloud, using Fluentd as the log aggregation engine running on a Google Compute Engine-hosted VM to collect, process and ship to BigQuery all the data from wearable devices, social media and other services along with IoT event data from SmartThings that I previously fed into logstash and Hadoop. Adding Google Cloud Storage for staging incoming data extracts from services such as Moves and Google Takeout and running it all in Google Cloud Service, this updated version of my analytics architecture looks like the diagram below.

Typical cost for running a VM hosting a simple Fluentd server and running all month is around $80. BigQuery is more or less free for developer use with charging based largely on how much you query with some costs for streaming inserts and data storage, but with the first 1TB of queries free each month and minimal storage needs, Google’s cloud data warehouse platform has pretty-much taken over from Oracle as my day-to-day development platform because, like Oracle did in the days of OTN downloads of on-premise developer tools and databases with permissive licensing for personal training and developing prototype applications, you can learn the technology essentially for free and really get to know and understand the tools without worrying about short trial license periods or having to pay thousands of dollars for full commercial licenses.

I’ve written and presented on Google BigQuery a few times since being introduced to it late last year so I won’t go into how this distributed, column-store query engine based on Google’s Dremel engine works, but there’s another technology I used as part of this round of development, an open-source technology called Fluentd that I used instead of Flume and Logstash for the log aggregation part of this project that I’ve found particularly useful given its range of input and output data source plugins, and its support for PubSub, BigQuery and other parts of the Google Cloud Platform making it easy to accept data from IFTTT and SmartThings and then stream it, lightly transformed and consistently timestamped, into a set of BigQuery tables.

For example, creating a data feed that brought in metrics on car journeys via the Dash IoS app and a small bluetooth device that plugs into your car’s diagnostics port involves first registering an IFTTT applet to trigger when that app registers a journey complete event, with the applet payload then being an IFTTT Maker webhook that sends the journey metrics to Fluentd in JSON format as an HTTP POST request.

Fluentd then receives the JSON document using its HTTP Input plugin, processes, transforms and timestamps the record and then sends it as a streaming insert into a one of the tables I’ve setup in Google BigQuery, over which I’ve then created a SQL view that transforms all of IFTTT’s odd “3h 5m” and “August 14, 2004 at 10:15PM” date and time formats into additional timestamps that Looker and other BI tools can automatically parse into date and time elements.

select date_time, 
timestamp_trunc(date_time,DAY) as date_day,
PARSE_TIMESTAMP('%B %d %Y %I:%M%p',REPLACE(REPLACE(StartedAt,', ',' '),' at','')) as journey_start_date_time,
PARSE_TIMESTAMP('%B %d %Y %I:%M%p',REPLACE(REPLACE(EndedAt,', ',' '),' at','')) as journey_end_date_time,
TIMESTAMP_DIFF(PARSE_TIMESTAMP('%B %d %Y %I:%M%p',REPLACE(REPLACE(EndedAt,', ',' '),' at','')),
PARSE_TIMESTAMP('%B %d %Y %I:%M%p',REPLACE(REPLACE(StartedAt,', ',' '),' at','')), MINUTE) AS journey_mins,
cast(replace(DistanceDrivenWithLabel,' Km','') as FLOAT64) as distance_km,
cast(replace(FuelConsumedWithLabel,' l','') as FLOAT64) as fuel_l,
cast(replace(AvgDistancePerHourWithLabel,' kph','') as FLOAT64) as speed_kph,
cast(replace(AvgFuelConsumptionWithLabel,' kpl','') as FLOAT64) as fuel_econ_kpl,
from `xxxxx.personal_metrics.fluentd_dash_journeys`

One of the nice things you can do with Fluentd is call arbitrary ruby scripts to enrich incoming data via the Fluentd script plugin, something I’ve used to analyze for sentiment and key entities all incoming tweets, emails, article saves and other social media and other communications using Google’s new Natural Language Processing API. To do this I pass all incoming records of these types through a filter definition in the Fluentd config file, like this…

<filter reformed.socialmedia>
type script
path /home/mark/gcp_nlp/nlp_process.rb

…with that script then using the Ruby Cloud Natural Language Client to calculate the sentiment score and magnitude of that sentiment, extract the entities from the incoming text body and then pass these as new Fluentd record keys back to the output plugin to write as additional columns into the BigQuery table I used to store all these communication events.

def configure(conf)
require "google/cloud/language"
def start
def shutdown
def filter (tag, time, record)
require "google/cloud/language"
project_id = "xxxxxxxx"
language = project: project_id
text = record["post_body"]
document = language.document text
sentiment = document.sentiment
entities = document.entities
record["sentimentscore"] = "#{sentiment.score}"
record["sentimentmagnitude"] = "#{sentiment.magnitude}"
entitylist = ""
entities.each do |entity|
entitylist = entitylist + "#{} "
record["entityname"] = "#{entitylist.strip}"

Fluentd can also collect the device and sensor events from my SmartThings hub using a groovy “smartapp” running in Samsung’s SmartThings backend cloud service. All that’s left to do is join the incoming hub events to a lookup table that classifies the IoT events as sensors, switches, colour settings and other event types and groups them by room and you’ve got a useful set of home enviroment and activity data to join up with the other dataset.

Other services such as Moves have APIs that I connect to every few hours using clients such as this one also on Github that runs on the same Google Compute Engine VM that runs the FluentD log collector, with new movement data copied across to a Google Cloud Storage bucket and inserted every few hours into another BigQuery table ready for analysis.

I’ve also been collecting weight readings and sleep readings but these can be sporadic, with weight collected via a Withings WiFi scale but often only every few days, whilst sleep time in minutes should be recorded every day but sometimes I forget to wear my health band, so I used BigQuery SQL’s analytic windowing functions to calculate my average weight over 7 days (a good approach anyway as your weight can fluctuate day-to-day), and over three days for sleep.

AVG(weightKg) OVER(ORDER BY ts
SELECT TIMESTAMP_TRUNC(d.date_time,DAY) as date_time, avg(w.weightKg) as weightKg, avg(w.FatMassKg) as FatMassKg, UNIX_SECONDS(TIMESTAMP_TRUNC(d.date_time,DAY)) AS ts
FROM `aerial-vehicle-148023.personal_metrics.date_dim` d
LEFT OUTER JOIN `xxxx.personal_metrics.fluentd_weighings` w
ON d.date_time = TIMESTAMP_TRUNC(w.date_time,DAY)
GROUP BY d.date_time, ts
order by date_time asc

These two views are then left-outer joined to the main health daily view so that I’ve got weight and sleep readings every day alongside step count, calories burned and other metrics from my Jawbone UP health band.

So where this leaves me now is at the point where I’ve got a pretty interesting dataset running in a Google BigQuery cloud-hosted column-store data warehouse, with the data updated in real-time and enriched and transformed ready for analysis.

For now though the thing that’s made this all really interesting is joining it all up and then analyzing it as one big dataset using Looker, another tool I talked about recently on this blog and one which re-introduces an old concept familiar to BI veterans, the “semantic model”, to this new world of distributed, cloud-based big data analytics.

Looker the company and Looker the BI tool are both attracting a lot of attention right now after Google’s recent investment, and the general buzz around big data analytics running on Google and Amazon’s cloud platforms. I’ve been working with Looker now for around six months as the preferred BI tool for the platform I’m working with, and Looker’s philosophy towards analytics together with the parallels I can see between the enterprise semantic models I used to create working with tools such as Oracle BI and Looker’s updated take on that idea led me to sign-up as a developer partner with Looker in order to get to understand their platform and technology even closer, the same thought process I went through just over 10 years ago when Oracle acquired Siebel and introduced me to Siebel Analytics and what became the focus of my career and the subject of two books, what in-time became Oracle Business Intelligence.

To be clear and with the benefit of having worked with BI tools for almost 20 years Looker the BI tool is still very primitive in many ways — in terms of data visualization options and general UI and usability it’s frankly, pretty basic and it’s quirky use of terminology such as “looks” (for reports), “explores” (for subject areas) and “views” (for tables) and many activities we’re used to being simple and intuitive in existing BI tools such as creating and saving catalogs of reports are non-obvious to end-users or just plain missing … but crucially, all of these missing features or UI quirks can be iterated on in a SaaS product and instead, Looker have got two things right:

  • Their SQL query generation engine was designed from day one to work with distributed query engines like BigQuery
  • They’ve taken the concept of a semantic model, data model abstraction and combining enterprise datasets but crucially, done it in a way that software engineers and data analysts working in their target market can relate to

Looker was designed from the start to query the underlying data source directly rather create a mid-tier cache and query that instead of the underlying data source; Tools like Tableau and Qlikview did well for many years running their user queries of a local cache on the users’ desktop and gave those users much faster response times than they were used to with old, enterprise BI tools that tried to speed up queries by creating layers of aggregate tables, but this approach just isn’t practical with massive, constantly-updated big data sets

Looker also generates SQL queries aware of the way platforms like BigQuery charge for their use — by the amount of data queried rather than server processors or named users. BigQuery is a column-store database that brings back data only for those columns you ask for, but tools like Tableau routinely request all columns for a table making them impractical to use for these types of data stores; Looker also aggressively elimates joins from queries and supports the nested column approach preferable for distributed data stores like BigQuery where joins are much more expensive to process than we’re used to with relational databases

Finally, Looker has introduced a concept very familiar to traditional BI developers but new to most developers working on big data analytics projects, the concept of a semantic model, metadata layers and data abstraction; importantly though, the approach they’ve taken in LookML is designed to appeal to the actual developers building out these types of systems today — software engineers working on web and big data projects who write code, use modern development lifecycles and version everything in distributed version control system such as Git.

So to take the dataset I put together in BigQuery, the first step in the development process with Looker is to create what’s termed “views” for each of the underlying BigQuery table or view sources; in the example below you can see an (abbreviated, there’s more measure definitions in the full view definition file) definition of the “health stats” LookML view, and things to note in this modelling approach are (1) it’s all code, there’s no GUI for developing this all graphically, which suits software engineers just fine as it’s easier from day one to code, version and diff (compare) with other code releases; (2) the dimension_group we define at the start takes care of extracting all the different time periods out of a BigQuery timestamp datatype, and (3) the measure type definitions of sum, average and so on are there to give Looker more flexibility in aggregating (and symetrically aggregating, what we used to call fan-trap avoidance in the old days) measure values.

view: v_health_stats {
sql_table_name: rittman_dw.v_health_stats ;;
dimension_group: date_day {
type: time
hidden: yes
timeframes: [
sql: ${TABLE}.date_day ;;
dimension: pk {
type: string
hidden: yes
primary_key: yes
sql: ${TABLE}.pk ;;
measure: total_active_mins {
type: sum
sql: ${TABLE}.active_mins ;;
value_format_name: decimal_2
measure: avg_sleep_mins_avg_3_days {
type: average
label: "Avg Sleep in Mins"
sql: ${TABLE}.sleep_mins_avg_3_days ;;
value_format_name: decimal_2
measure: avg_weightKg_avg_7_days {
type: average
label: "Avg WeightKg"
sql: ${TABLE}.weightKg_avg_7_days ;;
value_format_name: decimal_2
measure: total_distance_km {
type: sum
sql: ${TABLE}.distance_km ;;
value_format_name: decimal_2

Once you’ve defined all your views, the next step is to define your model made up of what are termed “explores” — think of models as the equivalent of a universe in Business Objects or a repository in Oracle Business Intelligence, and explores as subject areas linking together through sets of views that join on common columns.

connection: "rittman_bigquery"
# include all the views
include: "*.view"
# include all the dashboards
include: "*.dashboard"
label: "Rittman DW"
explore: fluentd_log {
label: "Rittman BigQuery DW"
view_label: "1 - Home Metrics"
join: v_communications {
type: left_outer
view_label: "4 - Comms & Social Media Metrics"
sql_on: ${fluentd_log.date_date} = ${v_communications.date_date} ;;
relationship: many_to_one
fields: [v_communications.post_author,v_communications.post_body...]
join: v_health_stats {
type: left_outer
view_label: "2 - Health Metrics"
sql_on: ${fluentd_log.date_date} = ${v_health_stats.date_date} ;;
relationship: many_to_one
fields: [v_health_stats.total_steps, v_health_stats.total_distance_km...]
join: v_car_trips {
type: left_outer
view_label: "6 - Car Journey Metrics"
sql_on: ${v_car_trips.date_date} = ${fluentd_log.date_date} ;;
relationship: many_to_one
fields: [v_car_trips.fuel_econ_kpl...n]

Truth be told, the net effect of all this isn’t conceptually much different to the enterprise semantic layers I’ve built for Oracle Business Intelligence and other similar systems over the years, and LookML is in many ways really just today’s implementation of what nQuire did back in the late 90’s with Logical SQL and their logical and physical abstraction layers over physical data sources; but again, crucially, Looker’s LookML metadata layer was designed primarily for use through scripting and modern software development practices…

and aligns perfectly with the development lifecycle and tooling used by the types of millenial developers who wouldn’t be seen dead working with the sort of development lifecycle I used to present about a few years ago for an earlier generation of BI tools.

What this gives me once all the development work is complete and data is ready to be analyzed in looks and dashboards is something that looks like the screenshot below; note the semantic model on the left-hand side with subject areas corresponding to the BigQuery views but with further grouping, friendly labelling and organizing into measures and dimension hierarchies and attributes.

As Looker’s SQL queries the underlying BigQuery data store directly rather than an aggregated subset of that data, I can query right down to the lowest detailed-level events in the BigQuery dataset with new events streaming in all the time. Using Looker’s set of tabular, chart and mapping visualization options I can map out where I’ve been over period of time, what I did and correlate that with other data in the dataset, each view within the Looker explore joined together by date so it becomes one big dataset for analysis, as I tweeted about one evening last week when it all came together.


Now it’s all running there’s more I’d like to do with the various Google Cloud Platform APIs including using their new Cloud Vision API to classify and make searchable all the family photos I’ve collected over the years, and there’s lots of Looker features and capabilities I’ve not yet explored in detail including Looker Actions, a concept very familiar from my days working with Oracle BI and its Action Framework. For now thought I thought it would be interesting to share my thoughts on this new platform and the development process from the perspective of a long-term traditional BI+DW developer and architect, feel free to add comments or ask questions if you’re thinking of starting out with BigQuery, Looker or Fluentd in the future.

Slides from My “New World Hadoop Architectures (& What Problems They Really Solve) for Oracle DBAs”…

If you enjoyed my post the other day about BigQuery, Looker and Big Data’s Rediscovery of Data Warehousing … at Google Scale I’ve uploaded the slides on a talk I gave today on this topic at the UKOUG Database and RAC SIG event in London.


The idea for the session came from a request from one of the SIG member to come in and talk about why these “new-world” Hadoop and NoSQL technologies were introduced over the years, what problem they solved at the time and how they solved it. The presentation goes through the various evolutions of data warehousing and big data technologies over the years since 2003 and finishes off with a look at the Google Cloud Platform and Oracle Elastic Compute Big Data platforms I’m using in the work I’m doing today.

Thanks again to the UKOUG for hosting the event in London, and to Patrick Hurley and the SIG committees for inviting me to speak.

Qubit’s Journey to Petabyte-Scale Machine-Learning and Analytics on Google Cloud Platform … and…

My guest on this week’s Drill to Detail Podcast is Alex Olivier from Qubit, a startup based in London founded by four ex-Googlers that uses big data technology and machine learning to deliver personalized experiences and product recommendations to customers of some of the biggest names in e-commerce, travel and online gaming.

Qubit’s innovation in the space was to move beyond simple A:B testing and cookie-based personalization to create an event-level, petabyte-scale customer activity data lake running in Google Cloud that enables retailers and other organizations deliver personalize offers and site features based on a much more granular understanding of customer behavior and preferences, the same vision in fact that I kept putting forward in my webinars and presentations last year around data reservoirs and customer 360-degree analysis.

And of course, this centralized, event-level store of customer activity and purchase preferences creates a fantastic platform on which to build predictive models, real-time next-best-offer decision engines…

… and enable real-time big data analytics — with Qubit’s product in this area, Live Tap, being what I’ve been working on since last year’s Openworld advising their Product Management team and working under Paul Rodwick, who some of you might know from his time as head of Oracle’s BI Product Development team. I’ll write about Live Tap and the work I’m doing there, and my experiences creating an analytics product on top of BigQuery at a later date, as well as our use of Looker to create a semantic model over Qubit’s event-level data lake.

The podcast episode with Alex Olivier talks about Qubit’s journey from initially using Amazon AWS to land and process data using S3 buckets and MapReduce, then moving it all onsite to a cluster of thousands of HBase region servers storing data ingested and processed using Storm with latency down to four hours, to their current setup using Google BigQuery, Google PubSub and Google Cloud Dataflow processing 100,000 events per second and making it all available to customers with latency around 5 seconds — seriously impressive and a great case study around the use of cloud-hosted, elastically-provisioned big data analytics platforms.

There’s also a video of Alex presenting on Qubit’s architecture at last year’s Google NEXT event in London on Youtube, and this episode along with all the others is also available for download on the iTunes Podcast Directory where you can subscribe for free and automatically download new episodes as they become available Tuesday of every week.

Oracle’s Big Data Platform goes Cloud, Becomes Elastic and Suddenly Looks Very Interesting.

Oracle held their annual Openworld customer event last week in San Francisco, and amongst all the other cloud services, infrastructure and other new products Oracle announced over the various keynotes, for me the most interesting thing was the Thomas Kurian cloud innovation keynote on Tuesday where Jeff Pollock came on to demonstrate Oracle’s new Elastic Big Data & Streaming platform, around 30 mins into the video below:

Those of you who follow what cloud platform vendors such as Microsoft and Amazon are doing to provide Hadoop and analytics services (HD Insight for Microsoft Azure, Elastic MapReduce for Amazon AWS) will recognise the similar approach Oracle are taking to the base Infrastructure-as-a-Service layer; as Thomas Kurian shows below before Jeff comes on, familiar Hadoop components (Spark, Hive, ML, Spark SQL) run through YARN and store data in an object store (think S3) running in an elastic cloud environment. What’s new for Oracle customers is consuming Hadoop and data lake functionality as services managed and exposed through APIs, a more abstracted approach that makes it possible to elastically scale the service up-and-down and offer lower initial price-points to startups and kick-the-tires customers creating data lakes in the cloud without any formal IT approval.

Note also Hortonworks in the diagram rather than Cloudera, Oracle’s Hadoop platform partner up until now. This could be because Hortonworks’ HDP platform suits being turned into services better than Cloudera’s, or they offered a better deal for what I’d imagine is mostly open-source products with Hortonworks committers and associated support — or more probably because Cloudera are moving increasingly into cloud services themselves, and seem to be aligning with Amazon and Microsoft’s cloud platforms rather than Oracle’s, as I noted in a tweet earlier today:


As I said, offering Hadoop as services on an elastic cloud platform is nothing new, just tables stakes if Oracle wants to be in this game (and they do). Where things then got interesting is when the demo and the tooling came out. First off was a packaged industry IoT application using predictive analytics to tell the customer which device or sensor is likely to fail before it actually does, giving them the chance to repair or replace the component before there’s a service outage. So far, standard stuff though the execution and UX looked slick, something Oracle are increasingly good at if what you last saw was Oracle Forms and Java applets in the browser.

Then we got on to the underlying services that provide data and insights for the IoT packaged app. You’ve probably seen articles and blogs I’ve written on Oracle’s Big Data Preparation Cloud Service tool, but Data Flow Machine Learning is Oracle’s long-anticipated cloud data integration service that uses dataflow-style mappings, Spark as the execution engine and machine-learning smarts and semantic discovery to automate as much of the development process as possible — this paper back from Spark Summit 2015 explains some of the supervised and semi-supervised machine learning algorithms that Oracle (via their Silver Creek Systems acquisition) used when creating their cloud data prep service.

Then the demo moved on to setting up data ingestion services using Kafka and Spark services running real-time and streaming, and then showed target schemas being defined and created automatically for the user.

And then it got on to Data Flow Machine Learning…

… and that’s worthy of a blog post all of it’s own given how significant it’s going to be going forward, and I’ll post that in a day or so’s time.