Posts tagged Data Science
Graph Analysis in the “How a Tweet Went Viral” Conference Presentation

Earlier this week I presented a session at the BIWA 2017 conference in San Francisco on using Oracle Big Data Spatial & Graph to understand how my WiFi kettle tweet went viral back in October last year, by using graph analysis and data visualization tools like Cytoscape and Tom Sawyer Perspectives.


I’ve uploaded the slides and embedded the demos within them as a series of YouTube videos, so you can see the new Timeslice-analysis feature in the Oracle Cytoscape plugin we developed for the presentation, along with the mapping and analysis features in Tom Sawyer Perspectives that helped us work out exactly how the story went viral and who helped this happen … as you’ll find out in the slides.

Data Lakes at Google Scale, The End of Meaningless Customer Experiences, and UKOUG Tech’16 in…

The other day I wrote a blog post on Google BigQuery and how it enables petabyte-scale data lake projects, along with the customer-360 / digital marketing platforms I talked about last year at various events and webinars.

If you want to see an example of this type of platform in real life, the video below featuring Qubit’s Alex Olivier presenting at Google’s NEXT London goes through their evolution from on-premises Hadoop to running it all at petabyte scale today on Google BigQuery and Google Cloud Platform.


I’m working there at the moment doing some product management around analytics, and the video below from the same team is a great example of what machine learning looks like when productised and used at-scale for personalisation, recommendations and opportunity mining.


Separate to this, I’ll be at the UKOUG Tech’16 Conference and Exhibition next week in Birmingham talking about big data, IoT, analytics and data visualization on Oracle, cloud and open-source platforms, building on the story I delivered back at Voxxed Bristol earlier in the year …


… and also telling the story one last time of how I ended up on the front page of the Guardian and Daily Mail, discussed on The Today Programme and as a punchline on Have I Got News For You …


… just because I spent all day trying to voice-control my kettle and feed its data into the Hadoop cluster in my garage, as you do.

See some of you in Birmingham next week, and in Malta for another Oracle event if anyone’s there as well.

New Oracle Magazine Article on Oracle Big Data Spatial & Graph for Social Network Analysis

Back at the beginning of 2016 I presented a session at the BIWA 2016 Conference on using Oracle’s (at that time) new Big Data Spatial and Graph product to do what’s called “social network analysis”: a form of graph analysis that uses the connections formed on social networks, in this case Twitter, to help understand who’s central to a particular network, who influences whom, who are the best connectors through the network to someone you might want to speak to, and so on.
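To make the “who’s central” question a little more concrete, here is a minimal sketch in plain Python. It is not the Oracle Big Data Spatial and Graph API, and the user names and retweet pairs are invented for illustration; it simply counts how often each user is retweeted, which is the simplest centrality measure (in-degree).

```python
from collections import Counter

# Hypothetical (retweeter, original_author) pairs; not real Twitter data.
retweets = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("carol", "erin"),
    ("bob", "erin"),
]

def in_degree(edges):
    """Count how many times each user was retweeted (in-degree centrality)."""
    return Counter(target for _, target in edges)

# Users ordered from most to least retweeted.
ranking = in_degree(retweets).most_common()
print(ranking)
```

Real graph engines offer richer measures, such as betweenness for finding the “best connectors”, but the input is the same kind of edge list.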

Twitter works well as a network to analyse for these types of presentations, as most people are familiar with the tweets, replies, hashtags, retweets and other types of connections users can create on this social network. Oracle Big Data Spatial and Graph helps you identify not only the communities that users self-declare through hashtags (for example, “#oow16”) but also the ones they build up by just interacting with the same people over and over again, using a graph analysis technique called “clustering” that identifies groups of users based purely on their common networks of communication.
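As a rough illustration of the idea of grouping users by who they repeatedly talk to, the sketch below uses a deliberately crude stand-in for a real clustering algorithm: it keeps only the pairs of users who interacted at least twice and then takes the connected components of what is left. The interaction log, the `communities` function and the `min_interactions` threshold are all assumptions made for this example, not anything taken from the Oracle product.

```python
from collections import Counter

# Hypothetical interaction log: each pair is one reply or retweet between two users.
interactions = [
    ("alice", "bob"), ("bob", "alice"), ("alice", "carol"), ("carol", "alice"),
    ("bob", "carol"), ("carol", "bob"),
    ("dave", "erin"), ("erin", "dave"), ("erin", "frank"), ("frank", "erin"),
    ("carol", "dave"),  # a single, one-off contact between the two groups
]

def communities(pairs, min_interactions=2):
    """Group users linked by repeated interactions (connected components)."""
    # Count interactions per unordered pair of distinct users.
    weights = Counter(frozenset(p) for p in pairs if p[0] != p[1])
    # Keep only the pairs that talk to each other repeatedly.
    strong = [tuple(sorted(pair)) for pair, n in weights.items()
              if n >= min_interactions]

    parent = {}

    def find(user):
        # Union-find lookup with path halving.
        parent.setdefault(user, user)
        while parent[user] != user:
            parent[user] = parent[parent[user]]
            user = parent[user]
        return user

    for a, b in strong:
        parent[find(a)] = find(b)

    groups = {}
    for user in list(parent):
        groups.setdefault(find(user), set()).add(user)
    return sorted(sorted(group) for group in groups.values())

print(communities(interactions))
```

With the default threshold the one-off contact between carol and dave is ignored, so two groups come out; lowering `min_interactions` to 1 merges everyone into a single group.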

If you’re interested in the topic and want to understand a bit more about how this type of graph analysis works, as well as see how this new Oracle Big Data product looks, the new September/October 2016 edition of Oracle Magazine has a follow-up article by me on the topic called “Social Network Analysis: Use Oracle Big Data Spatial and Graph to analyze social networks” that uses some Twitter data I provide, along with software you can download from OTN if you want to try it out yourself.

Once you’ve worked through the article, be sure to check out the official blog from the product team for more scripts and examples, including product recommendations using the Personalized PageRank algorithm and fraud detection in finance by identifying circular payment relationships. These are two very good examples of data analysis that would be very hard to perform using traditional relational data models.
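For a feel of how the recommendation example works, here is a minimal sketch of Personalized PageRank by power iteration in plain Python. It is not the product team’s script: the customers, products, damping factor and iteration count are all assumptions chosen for illustration. The only difference from ordinary PageRank is that every “restart” jumps back to one chosen customer, so scores measure closeness to that customer.

```python
def personalized_pagerank(edges, source, damping=0.85, iterations=50):
    """Power-iteration PageRank where every restart jumps back to `source`."""
    neighbours = {}
    for a, b in edges:  # treat each purchase as an undirected link
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    scores = {node: 0.0 for node in neighbours}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in neighbours}
        nxt[source] = 1.0 - damping  # restart mass always returns to the source
        for node, score in scores.items():
            share = damping * score / len(neighbours[node])
            for other in neighbours[node]:
                nxt[other] += share
        scores = nxt
    return scores

# Hypothetical (customer, product) purchases.
purchases = [
    ("ann", "kettle"), ("ann", "toaster"),
    ("bob", "kettle"), ("bob", "toaster"), ("bob", "blender"),
    ("cat", "kettle"), ("cat", "juicer"),
]

scores = personalized_pagerank(purchases, "ann")
owned = {product for customer, product in purchases if customer == "ann"}
products = {product for _, product in purchases}
# Rank the products ann does not own yet by their closeness to her.
recommendations = sorted(products - owned, key=scores.get, reverse=True)
print(recommendations)
```

Here bob shares two purchases with ann and cat only one, so bob’s blender ranks above cat’s juicer.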

From lots of reports (with some data Analysis) to Massive Data Analysis (With some Reporting)

Slides (and some embedded videos) from my session at the inaugural Danish BI Meetup in Copenhagen, “From lots of reports (with some data Analysis) to Massive Data Analysis (With some Reporting)” — on the future of BI, how Hadoop is the new analytic platform, and how the BI industry is changing with the advent of self-service data discovery tools.


Building Predictive Analytics Models against Wearables, Smart Home and Smartphone App Data — Here’s…

Earlier today I was the guest of Christian Antognini at the Trivadis Tech Event 2016 in Zurich, Switzerland, presenting on the use of Oracle Big Data Discovery as the “data scientists’ toolkit”, using data from my fitness devices, smart home appliances and social media activity as the dataset. The slides are here:


Whilst the main focus of the talk was on the use of BDD for data tidying, data visualisation and machine learning, another part of the talk that people found interesting but that I couldn’t spend much time on was how I got the data into the Hadoop data lake I put together to hold it. Whilst the main experiment I focused on in the talk just used a simple data export from the Jawbone UP website to get my workout, sleep and weight data, I’ve since gone on to bring a much wider set of personal data into the data lake via two main routes:

  • IFTTT (“If this, then that”) to capture activity from UP by Jawbone, Strava, Instagram, Twitter and other smartphone apps, which it then sends in the form of JSON documents via HTTP GET web requests to a Logstash ingestion server running back home, for eventual loading into HDFS and then Hive via Cloudera’s JSONSerde

  • A Samsung SmartThings “smartapp” that subscribes to smart device and sensor activity in my home, then sends those events as JSON documents, again via HTTP GET web requests to Logstash, for eventual storage in HDFS and Hive and hourly uploads into Big Data Discovery’s DGraph NoSQL engine
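Both routes boil down to the same step: wrap an event up as a JSON document and send it in an HTTP GET request. The post doesn’t show the actual request format, so the sketch below is only one plausible shape. The host name, the `event` query parameter and the event fields are all hypothetical, and no request is actually sent; the code just builds the URL and shows that the receiving side can recover the original document.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; the real Logstash host, port and path are not given here.
LOGSTASH_URL = "http://logstash.example.com:8080/"

def build_get_url(event, base_url=LOGSTASH_URL):
    """Serialise an event to JSON and carry it in a URL-encoded query parameter."""
    payload = json.dumps(event, sort_keys=True)
    return base_url + "?" + urlencode({"event": payload})

def decode_get_url(url):
    """What the receiving side would do: pull the JSON document back out."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["event"][0])

# Hypothetical sensor event, in the spirit of a thermostat reading.
event = {"device": "hall thermostat", "attribute": "temperature", "value": 19.5}
url = build_get_url(event)
print(url)
print(decode_get_url(url))
```

Carrying the payload in the query string keeps the sender trivially simple, at the cost of URL length limits; larger documents would normally go in a POST body instead.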

This all works well and means that my dashboards and analysis in Big Data Discovery are at most one hour behind the actual activities that take place in my smartphone apps or in my home, where the various connected heating thermostats, presence sensors and other devices feed their activity through my Samsung SmartThings hub and the smartapp that I’ve got running in Samsung’s SmartThings cloud service.

More recently, though, I’ve got the same activity also loading into Apache Kudu, the new real-time fast analytics storage layer for Hadoop, using a new Hadoop ETL tool called “StreamSets” that streams data in real time into Kudu and Cloudera Impala from a variety of source types.

This additional data loading and storage route now gives me metrics in actual real time (rather than with the one-hour delay caused by BDD’s hourly DGraph batch load), and gives me the complete set of events rather than the representative sample that Big Data Discovery normally brings in via a DGraph data processing CLI load.

Eventually the aim is to connect it all up via Kafka, but for now it’s a lovely sunny day in Zurich and I’m meeting Christian Berg for a beer and lunch in 30 minutes, so I think it’s time for a well-deserved afternoon off, and then prep for my final speaking commitment before Oracle OpenWorld next week … Dublin on Monday to talk about Oracle Big Data.