Posts tagged Oracle Cloud
Five Thoughts About Thomas Kurian’s Move to become CEO of Google Cloud Platform

The news broke late on Friday that Thomas Kurian is joining Google to become their new CEO for Google Cloud Platform. Five thoughts on Kurian’s move:

  1. It’s a smart move made by an exceptionally smart guy. Brave, as well, given the recent history between Oracle and Google, but also not surprising given his drive and the point he presumably has to prove. I met him a few times as part of Oracle’s ACE Director program and he’s the only software exec I know who can talk long-term strategy and vision one minute and then explain to you in detail how it all works, all with great humility and courtesy.

  2. The fact that GCP is spoken of as an also-ran at 10% market share whilst Oracle Cloud gets bundled in with the “Next 10” shows what a mountain Oracle have to climb to even become a contender against Microsoft and Amazon in the cloud business. Their insistence on only allowing their SaaS and PaaS apps to run in Oracle Cloud is a worrying parallel with the “Windows Tax” that Microsoft’s Office and Server product teams had to pay back in the Steve Ballmer era, but with Oracle’s equivalent to Satya Nadella having lost the argument and jumped ship.

  3. But Oracle will survive, and this has happened many times before. Ray Lane, Charles Phillips, Tom Siebel and Marc Benioff all left and in many cases founded massively successful and competitive businesses; client-server gave way to internet architecture and then internet gave way to cloud. It’s all part of how businesses renew and people move on and up, there are plenty more smart (and most likely, smarter) people left in Oracle, and Larry Ellison is still just as driven, competitive and on top of his game.

  4. Look out for a very interesting (as in Brexit: interesting to watch but not to be part of) culture clash at GCP, with TK about the most top-down leader of a product development team I’ve ever seen and Google famously engineering-focused, beanbag-friendly and bottom-up. Add to that the feelings most Googlers have towards Oracle and TK will have his work cut out early on to win the various teams over. My guess is that his work ethic, technical chops and sheer ability will do just that, and if this role is a success, Sergey and Larry may well have found their new Eric Schmidt, but this time with technical skills to match theirs. There’s always the chance, though, that culture will prevail and he’ll be the next Marissa Mayer instead. No pressure there then…

  5. Expect to see GCP moving increasingly into SaaS B2B areas such as ERP, CRM, Financials and industry-vertical applications to complement their commoditised IaaS and PaaS cloud business, leveraging their massive G Suite and increasingly ChromeOS install base. Just think what they could do if they had access to all the world’s structured business transactional data as well as the unstructured search, email and spreadsheet data they have now: even more comprehensive and market-leading search, the world’s biggest and most valuable ML training set, and a business model that could provide these applications for free in exchange for exclusive access to that data and Google being your default search engine. That’s the real existential threat to Oracle; spending all their time trying to win an un-winnable cloud infrastructure war and then GCP coming along and making ERP, CRM and business applications essentially free.

Slides and Forbes.com Article from "Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse Cloud"

One of the sessions we delivered at Oracle Openworld 2018 was “Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse Cloud”, a presentation on the agility that Oracle Autonomous Data Warehouse Cloud (ADWC) can provide for teams within big, traditional businesses that would otherwise have to wait months to provision a physical Oracle data warehousing database. Moreover, ADWC instances can scale up and scale down as demand grows over time or hits short-term peaks such as Black Friday, which in the past would have required companies to pay for and provision that level of capacity all year round even though it sat idle most of the time.

The story behind the presentation was also written-up as a profile article in Forbes magazine by Oracle’s Jeff Erickson, with a great description of how MJR Analytics works on client projects.

To understand the type of agility Rittman is talking about, look at how he works with a client on a data analytics question: “You want to help them get inside their data to see trends and patterns” that they couldn’t otherwise see, he says. That might mean quickly bringing in public data or data from a data-as-a-service company and connecting outside analytics tools. “I work fast,” he says. With a retail customer, for example, “we’ll look to understand the lifetime value of a customer, their cycle of purchases, repeat purchases—all kinds of scenarios. And the whole thing will happen in a day. Much of it collaboratively with the client on a web conference.”
— Jeff Erickson, "How A Big Business Can Use An Autonomous Database To Move Like A Startup"

But don’t just take our word for it, or even Forbes Magazine’s - our objective is for every client engagement to be referenceable where client confidentiality allows, and you can check out our first two testimonials from Colourpop and from Florence on our new Customers page … with a couple more to be added in the next week or so!

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.3

In this series of three blogs on Oracle Analytics Cloud Data Lake Edition I’ve set up an object store data lake in Oracle Cloud using Oracle Big Data Cloud and Oracle Storage Cloud, and ingested streams of real-time event data from IoT and social media sources into Oracle Cloud’s object storage service using Oracle Event Hub Cloud Service.

The event-stream data I staged into Storage Cloud was then copied into Parquet files on HDFS and presented out to BI and ETL tools through Big Data Cloud’s Thrift Server interface. Now I’m ready, after a short diversion into defining the data engineer role that would typically work with this new product edition, to start exploring some of Oracle Analytics Cloud Data Lake Edition’s new data flow and predictive model preparation features.
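As an aside, if you want a feel for what that Thrift Server interface looks like from a client tool’s point of view, the sketch below uses the PyHive library to run a query against it; the hostname, credentials and table name are placeholders for illustration rather than anything from my actual environment.

```python
# A minimal sketch of querying the Hive tables exposed through Big Data Cloud's
# Thrift Server from plain Python; hostname, username and table name below are
# hypothetical placeholders, not my real instance details.
from pyhive import hive

conn = hive.Connection(
    host="bigdatacloud.example.oraclecloud.com",  # hypothetical Thrift Server endpoint
    port=10000,                                   # default HiveServer2/Thrift port
    username="bdc_user",
    database="default",
)

cursor = conn.cursor()
cursor.execute("SELECT * FROM smartthings_events LIMIT 10")  # hypothetical table name
for row in cursor.fetchall():
    print(row)
```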

The diagram below shows where OAC Data Lake Edition fits into my project architecture, performing the tasks of transforming and enriching the incoming dataset and then presenting my at-scale data out to end-users for analysis using OAC Data Lake Edition’s Data Visualization features.

Looking at the homepage within OAC Data Lake Edition I can see my two Hive tables listed within the dataset catalog, alongside other datasets I’d uploaded directly into OAC. This visual catalog of available datasets is also the new homepage interface that OBIEE12c 12.1.0.4.0 now adopts, with both cloud and on-premises versions of Oracle’s BI tools relegating the old “Answers” homepage to something you have to dig around and specifically look for, in favour of this more self-service Data Visualization starting page.

I’ll have to write a blog post sometime on Answers, how powerful its interface is and the full dimensional model it exposes from the Oracle BI Repository, as it’s almost in danger of being forgotten about.

Moving on though, the first transformation I need to do on all the incoming datasets is to take the timestamp column in each table and convert it to a format that OAC recognises as a valid TIMESTAMP datatype format, then convert those columns to TIMESTAMPs so that DV can automatically enable time-series analysis by day, month, quarter, hour and so on. I do that using a feature that’s also present in OAC Standard Edition, the lightweight data preparation interface that’s presented to users when they first add a new data source into OAC’s dataset catalog, shown in the screenshots below.
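For anyone curious what that conversion amounts to outside of OAC’s data preparation screen, here’s a rough pySpark equivalent; the table name, column name and source timestamp format are assumptions for illustration rather than the actual ones in my dataset.

```python
# A rough pySpark equivalent of the timestamp clean-up done in OAC's data
# preparation screen: parse the incoming string column into a proper TIMESTAMP
# so time-series analysis by day, month, hour and so on becomes possible.
from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp

spark = (SparkSession.builder
         .appName("timestamp-prep")
         .enableHiveSupport()
         .getOrCreate())

events = spark.table("smartthings_events")          # hypothetical Hive table
events = events.withColumn(
    "event_ts",
    unix_timestamp("event_time", "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp")  # assumed source format
)
events.select("event_time", "event_ts").show(5)
```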

Where OAC Data Lake Edition gets really interesting right now, both in terms of differences vs. the on-premises versions of OBIEE I used to use and in terms of its “data engineering” potential, is with a feature called Data Flows.


Most self-service BI tools now have a basic data loading and data preparation capability, with Tableau Data Prep being one of the latest examples. Designed to handle more complex data prep use-cases than basic datatype changes and field-splitting, they give end-users the ability to do this type of work themselves rather than trying to do it in Excel or handing the work off to the IT department and having to wait days or weeks to get that data back.

Data Flows are a feature introduced since the original on-premises version of OBIEE12c that I last used when working out in consulting, and provide you with what’s effectively a lightweight, multi-step ETL tool that executes transformations using the BI Server’s Model Extension feature, introduced back when OBIEE12c first came out as the mechanism for enabling on-the-fly data mashups between server-side and user-uploaded datasets.

Looking at the transformation operators available in OAC Data Lake Edition v4 there are quite a few that apply to data lake and data engineering-type workloads, including running Python statistical analysis scripts and predictive model training and build; there’s also an operator for creating an Essbase Cube, with Essbase in this instance positioned as a fast ad-hoc analysis back-end for use with the data visualization part of OAC.

For now though there are two transformation tasks I want to do with my Hive datasets. The first is to enrich the incoming social media data by analyzing the sentiment in each tweet and then writing the data plus this sentiment tagging back to the Oracle Big Data Cloud environment, so that I can then turn those sentiment tags into a score and create a data visualization showing who sends me the most tweets and how crazy they are overall.
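OAC’s sentiment tagging happens inside the Data Flow itself, but to make the idea concrete here’s a sketch of the same kind of enrichment using the open-source VADER analyzer in Python; the column names and example tweets are made up for illustration, and this isn’t what OAC runs under the covers.

```python
# Not OAC's own sentiment step, just a sketch of the same idea using the
# open-source VADER analyzer: tag each tweet with a sentiment label and a
# numeric score that can later be aggregated per sender.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentiment_tag(text):
    """Return a (label, score) pair for a single tweet."""
    compound = analyzer.polarity_scores(text)["compound"]  # ranges from -1.0 to +1.0
    if compound >= 0.05:
        return "positive", compound
    if compound <= -0.05:
        return "negative", compound
    return "neutral", compound

tweets = [  # illustrative payloads, not my actual Twitter data
    {"sender": "@example_user", "text": "Loving the new analytics release!"},
    {"sender": "@another_user", "text": "This upgrade broke everything again."},
]

for tweet in tweets:
    label, score = sentiment_tag(tweet["text"])
    print(tweet["sender"], label, round(score, 2))
```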

The second data enrichment I wanted was on some Strava cycling workout data I’d uploaded directly into OAC using the CSV file upload facility; using the model train and build Data Flow operators I defined a model to predict how many “kudos”, the Strava equivalent to Facebook “likes”, I’d get for a given cycle workout with a number of different variables available to the model in order to make the prediction — for example, distance and elevation gain, map location, effort expended and so on.

Then, after running the model build step and comparing the predicted values with the actual ones for the remainder of the dataset not used for model training, you can see the predicted kudos values are fairly in line with the ones I actually recorded for those rides.
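The model train and apply steps are all point-and-click operators within the Data Flow, but conceptually they’re doing something like the scikit-learn sketch below; the Strava export filename, feature columns and choice of a simple linear regression are my own assumptions for illustration, not what OAC uses internally.

```python
# A rough offline equivalent of the kudos-prediction Data Flow: train a model
# on part of the ride history, then compare predicted and actual kudos on the
# held-back rides. Filename and column names are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rides = pd.read_csv("strava_activities.csv")  # hypothetical Strava export
features = ["distance_km", "elevation_gain_m", "moving_time_mins", "average_watts"]

X_train, X_test, y_train, y_test = train_test_split(
    rides[features], rides["kudos_count"], test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Side-by-side view of actual vs. predicted kudos for the held-back rides
comparison = X_test.assign(
    actual_kudos=y_test,
    predicted_kudos=model.predict(X_test).round(),
)
print(comparison.head())
```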


Another feature that’s now in Oracle Analytics Cloud is automated data diagnostics, or Explain. Explain uses machine-learning libraries and that same model extension/XSA BI Server framework to help users quickly understand the value distribution and statistically correlated driving factors for a particular dataset, and learn which segments or cohorts have the highest predictive significance. Enabled by a set of extensions to BI Server logical SQL, I used the feature first on the sentiment scoring I’d performed earlier on, and then on the steps data I’d brought into Oracle Big Data Cloud from my Fitbit device, after converting the numeric step counts into a text attribute by bucketing its values into low, medium and extreme bands.
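The bucketing itself is a simple transformation; here’s a pandas sketch of the same idea, with the low/medium/extreme thresholds being arbitrary values I’ve picked for illustration rather than anything OAC prescribes.

```python
# Convert a numeric step-count measure into a text attribute by bucketing it,
# so that a feature like Explain can treat it as a categorical target.
# Thresholds and sample data are illustrative.
import pandas as pd

steps = pd.DataFrame({
    "activity_date": pd.date_range("2018-04-01", periods=5),
    "step_count": [3200, 8500, 15400, 22100, 6100],
})

steps["step_band"] = pd.cut(
    steps["step_count"],
    bins=[0, 5000, 12000, float("inf")],   # arbitrary band boundaries
    labels=["low", "medium", "extreme"],
)
print(steps)
```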

This is pretty powerful stuff, with automated understanding and context-gaining about new datasets being one of the most user-enabling features I’ve seen arrive recently in BI tools, the best example of this being BeyondCore, now part of Salesforce Einstein. OAC lets the user pick the most useful of the Explain facts and driver insights and publish them to a Data Visualization dashboard like the one below, showing the most predictive and significant variables in my dataset that influence the steps I take each day.

Which leads neatly to the final “data at-scale” feature in OAC, the Data Visualization feature that in my case is querying the ingested, transformed and now enriched datasets I’ve got running on my Oracle Big Data Cloud instance alongside Oracle Event Hub Cloud and Oracle Analytics Cloud Data Lake Edition.


Thank you once again to the Oracle ACE Director program for providing access to Oracle Analytics Cloud Data Lake Edition, Oracle Big Data Cloud and Oracle Event Hub Cloud services over the past few weeks. If you’re looking to try these new services out there are free trials available for most of Oracle’s Cloud products, and many of the new features are also available in Oracle Data Visualization Desktop 12c and Oracle Business Intelligence 12c, both of which can be downloaded for training and evaluation under the OTN license scheme.


Wrapping-up this three part series on Oracle Analytics Cloud Data Lake Edition and Oracle Big Data Cloud I’d like to go back to the two (serious) questions I asked myself at the end of the previous post:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product, Oracle Big Data Discovery - officially still available and not quite dead yet, but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask, and he’s actually gone to Cloudera?

So has Oracle Analytics Cloud Data Lake Edition got much to do with actual “data lakes”? Well … it integrates with Oracle Big Data Cloud and apparently comes with an option to run those data flow transformations in Big Data Cloud’s Apache Spark environment, though to be fully transparent I didn’t see that as an option when doing my evaluation so can’t comment on how well or not that works.

Like Oracle Big Data Discovery before it, OAC Data Lake Edition makes you structure your incoming event stream data with Hive table metadata before you can work with it, but that’s actually fairly standard practice with most data visualization tools that work with Hadoop and data lake environments.

Having Essbase in this product package, alongside the data lake functionality, did make me scratch my head a bit and wonder, “why?” - data lakes and Essbase are about as opposite as you can get in terms of target users and use-cases, and I think this Data Lake Edition is as much about creating a product package and price point mid-way between the OAC Standard and Enterprise Editions as anything else.

But there is some logic to having Essbase in this edition; it provides a set of easy-to-use loading and preparation tools for Essbase, making it easier for customers new to that product to start using it, and Essbase with its latest hybrid ASO/BSO storage format is surprisingly scalable and blindingly fast to query, making it a great potential back-end for enabling data analysis “at-scale” using Oracle Analytics Cloud’s data visualization features.

I also get the feeling that this initial v4 release of OAC Data Lake Edition is more of a first cut to get something out to customers, establish the product package and validate the roadmap and market assumptions. Oracle Analytics Cloud v5 isn’t too far off and I’d expect incremental improvements and new features in areas such as natural language processing and machine learning built into the developer experience; I wouldn’t be surprised to see Oracle Big Data Preparation Cloud making its way into the product given its obvious fit and overlap with Data Lake Edition’s data prep features.

But where I really see an interesting future for OAC Data Lake Edition is when it starts to integrate product features and the development team from Oracle’s recent acquisition of Sparkline Data.

I came across SNAP, Sparkline‘s platform for building OLAP-style dimensional models over data lakes and cloud object storage layers, about a year ago when researching analytics platforms at Qubit and quite frankly, it’s as revolutionary in terms of today’s data lake analytics market as OBIEE (or the nQuire Server) was back in 1995 with its virtual data warehouse over application and data warehouse sources.

Take these two slides from the Sparkline website and imagine them as the future of Oracle Analytics Cloud analyzing event-streams and big data cloud-hosted datasets…

and you can see why I’m keen to see where Oracle Analytics Cloud Data Lake Edition goes over the next couple of years. I’m speaking on OAC Data Lake Edition at ODTUG KScope’18 in Orlando in just a couple of weeks’ time, so come along if you’re there - it should be an interesting talk.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.2

A couple of weeks ago I posted a blog on Oracle Analytics Cloud Data Lake Edition and the Oracle Cloud big data products you’d typically use to create a cloud-hosted, object storage data lake. In this and the final post in the series I’m going to look at some of the data transformation and predictive model building features that come with the Data Lake Edition of OAC and extend the base capabilities of OAC Standard Edition in this area, as well as the data engineering role this Oracle Analytics Cloud product packaging option is designed for.

As a reminder, OAC Standard Edition is more-or-less Oracle Data Visualization running as a single-tenant cloud-hosted VM (cluster) that’s either customer-managed, or Oracle-managed if you go for the new Autonomous Analytics Cloud, which interestingly is charged at a lower hourly rate than the customer-managed version; presumably it costs Oracle less to run OAC when it’s doing all the administration rather than letting customers control the settings, which makes sense.

It does make you wonder, though, when all this VM nonsense will go away, along with the distinction between OAC and the DVCS/BICS managed-service cloud analytics products, with all of these services just becoming packaging options for a multitenant service partitioned by customer instance, as the likes of Looker and Google Cloud Platform do today.

But for now, Oracle Analytics Cloud Data Lake Edition is described on the product homepage as having the following high-level capabilities compared to OAC Standard Edition:

  • Accelerated analysis via Essbase
  • Support for creating scalable, shareable, high-performance data flows

whilst the product documentation states that it includes all the OAC Standard Edition features, plus:

  • Explore and replicate data
  • Create scalable data flows including ingestion, transformation, and persistence
  • Execute data flows by using Apache Spark or Oracle Database Cloud

with two edition-specific scenarios listed:

  • Explore Data Lakes, described as “You can create data flows to visualize data in your data lakes”, and
  • “Replicate Data in Data Lakes”, explained as “You can replicate data from Oracle Service Cloud (RightNow) into Oracle Big Data Cloud Service, Oracle Database Cloud, and Oracle Database. For example, you can replicate data from customer service records in Oracle Service Cloud into your data lake on Oracle Big Data Cloud Service and then perform further data preparation and analysis with data flows and data visualization projects.”

I’m guessing most readers won’t have heard of Oracle Service Cloud, but for Oracle Cloud customers it’s quite a useful initial integration, as Service Cloud is Oracle’s cross-channel digital customer call centre application that brings together customer service interactions across web, social media and offline service channels. More interestingly though, and you may have missed it back at the start, Data Lake Edition is the edition that ships Essbase, Oracle’s multi-dimensional OLAP server, in what almost seems like an afterthought right now but will become, I suspect, a more significant defining feature of this product package as time goes on.

For the purposes of this series of blog posts though, what I’m really interested in are the data engineering features in OAC Data Lake Edition — but before I start looking into what they might be, I guess I’d better explain what I mean by a data engineer.


I first came across the term data engineer in a blog written by Curt Monash — still the #1 person I’d love to have come on the Drill to Detail Podcast, if you happen to read this Curt — where he wrote:

“I learned some newish terms on my recent trip. They’re meant to solve the problem that “data scientists” used to be folks with remarkably broad skill sets, few of whom actually existed in ideal form. So instead now it is increasingly said that:

- “Data engineers” can code, run clusters, and so on, in support of what’s always been called “data science”. Their knowledge of the math of machine learning/predictive modeling and so on may, however, be limited.

- “Data scientists” can write and run scripts on single nodes; anything more on the engineering side might strain them. But they have no-apologies skills in the areas of modeling/machine learning.”

I think it’s probably true to say that the term really started to enter people’s consciousness, at least the sorts of people I hang around with, when Maxime Beauchemin from Airbnb wrote his seminal blog “The Rise of the Data Engineer” where he introduced this new discipline as:

“Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization.

Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.

In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.

In smaller companies — where no data infrastructure team has yet been formalized — the data engineering role may also cover the workload around setting up and operating the organization’s data infrastructure. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like.”

Maxime’s blog struck a chord with me as that’s pretty much how I’d describe what I do, or at least it’s the work I most enjoy doing: building big data systems to support BI and data warehousing workloads, where my knowledge of how Hadoop, cloud and distributed data platforms work complements the knowledge that other colleagues have around data science, machine learning and statistical analysis. Importantly, I know enough about those areas to be able to deploy the models they build (or as we say over in the UK at least, I know enough about them to be dangerous).

I interviewed Maxime about his blog and the topic of data engineering on the Drill to Detail Podcast last May and if you’ve not done so already I’d recommend you give the episode a listen; for me though, having just spent the last eighteen months working in the London tech startup scene, my definition of a data engineer would be someone who’s just out of university and has an ironic beard, thinks he’s made ETL tools obsolete but is slowly, painfully, recreating one and is now just on the brink of discovering code templating, deployment contexts, change control and data lineage…

… but earns twice what you earn and gets all the girls. Because, data engineers are even more essential than data scientists if you’re serious about building a data lake, and data lakes are the new data warehouses.


Which brings us back to Oracle Analytics Cloud Data Lake Edition and these three blog posts. The first part of this three-part series talked about Oracle Big Data Cloud and products such as Oracle Event Hub Cloud — Dedicated (who names these products at Oracle? Seriously?), and in this second post I seem to have got rather diverted by trying to define what Oracle Analytics Cloud Data Lake Edition actually is, and then even more so by defining the supporting role of a data engineer. In the final post tomorrow I’ll take a look at some of the data transformation, model definition and deployment and data engineering features in OAC Data Lake Edition and then, finally, conclude by asking myself two questions — OK, three:

  1. Has OAC Data Lake Edition got anything actually to do with data lakes, and is it a useful tool for aspiring Oracle technology data engineers?
  2. How does it compare to my old favourite Oracle big data product, Oracle Big Data Discovery - officially still available and not quite dead yet, but existing in some strange zone where the on-premises version stopped getting updates a while ago and the cloud version is for sale but you can’t buy it unless you know the right person to ask, and he’s actually gone to Cloudera?
  3. Will the ironic beard and quiff be enough, or do I need to go full ginger if I’m going to really be taken seriously as a data engineer?

Final part in the series on Oracle Analytics Cloud Data Lake Edition comes later this week.

Oracle Big Data Cloud, Event Hub and Analytics Cloud Data Lake Edition pt.1

Some time ago I posted a blog on what analytics and big data development looked like on Google Cloud Platform using Google BigQuery as my data store and Looker as the BI tool, with data sourced from social media, wearable and IoT data sources routed through a Fluentd server running on Google Compute Engine. Overall, the project architecture looked like the diagram below…

… and I’ve got more-or-less the same setup running right now, with an additional GCE VM running Confluent Open Source to feed a subset of the event streams into a Druid cluster that I’m using to test out Looker, Superset and Imply for sub-second ad-hoc query analysis use-cases. More on that soon.

If you’re a regular reader of this blog you might recall a week or so ago I posted a blog on the new releases of Oracle Business Intelligence Enterprise Edition (OBIEE), Oracle Analytics Cloud (OAC) and Oracle Data Visualization Desktop (DVD) and mentioned a new packaging option for their cloud analytics product, Oracle Analytics Cloud Data Lake Edition. I’ve got a particular interest in what this product might be as I used to use the product it replaces, Oracle Big Data Discovery (BDD), fairly extensively in Oracle big data analytics and data lake projects a few years ago.

And Oracle Big Data Discovery was — technically at least — a great product. It combined the search and analytics features of Endeca Information Discovery with the scale and data transformation abilities enabled by Apache Spark and Hadoop, but suffered perhaps by being a bit ahead of the market and by not having any obvious integration with the rest of Oracle’s analytics and data management tools. By contrast Oracle Analytics Cloud Data Lake Edition is one of three packaging options for Oracle Analytics Cloud and includes all of the functionality of OAC Standard Edition (Oracle Data Visualization together with basic data preparation tools) as well as itself being a subset of the wider set of analysis, dashboarding and enterprise data modeling features in OAC Enterprise Edition.

An equivalent product architecture for ingesting, transforming and analyzing my IoT, wearables and social media data in Oracle Cloud would look something like the diagram below, with the following Oracle Cloud Platform-as-a-Service (PaaS) products used for ingest, storage and analysis:

  • Oracle Event Hub Cloud Service: Apache Kafka running either customer-managed or Oracle-managed, with full access to Kafka’s REST Proxy and Kafka Connect
  • Oracle Big Data Cloud: Oracle’s new elastically-scalable Hadoop platform running Apache Spark, Ambari and other Hortonworks Data Platform components
  • Oracle Analytics Cloud Data Lake Edition: Oracle Data Visualization combined with more extensive lightweight ETL (“data flow”) components, text analytics and machine learning model training and build capabilities

In this example I’m using Apache Hive and Parquet storage as my column-orientated data platform, but of course I’ve now also got Oracle Autonomous Data Warehouse Cloud as an option; I’ll stick with Hive on Oracle Big Data Cloud for now though, as this gives me the option to use Apache Spark to transform and wrangle my data and to build machine learning models using SparkML and, via pySpark, Python Pandas. In what’s the first of two posts in this short series I’ll be looking at how the data pipeline is set up, and then in the second post I’ll look at Oracle Analytics Cloud Data Lake Edition in detail, focusing on the data transformation, data engineering and data science features it adds beyond OAC Standard Edition.
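To give a flavour of the sort of Spark work I have in mind here, the sketch below wrangles some raw event data, persists it as a Parquet-backed Hive table and pulls a sample back into Pandas for local prototyping; the landing path, column names and table name are placeholders rather than my actual setup.

```python
# A minimal sketch of the wrangling Spark does for me on Big Data Cloud:
# clean the raw JSON events, persist them as a Parquet-backed Hive table,
# then hand a sample over to pandas. Paths and names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("event-wrangling")
         .enableHiveSupport()
         .getOrCreate())

raw_events = spark.read.json("hdfs:///user/oracle/events/smartthings/")  # assumed landing path
cleaned = (raw_events
           .dropDuplicates(["event_id"])            # assumed key column
           .filter("device_name is not null"))

(cleaned.write
        .mode("overwrite")
        .format("parquet")
        .saveAsTable("iot_events"))                 # hypothetical Hive table name

# Pull a manageable sample back as a pandas DataFrame for model prototyping
sample_df = spark.table("iot_events").limit(10000).toPandas()
print(sample_df.describe())
```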


The development environment I put together for this scenario used the following Oracle Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) components:

  • Oracle Compute Classic and Storage Classic Services
  • Oracle Database Cloud Service, with the 11g database option
  • Oracle Event Hub Cloud Service Dedicated, with Kafka Connect and REST Proxy nodes
  • Oracle Big Data Cloud, single node with Hive, Spark 2.1, Tez, HDFS, Zookeeper, Zeppelin, Pig and Ambari
  • Oracle Analytics Cloud Data Lake Edition with Self Service Data Preparation, Visualisation and Smart Discovery (aka Oracle DV)

The screenshot below from the Compute Classic Service Console shows the various PaaS VMs running on the compute IaaS layer, with other tabs in this console showing network, storage and other infrastructure service usage.

The order in which you install the services is important if you want to associate the various products together correctly, and if like me you’re using a trial account you’ll need to plan carefully to stay within the various quota limits that Oracle Cloud imposes on trial identity domains - limits that stopped me, for example, from allocating the usual two OCPUs to the main Oracle Event Hub Kafka server if I wanted to run the rest of the stack at the same time.

Associating two services together, for example Oracle Event Hub and Oracle Big Data Cloud, connects the two services automatically in the identity domain network and makes using them together much simpler. However, unless Event Hub Cloud is provisioned and available when you come to install Big Data Cloud you can’t go back and associate them afterwards. More annoyingly, if you decide you want to use Event Hub Cloud and associate it with Big Data Cloud, but in the meantime you’ve provisioned Oracle Analytics Cloud Data Lake Edition and associated Big Data Cloud with that, you have to unwind the whole provisioning process and start again with Event Hub Cloud if you want to connect them all together properly. And forget deleting that Database Cloud Service you associated with Oracle Big Data Cloud and then forgot about, as you can’t delete services that other services are associated with.

Provisioning each of the services involves giving the service instance a name, assigning storage buckets and OCPU counts to the various cluster nodes you request, and at key points selecting previously provisioned and now running services for association with the one you’re now provisioning. The screenshots below show the three-stage provisioning process for Event Hub Cloud Service — Dedicated:

Provisioning a new Event Cloud Service — Dedicated cluster

and the order in which I provisioned my data lake services, and the important options I chose to make it all work together and within quota, were as follows:

  1. First ensure you have access to Oracle Cloud, and then the various IaaS services: Oracle Cloud Infrastructure Compute Classic, Oracle Cloud Infrastructure Object Storage Classic and Oracle Identity Cloud Service
  2. Provision Oracle Database Cloud Service to store the various Fusion Middleware RCU schemas; in my case I chose Oracle Database 11gR2 as the database type as it avoids the complexity around CDBs and PDBs you get with the 12c database release
  3. Then provision Oracle Event Hub Cloud — Dedicated with one OCPU for the main Kafka VM, one for the REST Proxy VM and one more for the Kafka Connect VM. For a real deployment you’d want at least two OCPUs for the Kafka service VM, but using just the one kept me within my overall OCPU quota limit when installing the rest of the stack
  4. Next, provision Big Data Cloud with a single node with the minimum 2 OCPUs, the Full deployment profile and version 2.1 of Apache Spark, as that’s the version OAC Data Lake Edition insists on in the next step. When prompted, choose the option to associate Event Hub Cloud and Database Cloud with Big Data Cloud as you won’t get the option to do this again after the initial service provision; once provisioned, open up the TCP port for Ambari (8080) to the public internet so that OAC in the next step can associate with it — provisioning for OAC failed for me every time until I looked through the provisioning logs and spotted this as the issue
  5. Finally, provision Oracle Analytics Cloud and choose Data Lake Edition as the package option, again in my case assigning a single OCPU and selecting Data Lake Edition as the software version

At that point if you then bring up the Cloud Services dashboard and review the services together for the first time, it’ll look something like this:

Oracle Data Lake stack within Cloud Services Dashboard

Now it’s time to ingest some data and land it into Oracle Big Data Cloud.


The streaming IoT, wearables and social media comms data that I’ll be ingesting into Big Data Cloud will be coming in from the public internet over TCP, and I’ll also want to connect to Event Hub Cloud from my desktop using tools such as Kafka Tool, so an additional configuration step I’ll do before setting up anything else is to open up Event Hub Cloud’s Kafka broker endpoint to the public internet using the Access Rules menu item in the Event Hub Cloud console.

Now I can see the Kafka service and the default topics that Event Hub Service created for me in Kafka Tool.

I can then either use Kafka Tool to create a new topic to start receiving the first of my data streams, the IoT device event data coming out of Samsung SmartThings, or create the topic by defining a new Event Hub Cloud service from within the Event Hub Cloud Service — Dedicated console (confusing, but that’s how Kafka topics are named within Event Hub Cloud).
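If you’d rather script it than click through Kafka Tool or the Event Hub console, something like the kafka-python sketch below would create the topic against the public broker endpoint; the endpoint address and topic name are placeholders for whatever your own Event Hub instance exposes.

```python
# Create the topic programmatically against the publicly exposed broker
# endpoint; the endpoint address and topic name are hypothetical placeholders.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers="eventhub.example.oraclecloud.com:6667"  # hypothetical broker endpoint
)
admin.create_topics([
    NewTopic(name="smartthings-events", num_partitions=1, replication_factor=1)
])
```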

Then it’s just a case of directing the stream of IoT event data to the public Kafka broker endpoint exposed by Event Hub Cloud Service — Dedicated and then, after a short while, checking the metrics for the new Kafka topic that I set up to receive this incoming streaming data.
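From the producer’s side, directing that stream amounts to something like the sketch below; in my case SmartThings was doing the sending for me, and the endpoint, topic name and payload here are purely illustrative.

```python
# A sketch of pushing JSON events to the topic over the public broker
# endpoint; endpoint, topic name and payload are illustrative placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="eventhub.example.oraclecloud.com:6667",  # hypothetical endpoint
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("smartthings-events", {
    "device_name": "hallway-motion-sensor",
    "event_type": "motion",
    "event_time": "2018-05-01T07:45:00",
})
producer.flush()
```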

Getting the data off the Kafka topic and into a Hive table on the Big Data Cloud instance involved the following steps, using Oracle Cloud Infrastructure Object Storage Classic as the intermediate staging layer together with Event Hub Kafka Connect’s OCS Sink Connector (there’s a rough sketch of steps 2 and 3 after the list):

  1. Configure Event Hub Kafka Connect OCS Sink Connector to push topic events to Oracle Cloud Infrastructure Object Storage Classic (OCS)
  2. Using the Zeppelin notebook provided by Big Data Cloud Console, create a cron job that copies those events across to HDFS storage
  3. Create Hive external tables with location clauses that point to the directories I’ve copied the event files into from OCS
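As mentioned above, here’s roughly what steps 2 and 3 look like from a Zeppelin notebook on the Big Data Cloud instance; the OCS container URI, HDFS paths and column list are placeholders for my own instance’s values, and the cron scheduling itself isn’t shown.

```python
# A sketch of steps 2 and 3: copy the OCS-staged event files into HDFS, then
# declare a Hive external table over the landed directory. Container URI,
# paths and columns are assumptions, not my actual configuration.
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()  # Zeppelin normally provides this for you

# Step 2: pull the staged files from object storage into HDFS (run on a schedule)
subprocess.run(
    ["hadoop", "distcp",
     "swift://eventdata.default/smartthings/",      # assumed OCS container URI
     "hdfs:///user/oracle/landing/smartthings/"],
    check=True,
)

# Step 3: expose the landed files as a Hive external table for OAC to query
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS smartthings_events (
        event_id    STRING,
        device_name STRING,
        event_type  STRING,
        event_time  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///user/oracle/landing/smartthings/'
""")
```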

Then, when I go and log into OAC Data Lake Edition and connect to the Hive Thrift Server on the Big Data Cloud instance I can see the Hive tables I’ve just created, and the data that’s now streaming through from my two initial sources via the Kafka service and Kafka Connect running on Event Hub Cloud Service — Dedicated.

In the second half of this two-post series I’ll go deeper into OAC Data Lake Edition and see how its additional transformation and analysis capabilities stack up against OAC Standard Edition, and also see how it compares to the Oracle Big Data Discovery tool it’s looking to eventually replace.