Analyzing the Hacker News Public Dataset using Firebolt Data Warehouse and Looker
Something I’ve been meaning to write about for a while now is Firebolt, a relatively new startup that has been busy raising funds and hiring superstars with the aim of building a cloud data warehouse that’s faster, more scalable and cheaper to run than Snowflake.
We’ve been working with Firebolt’s technology and implementation team over the past six months, implementing Firebolt Data Warehouse for one of our clients. Whilst for confidentiality reasons I can’t talk about that particular project, it has given me a fairly detailed understanding of how Firebolt works and which types of use-case and query workload make sense for it.
In full disclosure we’re now a Firebolt Partner, but we’re also Snowflake, Google Cloud Platform and Oracle partners, as well as regular users of AWS Redshift, AWS Athena, Clickhouse, Postgres and many other database technologies. Firebolt is an interesting technology that aims to be both faster and cheaper than Snowflake and Google BigQuery, so how does it work and how does it stack up against its more established rivals?
Firebolt is a new data warehousing technology that promises sub-second response times on very large datasets with highly-concurrent workloads — think of it as “Snowflake with Indexes”, or more succinctly — “speed of Clickhouse meets Snowflake architecture”.
Firebolt’s underlying technology platform has its roots in Clickhouse, an open-source high-performance OLAP database system originally developed at Yandex (and now maintained by Clickhouse, Inc.) to generate real-time analytical reports from non-aggregated data that is constantly being added to.
Clickhouse’s vectorized column-store technology could scale up and run distributed queries on multiple nodes, but required the user to set up and manage the infrastructure it all ran on. Firebolt hard-forked the open-source Clickhouse code and re-engineered it to be serverless, decoupling storage and compute and adding metadata, query planning, service and orchestration layers, as shown in the diagram below from Firebolt’s technology whitepaper.
What really makes Firebolt fast, though, is the choices it has made around storage technology. Cloud data warehouses such as Snowflake and BigQuery made it possible to store and query effectively limitless amounts of structured and semi-structured data, but did so using the storage technologies available at the time; these provided scale and flexibility, but with response times limited by network bandwidth and latency.
Firebolt’s F3 (“Firebolt File Format”) reintroduces our old friend the “sparse index”, a concept that will be familiar to readers of this blog who know their multi-dimensional OLAP technologies. Sparse indexes make it possible to access data in the storage layer in a much more granular way than with partitions or micro-partitions, enabling more processing in-memory and less movement of data across slow networks, and therefore, in theory at least, lower-latency query response times for end-users.
So how does Firebolt work in-practice? Let’s start by taking the Hacker News public dataset kindly hosted by Google that contains all stories and comments since it launched back in 2006, load it into Firebolt and see how it all works.
The Hacker News public dataset comes in the form of three tables: “stories”, “comments” and “full”, with the first two containing just stories and comments over a limited period of time and “full” also containing jobs and other postings, for a total of 31m rows and 11GB of storage. I’ve already exported the contents of those tables into an AWS S3 bucket as uncompressed Parquet files, one directory per exported table, ready for loading into Firebolt.
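For reference, one way to produce that kind of export is BigQuery’s EXPORT DATA statement, which writes natively to Google Cloud Storage rather than S3; the GCS bucket name below is hypothetical, and the resulting files would then need copying across to the S3 bucket (for example with gsutil and the AWS CLI):

EXPORT DATA OPTIONS (
    uri = 'gs://ra-hacker-news-export/full/*.parquet',  -- hypothetical GCS bucket
    format = 'PARQUET'
) AS
SELECT *
FROM `bigquery-public-data.hacker_news.full`;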
In a process very similar to staging and loading data into Snowflake, you first create an external table that maps onto the files sitting in your S3 bucket, like this:
CREATE EXTERNAL TABLE HN_FULL (
    title STRING,
    url STRING,
    "text" STRING,
    dead BOOLEAN,
    "by" STRING,
    score INT,
    "time" INT,
    "timestamp" TIMESTAMP,
    type STRING,
    id INT,
    parent INT,
    descendants INT,
    ranking INT,
    deleted BOOLEAN
)
URL = 's3://ra-hacker-news/full/'
CREDENTIALS = (AWS_KEY_ID = '<aws_key_here>' AWS_SECRET_KEY = '<aws_secret_key_here>')
OBJECT_PATTERN = '*.parquet'
TYPE = (PARQUET);
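At this point you can, in principle, sanity-check the external table with a quick query before going any further; note that this is just a sketch, and that queries against external tables read directly from S3 so are comparatively slow and really only intended for ingestion:

SELECT count(*) AS row_count
FROM HN_FULL;  -- should return roughly 31m rows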
Then you create the table into which you’ll load this externally-staged “raw” data, with this table then storing data in Firebolt’s F3 storage format optimized for low-latency, super-fast queries.
CREATE FACT TABLE HACKER_NEWS_ALL (
    title STRING,
    url STRING,
    "text" STRING,
    dead BOOLEAN,
    author STRING,
    score INT,
    "time" INT,
    "timestamp" TIMESTAMP,
    month TIMESTAMP,
    type STRING,
    id INT,
    parent INT,
    descendants INT,
    ranking INT,
    deleted BOOLEAN
)
PRIMARY INDEX type, month, author;
Note two things about this DDL statement: the addition of “FACT” to the CREATE TABLE command, and the PRIMARY INDEX clause at the end.
Firebolt has three types of table you can define:
External tables that map onto external data stages and use whatever storage format (parquet, CSV, TSV, JSON) those files are stored in
Fact tables that are sharded across engine nodes, with each node storing part of the table. You use fact tables for your largest and most granular (transaction) tables, such as the full Hacker News dataset table export we’re working with here
Dimension tables are replicated in each engine node and are designed for smaller data sets that are typically more descriptive of a dimension in the fact table, and are frequently joined with fact tables. When performing joins, the local shard of a fact table on each node is joined with the local copy of the dimension table.
We won’t create a dimension table in this example, although we use them extensively in the client project I mentioned earlier in the post; a sketch of what one might look like is below.
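For illustration only, here is a minimal sketch of a dimension table DDL; the HN_USERS table and its columns are hypothetical and aren’t part of the Hacker News dataset:

CREATE DIMENSION TABLE HN_USERS (
    author STRING,          -- join key to the fact table's author column
    signup_date TIMESTAMP,
    karma INT
)
PRIMARY INDEX author;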
In terms of indexes, Firebolt has three types of index you can create to improve query performance:
Primary Indexes, mandatory for fact tables and optional for dimension tables, which physically sort and co-locate data to best align with how it’s queried, similar to clustering keys and clustered tables in Snowflake and BigQuery; the innovation here is Firebolt’s use of sparse indexes, which enable a much smaller and highly-compressed range of data to be read from F3 into the engine cache at query runtime, producing query results faster due to the reduction in disk I/O
Aggregating Indexes, used for pre-calculating and storing the results of aggregate functions such as sum, count distinct and average. Think “materialized view” but using sparse indexes and other Firebolt technology designed to leverage the F3 storage format
Join Indexes, similar in purpose to the bitmap join indexes I used to use with Oracle data warehouses, designed to pre-calculate and store in-memory the results of expensive joins between fact and dimension tables.
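As a rough sketch only, reusing the hypothetical HN_USERS dimension table from above (Firebolt’s exact syntax and options may differ, so check the current documentation), a join index DDL might look something like this:

CREATE JOIN INDEX HN_USERS_JOIN_IDX ON HN_USERS (
    author,       -- the column used to join to the fact table
    signup_date,  -- dimension columns that queries read
    karma
);

With an index like this in place, queries joining HACKER_NEWS_ALL to HN_USERS on author could read the pre-built in-memory join structure instead of performing the join at runtime.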
I’ll now create an aggregating index to go with my fact table, and then finally load data from the external table into my fact table ready for end-users to start querying my data.
CREATE AND GENERATE AGGREGATING INDEX HACKER_NEWS_ALL_AGG_IDX ON HACKER_NEWS_ALL (
    -- Add as many columns as required for the index
    -- Choose the columns that will be used in your GROUP BY statements
    month,
    title,
    author,
    type,
    sum(ranking),
    avg(ranking),
    count(distinct id),
    count(distinct author),
    avg(descendants)
);

INSERT INTO HACKER_NEWS_ALL
SELECT
    title,
    url,
    "text",
    dead,
    "by" AS author,
    score,
    "time",
    "timestamp",
    date_trunc('month', "timestamp") AS month,
    type,
    id,
    parent,
    descendants,
    ranking,
    deleted
FROM HN_FULL;
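Once the aggregating index has been generated, Firebolt’s query optimizer can, as I understand it, answer matching aggregate queries from the index rather than scanning the whole fact table. A hypothetical query along these lines, whose grouping columns and aggregates are all covered by the index definition above, would in principle be a candidate:

SELECT
    month,
    type,
    sum(ranking) AS total_ranking,
    count(distinct id) AS postings
FROM HACKER_NEWS_ALL
GROUP BY month, type;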
After all the data has been loaded into the Firebolt fact table, setting up Looker to work with Firebolt is pretty straightforward and pretty much the same process as when working with Snowflake, Redshift and BigQuery, with a few exceptions at this point in time (for example, no support yet for symmetric aggregates); see the Looker docs for the latest feature support details.
Once my LookML views, model and explore are up and running, it’s then just a case of creating a simple dashboard to see how fast it all runs.
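To give an idea of the SQL Looker then generates against the fact table, a simplified, hypothetical tile query might look like the one below; note how the WHERE and GROUP BY clauses line up with the type and month leading primary index columns, which is what lets the sparse index prune the amount of data being read:

SELECT
    month,
    count(distinct id) AS stories_count
FROM HACKER_NEWS_ALL
WHERE type = 'story'
GROUP BY month
ORDER BY month DESC;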
And how fast did it all run? Queries generally took a second or two, the dashboard felt responsive, and no tile showed any significant delay in returning data.
As a comparison, I created the same dashboard, LookML views, model and explore running against the Google BigQuery dataset I’d originally exported the data from, used each dashboard to run the same set of filter actions, dimension member selections and interactions, and then built a third dashboard using the Looker System Activity explore to show the response times of the two dashboards. I’ve made the Looker repo containing these dashboards and the Firebolt content publicly available on our Github account.
So Firebolt was roughly 2x to 3x faster than BigQuery at returning data in this very basic and rather unscientific comparison; halving (or better) the response time of even this simple dashboard, with a relatively small dataset and a single user, isn’t something to be sniffed at.
But where Firebolt really performs, as we’ve seen on the client project we’re currently implementing it for, is with datasets of terabyte and petabyte scale with lots of concurrent users and schemas containing multiple fact and dimension tables joined together as a dimensional model, or as a single fact table along with aggregating indexes designed to provide instant insights for embedded analytics within SaaS applications.
For more insights on how we’ve used Firebolt keep an eye on this blog for further articles, and if you read this in time we’re also speaking at the Budapest dbt Meetup in a couple of weeks’ time on using Firebolt with dbt; the event should be recorded and available online later, and we’ll post details on how to watch after the event.
Interested? Find out More
Rittman Analytics is a boutique analytics consultancy specializing in the modern data stack that can help you get started with Firebolt and other cloud data warehouse platforms, centralise your data sources and enable your end-users and data team with best practices and a modern analytics workflow.
If you’re looking for some help and assistance trialling and implementing Firebolt and building out your analytics capabilities on a modern, flexible and modular data stack, contact us now to organize a 100%-free, no-obligation call — we’d love to hear from you!