Building up a semantic layer with dbt Metrics, Cube and Droughty
I want to live in a world where data platform technologies can coexist together. And if they won’t do it willingly, I’m happy to force their hand a little :) That’s the situation I’m in when it comes to the semantic layer where I want dbt Labs and Cube to just be the BFF they’re meant to be.
The analytics engineering world has been abuzz with the semantic layer in the past year. Beyond the hype, what is its purpose and how do you implement it? Let’s go through a real-world example using the discursus project as a playground.
Gartner defines semantic data models as:
A method of organizing data that reflects the basic meaning of data items and the relationships among them.
(source: Gartner glossary)
The components of a semantic layer are:
Domain-specific entities, such as users and transactions
Relationships between those entities
Entity attributes, that can loosely be classified as either keys, dimensions, and measures/metrics
Descriptions of the entities and their attributes
With those components in place, an end user is able to interact with entities and formulate domain-specific questions that return meaningful answers.
Let’s work this all out through an example.
Architecture
Building up a semantic layer requires that we start architecting those components into our data platforms and having a mechanism to surface them to end users.
Going back to the discursus project, we have 3 key components in our architecture that play a role in serving the semantic layer:
The discursus architecture and its components supporting the semantic layer
dbt: to define entities, attributes, relationships and metrics
Droughty: to generate semantic definitions
Cube: to expose the semantic layer’s components
Defining your semantic layer
At the heart of our semantic layer is dbt. This is where we transform our raw data into a semantic structure that is a reflection of our domain: entities, attributes and relationships.
The discursus domain
The discursus entities, attributes and relationships
dbt is recognizing its role as the driving force in defining data’s semantic meaning. That meaning becomes in itself the API for other tools to interact with the data stored in a warehouse.
dbt started unveiling its semantic strategy with the metric concept. As this is usually the main thing we interact with downstream (e.g. revenue, active users, inventory, etc.), it makes sense for dbt to provide a syntax to define those.
A very simple example using the discursus domain is the following:
version: 2
metrics:
- name: event_count
label: Events count
model: ref('events_fct')
description: "The count of events"
calculation_method: count
expression: event_pk
timestamp: event_date
time_grains: [day, week, month, quarter, year]
dimensions:
- movement_fk
The above example simply tells downstream apps that our data domain includes a metric named event_count which is the count of events by daily, weekly, etc. time grains and which can be sliced by the movement_fk dimension.
That definition sits on top of a model named events_fct. And so our DAG now includes a new metric node.
discursus DAG which includes a metric node
Notice that I said that the metric is defined on top of a model. But really, we should be saying that a metric is an attribute of an entity. In our example above, we would then be saying that event_count is the number of occurrences of the event entity.
And that is exactly what dbt Labs is working on as their next step of deploying the semantic layer. There is an open conversation in the dbt-core git repository about introducing the entity concept in their syntax. How that will manifest is beyond the scope of this post, but it’s a very interesting read for anyone curious about how that syntax is evolving.
Serving your semantic layer
Although dbt is making interesting contributions to the semantic layer, and there is a growing ecosystem of integrations, the semantic layer is fragmented. Different syntaxes and engines are available, but none has won the day yet.
It’s not uncommon for a data platform to have a semantic layer with the following components:
dbt: to define entities, attributes, relationships and field descriptions
Looker (LookML): to provide self-serve BI capabilities for internal users
Cube: to serve data to external-facing apps
dbml: to document and share the latest state of a platform’s data assets
Each of those components has its own syntax. That is a lot of manual work to port your semantic definitions from one tool to another… and it’s prone to errors.
That’s where you need a tool like Droughty to build bridges between your semantic layer components.
Reading from the data warehouse’s schemas, Droughty auto-generates semantic definition files. For example, I can simply run this:
droughty cube - profile-dir ./droughty_profiles.yml - project-dir ./droughty_projects.yml
And it will generate the following Cube definitions:
cube(`events_fct`, {
sql: `select * from analytics.events_fct`,
joins: {
observations_fct: {
relationship: `hasMany`,
sql: `${CUBE.event_pk} = ${observations_fct.event_fk}`,
},
},
dimensions: {
action_geo_country_code: {
sql: `action_geo_country_code`,
type: `string`,
description: `action geo country codes (agcc) are codes assigned to countries of the world by the action geo data company. the codes consist of five numbers and are used to classify countries by geographical region. action geo country codes are used in data analysis related to geospatial applications, including mapping, marketing, and population studies.`,
},
event_date: {
sql: `event_date`,
type: `time`,
description: `event date is the day or days that a particular event is scheduled to take place.`,
},
event_pk: {
primaryKey: true,
type: `string`,
sql: `event_pk`,
description: `an event primary key is a unique identifier associated with an event, such as a conference, corporate gathering, or other special event. the primary key is typically assigned to each attendee and is used to help control event access and track attendance.`,
},
}
});
cube(`events_fct_extended`, {
sql: `select * from ANALYTICS_QA.events_fct`,
extends: events_fct,
measures: { count_of_event_pk: {
sql: `event_pk`,
type: `count`,
},
}
});
Auto-generating semantic definition files accelerate the flow from modifying your semantic layer in dbt and exposing those changes to your semantic engines.
Below is an example of how I can now query my data warehouse with Cube using the semantic elements that were defined in dbt.
Conclusion
As stated in a prior post, dbt is positioning itself to be the data platform standard. The dbt semantic syntax is promising and is worth integrating into data platforms. But that can’t in itself drive an entire data platform’s semantic layer.
We rely on other tools, which are well-established, mature and reliable. That includes Cube which is a great API solution for your data. It has its own syntax to define the semantic meaning of your data, but it shouldn’t be a pain to derive its definitions from what has already been defined through dbt.
Tools like Droughty are necessary when you want to build a data platform that can use what’s best from different ecosystems. Without formal integrations between certain tools, it’s up to us data practitioners to force their hands a little.