Dynamic Data Model Definition in Cube using Python and Jinja

At Rittman Analytics we’re big fans of Cube, pioneer of the “Headless BI” movement and used in many of our client projects including the subject of one of our most recent case studies, Osprey Charging in the UK.

Cube now supports the use of Python and Jinja for defining dynamic data models, a feature introduced as part of Cube’s improved integration with dbt Core via the new cube_dbt package that simplifies the process of defining the Cube semantic layer on top of dbt models by enabling developers to load metadata from a dbt project’s manifest.json file.

 
 

Even more interestingly this new ability to declare and invoke Python functions from within a Jinja template gives us a means to de-duplicate common patterns in our data models and the dynamic generation of data models from a remote data source or conditional logic.

Dynamic Generation of Data Models

Consider a situation where you have a multi-tenant Cube setup where a number of tenants have variations on a base Cube data model that is then extended or amended for each tenant based on their particular reporting needs.

Up until now we would need to handle these variations in each tenants’ cube data model by storing the base version in version control as a template and then write our own custom devops process to inject additional Cube YAML or JS dimension and measure definitions into those base models, hoping at the end that it’ll all still work and compile correctly.

Now we can simplify this tenant-level customisation by putting the customisable part of the cube model into a Python function, for example, and then referencing that via Jinja in the cube data model definition, like this:

# Define a Python function that returns a different dimension definition based on user input
def get_dimension(user_input):
    if user_input == 'option1':
        return "dimension1: { sql: ${CUBE}.'column1' ;; }"
    elif user_input == 'option2':
        return "dimension2: { sql: ${CUBE}.'column2' ;; }"
    else:  # Default to option3
        return "dimension3: { sql: ${CUBE}.'column3' ;; }"

# Use Jinja to inject the appropriate dimension definition into the cube definition
cube_definition = f"""
cube('MyCube', {{
  sql: \`SELECT * FROM my_table\`,

  dimensions: {{
    {% set user_input = 'option1' %}  # This would be provided by the user in a real scenario
    {% set dimension_definition = get_dimension(user_input) %}
    {{ dimension_definition }}
  }}
}});
"""
We could take this down to the individual dimension column level and tie this to user input in the front-end application, like this:
def get_dimension(user_input):
    if user_input == 'dealAmount':
        return "dealAmount: { sql: ${CUBE}.deal_amount ;; type: number }"
    elif user_input == 'dealType':
        return "dealType: { sql: CASE WHEN ${CUBE}.deal_type IS NULL THEN 'Existing Business' ELSE ${CUBE}.deal_type END ;; type: string }"
    else:  # Default to dealSource
        return "dealSource: { sql: ${CUBE}.deal_source ;; type: string }"

cube_definition = f"""
cube('deals_fact', {{
  sql_table: 'analytics.deals_fact',
  dimensions: {{
    {% set user_input = 'dealAmount' %}  # This would be provided by the user in a real scenario
    {% set dimension_definition = get_dimension(user_input) %}
    {{ dimension_definition }}
  }}
}});
"""

We could also dynamically change the aggregation method for a measure in this way too, like this:

def get_aggregation_level(user_input):
    if user_input == 'sum':
        return "totalDealAmount: { sql: ${CUBE}.deal_amount ;; type: sum }"
    elif user_input == 'avg':
        return "totalDealAmount: { sql: ${CUBE}.deal_amount ;; type: avg }"
    else:  # Default to count
        return "totalDealAmount: { sql: ${CUBE}.deal_amount ;; type: count }"

cube_definition = f"""
cube('deals_fact', {{
  sql_table: 'analytics.deals_fact',
  measures: {{
    {% set user_input = 'sum' %}  # This would be provided by the user in a real scenario
    {% set aggregation_definition = get_aggregation_level(user_input) %}
    {{ aggregation_definition }}
  }}
}});
"""

However that this doesn’t mean that you can now redefine a cube data model on the fly, as you still need to restart the Cube server to pick up the new data model definition and rebuild any pre-aggregations that may have changed from those defined in the old data model.

In projects we deliver using Cube we would typically handle ad-hoc analysis and on-the-fly groupings in the BI tool, for example Preset, rather than requiring amendments to the cube configuration. This approach allows for greater flexibility and ease of use for end-users, while also maintaining the integrity and consistency of the cube data model.

Environment Separation & Data Validation

Perhaps a more appropriate use of this new feature would be to use Python and Jinja to generate different cube definitions based on the current environment; for example, you might want to use different SQL tables for development and production environments.

def get_sql_table(environment):
    if environment == 'development':
        return "`analytics.deals_fact_dev`"
    else:  # Default to production
        return "`analytics.deals_fact`"

cube_definition = f"""
cube('deals_fact', {{
  sql_table: '{get_sql_table('development')}',  # This would be provided by the environment in a real scenario
  // ... rest of the cube definition ...
}});
"""

Or you might want to add data validation in your cube definitions, defining a Python function that checks the validity of user input or the results of a query for use with a Jinja template that includes this logic in the cube definition.

def validate_deal_amount(deal_amount):
    if deal_amount < 0:
        raise ValueError("Deal amount cannot be negative")
    return deal_amount

cube_definition = f"""
cube('deals_fact', {{
  sql_table: 'analytics.deals_fact',
  measures: {{
    totalDealAmount: {{
      sql: `validate_deal_amount(${CUBE}.deal_amount)`,
      type: `sum`
    }},
    // ... rest of the cube definition ...
  }}
}});
"""

Generating Cube Data Models from an API Call

Finally though let’s look at what’s probably the most relevant use of now having Python and Jinja support in Cube; the ability to generate data models from a remote data source, for example an API.

This new ability could be very useful, for example, in situations where you want to dynamically serve-up Cube data model definitions for a data source that evolves frequently, or where you want to programatically generate model definitions for new or temporary data sources.

In the code example below we’ve mocked-up an API to provide us with the necessary data to generate cube data models, along with a helper function to load data from the the mocked-up remote API that we’d then store in the model/globals.py file:

from cube import TemplateContext
template = TemplateContext()@template.function('load_data')
def load_data():
    client = MyApiClient("example.com")
    return client.load_data()class MyApiClient:
    def __init__(self, api_url):
        self.api_url = api_url    # mock API call
    def load_data(self):
        api_response = {
            "cubes": [
                {
                    "name": "cube_from_api",
                    "measures": [
                        {"name": "count", "type": "count"},
                        {"name": "total", "type": "sum", "sql": "amount"}
                    ],
                    "dimensions": []
                },
                {
                    "name": "cube_from_api_with_dimensions",
                    "measures": [
                        {"name": "active_users", "type": "count_distinct", "sql": "user_id"}
                    ],
                    "dimensions": [
                        {"name": "city", "sql": "city_column", "type": "string"}
                    ]
                }
            ]
        }
        return api_response

The code example defines a load_data function that uses a MyApiClient class to load data from a remote API. Note that the load_data function is decorated with the @template.function decorator, allowing it to be called from within a Jinja template.

Then we use the load_data function called from within a Jinja template to generate the Cube data model, with a loop around cube creation logic to generate a cube for each item returned by the API.

cube_definition = """
{% set data = load_data() %}
{% for cube in data.cubes %}
cube('{{ cube.name }}', {
  measures: {
    {% for measure in cube.measures %}
    {{ measure.name }}: {
      type: '{{ measure.type }}',
      sql: '{{ measure.sql }}'
    },
    {% endfor %}
  },
  dimensions: {
    {% for dimension in cube.dimensions %}
    {{ dimension.name }}: {
      type: '{{ dimension.type }}',
      sql: '{{ dimension.sql }}'
    },
    {% endfor %}
  }
})
{% endfor %}
"""

Interested? Find Out More!

Rittman Analytics is a boutique data analytics consultancy that works with growth-stage, mid-market and enterprise businesses in the UK, EU and North America who want to build a world-class data capability based around a modular, scalable modern data stack.

We’re authorised delivery partners for Cube along with Dagster, Preset, dbt Labs, Fivetran, Rudderstack and Google Cloud and are experts at helping you choose and implement the right ones for your organisation’s needs, use-cases and budget.

If you’re looking for some help and assistance with Cube or would just like to talk shop and share ideas and thoughts on what’s going on in your organisation and the wider data analytics world, contact us now to organise a 100%-free, no-obligation call — we’d love to hear from you!

Mark Rittman

CEO of Rittman Analytics, host of the Drill to Detail Podcast, ex-product manager and twice company founder.

https://rittmananalytics.com
Previous
Previous

Rittman Analytics Achieves the Data Analytics Partner Specialization in Google Cloud Partner Advantage

Next
Next

Data Analytics Project Planning Checklist:The Definitive Guide To Planning Your Data Analytics Initiative