Advanced Pre-aggregations
This guide covers a set of advanced pre-aggregation topics that help you optimise performance and handle more complex data scenarios.
Handling incremental data loads
Sometimes your source data is updated incrementally for example: only the last few days are reloaded or updated while older data remains unchanged. In these cases, it’s more efficient to build your pre-aggregations incrementally instead of rebuilding the entire dataset.
Using the customers cube example:
pre_aggregations:
- name: daily_count_by_countries
measures:
- CUBE.count
dimensions:
- CUBE.country
time_dimension: CUBE.signed_up_at
granularity: day
partition_granularity: day
build_range_start:
sql: SELECT NOW() - INTERVAL '365 day'
build_range_end:
sql: SELECT NOW()
refresh_key:
every: 1 day
incremental: true
update_window: 3 dayThings to notice:
- Most queries focus on the past year, so we limit the build range to 365 days using
build_range_startandbuild_range_end. Learn more here (opens in a new tab). partition_granularity: daysplits the pre-aggregation into daily partitions, making it possible to refresh only the days that change instead of rebuilding the whole year.- Partitioned pre-aggregations require both a
time_dimensionand agranularity. See the Cube docs on supported values (opens in a new tab). - With
incremental: trueandupdate_window: 3 day, Cube refreshes only the last three partitions each day. Learn more aboutupdate_window(opens in a new tab) andincremental(opens in a new tab) .
Without update_window, Cube refreshes partitions strictly according to partition_granularity (in this case, just the last day).
Indexes
Indexes make data retrieval faster. Think of an index as a shortcut that points directly to the relevant rows instead of searching through all the data. This speeds up queries that filter, group, or join on specific fields.
In the context of pre-aggregations, indexes help Cube Store (opens in a new tab) quickly locate and read only the data needed for a query improving performance, especially on large datasets.
Indexes are particularly useful when:
- For larger pre-aggregations, indexes are often required to achieve optimal performance, especially when a query doesn’t use all dimensions from the pre-aggregation.
- Queries frequently filter on high-cardinality dimensions, such as
product_idordate. Indexes help Cube Store find matching rows faster in these cases. - You plan to join one pre-aggregation with another, such as in a
rollup_join.
Adding indexes doesn’t change your data, it simply makes Cube Store more efficient at finding it.
Using indexes in pre-aggregations
Let’s start with a simple products model and define a products_preagg pre-aggregation.
Here we add an index on size within our pre-aggregation, which Cube Store uses to quickly resolve joins and filters involving that indexed column.
cubes:
- name: products
sql_table: my_db.main.products
data_source: default
dimensions:
- name: id
sql: id
type: number
primary_key: true
public: true
- name: name
sql: name
type: string
- name: size
sql: size
type: string
measures:
- name: count
type: count
title: "# of products"
- name: price
type: sum
title: Total USD
sql: price
joins:
- name: orders
sql: "{CUBE.id} = {orders.product_id}"
relationship: one_to_many
pre_aggregations:
- name: products_preagg
type: rollup
dimensions:
- size
measures:
- count
- price
indexes:
- name: product_index
columns:
- sizeIn this example:
-
The
products_preaggpre-aggregation stores aggregated products data by size dimension. -
The index
product_indexonsizespeeds up queries using that dimension. -
Make sure the column you’re indexing is also included in the pre-aggregation dimensions; otherwise, Cube will return an error like:
Error during create table: Column 'products__id' in index 'products_products_preagg_product_index' is not found in table 'products_products_preagg'
Each index adds to the pre-aggregation build time, since all indexes are created during ingestion. Add only the ones you need.
Learn more about indexes here (opens in a new tab).
Rollup_join
-
Cube can run SQL joins across different data sources. For example, you might have products in PostgreSQL and orders in MotherDuck.
-
All pre-aggregations so far have been of type rollup (which is the default pre-aggregation type). Cube also supports
rollup_join, which combines data from two or more rollups coming from different data sources. -
rollup_joinjoins pre-aggregated data inside cube store (opens in a new tab), so you can query it together efficiently.
You don’t need a rollup_join to join cubes from the same data source. Just include the other cube’s dimensions and measures directly in your rollup definition as mentioned here
Let’s extend the example from the indexes section. We’ll keep the products model from the PostgreSQL (default) data source. Since it joins to the orders model on the id column, we’ll need to update the pre-aggregation to include id and name and add an index on it.
pre_aggregations:
- name: products_preagg
type: rollup
dimensions:
- id
- name
- size
measures:
- count
- price
indexes:
- name: product_index
columns:
- id
refresh_key:
every: 1 hourThe new orders model from MotherDuck data source will be added to show how to run analytics across databases.
cubes:
- name: orders
sql_table: public.orders
data_source: motherduck
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: created_at
sql: created_at
type: time
- name: product_id
sql: product_id
type: number
public: false
measures:
- name: count
type: count
title: "# of orders"
joins:
- name: products
sql: "{CUBE.product_id} = {products.id}"
relationship: many_to_one
pre_aggregations:
- name: orders_preagg
type: rollup
dimensions:
- product_id
- created_at
measures:
- count
time_dimension: CUBE.created_at
granularity: day
indexes:
- name: orders_index
columns:
- product_id
refresh_key:
every: 1 hour
- name: orders_with_products_rollup
type: rollup_join
dimensions:
- products.name
- orders.created_at
measures:
- orders.count
time_dimension: orders.created_at
granularity: day
rollups:
- products.products_preagg
- orders_preaggThings to notice:
-
ordersuses the MotherDuck data source. -
productsuses default data source (for example, PostgreSQL). Learn more about connecting to multiple datasources here. -
Always reference dimensions explicitly in your joins between models, especially when using a
rollup_join:joins: - name: products sql: "{CUBE.product_id} = {products.id}" relationship: many_to_oneIf you use
{CUBE}.product_idor{products}.id, Cube will not recognise them as dimension references and will return an error like:From members are not found in [] for join ... Please make sure join fields are referencing dimensions instead of columns. -
Indexes are required when using
rollup_joinpre-aggregations so Cube Store can join multiple pre-aggregations efficiently.Without the right index, Cube may fail to plan the join and return an error like:
Error during planning: Can't find index to join table ... Consider creating index ... ON ... (orders__product_id)Therefore, notice that we have indexed the join keys on both sides:
- `products.products_preagg` → index on `id` - `orders.orders_preagg` → index on `product_id` -
orders_with_products_rollupcombines both pre-aggregations inside Cube Store using the typerollup_join.The
rollups:property lists which pre-aggregations to join together:rollups: - products.products_preagg - orders_preagg -
We also added a
time_dimensionwith day-level granularity inorders_with_products_rollup.We expect users to ask questions at a daily level, such as “How many orders were placed per product each day?”. Setting the
time_dimensionto day ensures Cube builds and queries this data efficiently.
rollup_join is an ephemeral pre-aggregation. It uses the referenced pre-aggregations at query time, so freshness is controlled by them, not the rollup_join itself.
- Notice that we’ve set the
refresh_keyto 1 hour on both referenced pre-aggregations (products_preaggandorders_preagg) to keep the data up to date. Learn more about refreshing pre-aggregations here.
How rollup_join works in Embeddable
In this example, we’ll find the total number of orders for each product. The product name comes from the products model, while the orders count comes from the orders model.
Things to notice:
- The query’s FROM clause references both pre-aggregations. This is how Cube joins pre-aggregated datasets from different data sources inside Cube Store.
Benefits of using rollup_join
- Enables cross-database joins inside Cube Store
- Leverages indexed pre-aggregations for efficient distributed joins
- Avoids the need for ETL or database federation
- Provides consistent, scalable analytics across data sources
Learn more about rollup_join here (opens in a new tab).
Next Steps
The next step is to setup Embeddable’s Caching API to refresh pre-aggregations for each of your security contexts. Without it, pre-aggregations will only refresh on demand.