Level 2 cache: pre-aggregations
In addition to the Level 1 in-memory cache, Embeddable leverages Cube's Level 2 cache through pre-aggregations to enhance query performance and scalability.
What are pre-aggregations?
Pre-aggregations allow you to compute and store aggregations (like sums, averages, and counts, but also a set of dimensions) in advance rather than calculating them on the fly for each query. When a query can be fully answered from a pre-aggregation, results are retrieved directly from the pre-aggregated table instead of scanning raw data, significantly speeding up response times (and reducing cost and load on your databases). This also means that multiple components + charts on your dashboard can retrieve data from just one or a few pre-aggregations (meaning fewer queries).
Pre-aggregations: You can think of a pre-aggregation as a simple temporary table (or materialized view) but managed and stored automatically by Embeddable, rather than in your database. You control how often each pre-aggregation is refreshed using the Caching API
Why use pre-aggregations?
Without pre-aggregations, all data loading requests from your Embeddable components will hit your database(s) directly every time.
This has the advantage that users will always see the latest (up-to-date) data directly from the database. However, when a lot of concurrent users are viewing your dashboards at the same time, this can quickly mean a high load on your databases, which may affect their performance (both in terms of response time and financial cost):
- for OLTP (row-based) databases like Postgres, MySQL and SQLServer it can mean that some queries take too long to come back, as these databases are optimised for row-based lookup, not columnar aggregates.
- for OLAP (columnar) databases like Snowflake, BigQuery and Redshift it can mean significant financial costs in serving those incoming queries.
- in both cases, depending on your data, a query may simply require processing of many rows of data in order to answer an incoming data request.
Pre-aggregations allow you to pre-aggregate the results so that the incoming queries have less work to do to return a result (which in turn means fast results).
As an example, users may want to see their data by month or week. Going through all the data, every time, and grouping it by week and month is expensive and slow. Instead, you can define a pre-aggregation that pre-aggregates the data by month and by week. That way, the incoming queries can use this pre-aggregated data in order to skip the expensive steps and answer the query faster.
How do you define a pre-aggregation?
Pre-aggregations can currently only be defined in code, not in the in-platform Data Model Editor. To add pre-aggregations, please export it from the Data Model Editor and add it to your code repository instead (learn how here).
To set up a pre-aggregation you need to do two simple steps:
-
define any pre-aggegregations directly in your model files (see example below)
-
define a refresh schedule using the Caching API.
Each pre-aggregation is a list of dimensions and measures that you want to pre-aggregate. E.g. see the pre_aggregation
defined at the bottom of this model file:
cubes:
- name: customers
sql_table: public.customers
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: country
sql: country
type: string
- name: signed_up_at
sql: created_at
type: time
measures:
- name: count
type: count
title: "Number of Customers"
pre_aggregations:
- name: customer_countries
dimensions:
- CUBE.country
This simple pre-aggregation only has a name
(required) and a single dimension. This will lead to the following SQL being run (once) against your database:
SELECT
"customers".country "customers__country"
FROM
public.customers AS "customers"
GROUP BY
1
ORDER BY
1
and the results will be stored as “a pre-aggregation”:
customers__country |
---|
Australia |
Belgium |
Canada |
Germany |
… |
Refreshing pre-aggregations
You need to use Embeddable’s Caching API to set up a schedule for refreshing your pre-aggregations. Without the Caching API, your pre-aggregation will run only once (upon “publish”) and the data will not be refreshed. i.e. Your customers will only see the data from the initial pre-aggregation.
If you’ve read Cube’s documentation (opens in a new tab) you will find reference to the
refreshKey
. This is not supported for pre-aggregations in Embeddable (it is only supported for the in-memory cache).
Instead, please use our Caching API to define the schedule for your pre-aggregations.
How do pre-aggregations work?
Now that a pre-aggregation exists, any incoming data requests from your components will first look to see if there are any relevant pre-aggregations that it can use:
- If there are it will use them instead of hitting your database.
- If there are not, it will query your database instead, as usual.
So, the following query, from, say, a dropdown component, that wants to show a list of countries, can certainly use the above pre-aggregation, so it will 👍 (and thus it won’t hit your database).
{
"query": {
"dimensions": [
"customers.country"
],
"timeDimensions": [],
"filters": [],
"limit": 100
}
}
Whereas the below query is asking for the same data (customers.country
), but it wants to filter by the signed_up_at
dimension. That dimension is not in the pre-aggregation, so there is not enough information in the pre-aggregation to serve this query, thus the query will not use the pre-aggregation and will instead go directly to your database.
{
"query": {
"dimensions": [
"customers.country"
],
"timeDimensions": [
{
"dimension": "customers.signed_up_at",
"dateRange": "Yesterday"
}
],
"filters": [],
"limit": 100
}
}
Why do queries sometimes not use my pre-aggregations?
Importantly, if you want a query to use pre-aggregations, rather than hit your database, it must match exactly one pre-aggregation (even if you have defined multiple in your models). A query cannot combine the results of multiple pre-aggregations, or combine a pre-aggregation with data from your database. Any given query must be fully answerable from a single pre-aggregation, otherwise the query will always skip pre-aggregations and go straight to your database (you can see detailed docs on the matching algorithm here (opens in a new tab)).
Let’s look at some common examples where people get tripped up:
- All query dimensions, measures and filters must be in the pre-aggregation
- Time filters must match the requested granularity
- If measures aren’t “additive”, this can cause cache-misses
Examples follow below.
1. All query dimensions, measures and filters must be in the pre-aggregation
Given a model with the following “count_by_countries” pre_aggregation
:
cubes:
- name: customers
sql_table: public.customers
dimensions:
- name: id
sql: id
type: number
primary_key: true
- name: country
sql: country
type: string
- name: signed_up_at
sql: created_at
type: time
measures:
- name: count
type: count
title: "Number of Customers"
pre_aggregations:
- name: count_by_countries
measures:
- CUBE.count
dimensions:
- CUBE.country
the following (or similar) SQL will be run against your database to generate the pre-aggregation:
SELECT
"customers".country "customers__country",
count("customers".id) "count"
FROM
public.customers AS "customers"
GROUP BY
1
ORDER BY
2 DESC
which, in turn, will lead to the following pre-aggregation (or similar) being cached:
customers__country | count |
---|---|
United States | 102 |
Australia | 64 |
Germany | 48 |
Canada | 30 |
Belgium | 5 |
… |
That means that any of the following data queries can (and will) be served by this pre-aggregation (rather than going to your database):
-
Count by country:
{ "query": { "dimensions": [ "customers.country" ], "measures": [ "customers.count" ] } }
Cube can use all the columns and rows from the pre-aggregation above to serve this query, so it will use the pre-aggregation and it won't go to your database.
-
Count customers:
{ "query": { "dimensions": [], "measures": [ "customers.count" ] } }
Cube can sum up the “count” column from the above pre-aggregation, so will use the pre-aggregation to serve this query.
-
Count US customers:
{ "query": { "measures": [ "customers.count" ], "filters": [ { "member": "customers.country", "operator": "equals", "values": [ "United States" ] } ] } }
Cube will take the “count” from the “United States” row from the pre-aggregation.
-
Countries with more than 30 customers:
{ "query": { "dimensions": [ "customers.country" ], "measures": [], "filters": [ { "member": "customers.count", "operator": "gt", "values": [ "30" ] } ], "order": [ [ "customers.country", "asc" ] ] } }
Cube will filter the pre-aggregation by the “count” column in the pre-aggregation and then sort the results by the “customers__country” column.
Any queries, however, where one or more of the dimensions, measures or filters in the query are not available in a pre-aggregation, will skip the pre-aggregations and hit the database instead.
For example this query:
{
"query": {
"dimensions": [
"customers.id"
"customers.country"
],
"measures": []
}
}
is asking for the “id” and “country” dimensions. This “country” dimension is in our pre-aggregation, but the “id” dimension is not. Cube will therefore go to your database instead to serve this query:
SELECT
"customers".id "customers__id",
"customers".country "customers__country"
FROM
public.customers AS "customers"
because Cube can only use a pre-aggregation if it can, by itself, fully serve the given query.
2. Time filters must match the requested granularity
You can include time_dimension
in your pre_aggregations
(additionally to measures
and dimensions
) like so:
cubes:
- name: customers
...
pre_aggregations:
- name: count_by_week
measures:
- CUBE.count
time_dimension: CUBE.signed_up_at
granularity: week
- name: count_by_month
measures:
- CUBE.count
time_dimension: CUBE.signed_up_at
granularity: month
These two pre-aggregation will lead to the following SQLs (or similar) to be run against your database:
SELECT
date_trunc('week', "customers".signed_up_at) "customers__signed_up_at_week",
count("customers".id) "customers__count"
FROM
public.customers AS "customers"
GROUP BY
1
and
SELECT
date_trunc('month', "customers".signed_up_at) "customers__signed_up_at_month",
count("customers".id) "customers__count"
FROM
public.customers AS "customers"
GROUP BY
1
respectively. The results will be cached as pre-aggregations.
Notice how the
time_dimension
in thepre_aggregation
also takes agranularity
. In order to pre-aggregate by multiple granularities you must define a separate pre-aggregation for each granularity (like we’ve done in the "customers" model above).
For any incoming queries, Cube will try to find a pre-aggregation that can fully serve its dimensions, measures and filters (including timeDimensions).
Examples:
-
Count by month:
{ "query": { "measures": [ "customers.count" ], "timeDimensions": [ { "dimension": "customers.signed_up_at", "granularity": "month" } ] } }
This query perfectly matches the results in our “count_by_month” pre-aggregation, so Cube will use it.
-
Count of customers for last week:
{ "query": { "measures": [ "customers.count" ], "timeDimensions": [ { "dimension": "customers.signed_up_at", "dateRange": "Last week" } ] } }
Cube can filter our “count_by_week” pre-aggregation to the row from last week in order to serve this query.
-
Yesterday:
{ "query": { "measures": [ "customers.count" ], "timeDimensions": [ { "dimension": "customers.signed_up_at", "dateRange": "Yesterday" } ] } }
Neither of our pre-aggregations can answer this query (as there isn't a "day" granularity and you can't derive "day" from "week" or "month"), so Cube will skip pre-aggregations for this query and get it instead from our database.
-
2024:
{ "query": { "measures": [ "customers.count" ], "timeDimensions": [ { "dimension": "customers.signed_up_at", "dateRange": ["2024-01-01", "2024-12-31"] } ] } }
We don’t have a “year” pre-aggregation, but Cube is clever. We have a “month” pre-aggregation, and Cube can use that to sum up the counts for each month in 2024, and thus Cube will hit our “count_by_month” pre-aggregation to serve this query.
3. If measures aren’t “additive”, this can cause cache-misses
Above we have described multiple situations where Cube will serve an incoming data query, not just by returning a subset of the rows in the pre-aggregation, but by actually aggregating the counts from the rows.
E.g. in the last example above, Cube will calculate the number of customers for 2024 by summing up the counts for the 12 months of 2024. This works because measures of type “count” are “additive”. i.e. if you were to query your database directly for a count of customers who signed up in 2024, it will return exactly the same number as if you asked for the counts of each month in 2024 and sum them together.
Additivity is “a property of measures that determine whether measure values, once calculated for a set of dimensions, can be [accurately] further aggregated to calculate measure values for a subset of these dimensions” (source (opens in a new tab)).
The only additive measures are:
- count (opens in a new tab)
- count_distinct_approx (opens in a new tab)
- min (opens in a new tab)
- max (opens in a new tab)
- sum (opens in a new tab)
All other measures (avg, (opens in a new tab) count_distinct (opens in a new tab), etc.) are not:
avg
calculates the average of a set of numbers. If January has three numbers: 3, 5 and 4 (which gives an average of 4) and February has just one number: 2 (whose average is obviously 2), then the average of January and February should be (3+5+4+2)/4 = 3.5. But if all you know are the two averages, 4 and 2, there is no way to accurately derive the correct average (you could average 4 and 2 to get 3, but this is incorrect). Thusavg
measures are not additive.count_distinct
counts, not the total number of something (say, total visitors to a shop), but the unique (or distinct) number of something (say, unique visitors). If there were 5 unique visitors in January and 3 unique visitors in February, it does not mean that were (5+3) 8 unique visitors in January and February. Some of these visitors may have visited in both January and February, so the actual number may be as low as 5 (but may be higher). Thuscount_distinct
measures are also not additive.
And so, because Cube cares about giving correct numbers back, if your pre-aggregation is using non-additive measures (like avg
and count_distinct
) then it can still use the pre-aggregation if the query can be answered by returning a subset of the rows from the pre-aggregation, but if it would require aggregating them together in some way then it can’t use the pre-aggregation, and Cube will choose to skip the pre-aggregation and go to your database instead.
Workarounds for non-additive measures
You may be wondering: what if I want to use pre-aggregations for non-additive measures like avg
and count_distinct
?
There are a few workarounds:
-
avg
is, in the end, just a calculation of a total divided by a count. Both a total and a count are additive, so you can separate the calculation into two additive measures and then calculate the average dynamically (notice thataverage_price
below is purposefully not part of the pre-aggregation):cubes: - name: customers ... measures: - name: total_price sql: price type: sum - name: count type: count - name: average_price sql: {total_price} / {count} type: number pre_aggregations: - name: price_and_count measures: - CUBE.total_price - CUBE.count
-
count_distinct
is not additive (as explained in the previous section) but Cube has introduced a useful measure type calledcount_distinct_approx
which tries to approximate the number of unique (distinct) values (learn more here (opens in a new tab)). It is useful if exact numbers are not essential and a rough estimate is sufficient.Just bear in mind:
- the more rows from a pre-aggregation that you’re combining, the worse the approximation will be (e.g. if you’re calculating 2024 by aggregating 365 separate days of a
count_distinct_approx
measure, please don’t expect the result to be very accurate). - unfortunately not all database types support
count_distinct_approx
. Check your specific database (e.g. here (opens in a new tab) you can see that MySQL doesn’t support it)
- the more rows from a pre-aggregation that you’re combining, the worse the approximation will be (e.g. if you’re calculating 2024 by aggregating 365 separate days of a
-
And finally, you can use non-additive measures in pre-aggregations confidently as long as your queries match the row granularity of your pre-aggregations perfectly. E.g. if your pre-aggregation shows
count_distinct
byweek
and your query requests thatcount_distinct
measure for a particular week, Cube will use the pre-aggregation and return a perfectly accurate number.
Performing joins across cubes in your pre-aggregations
All the examples so far have shown pre-aggregations using just the dimensions and measures from the cube in which it is defined. This is, however, not a requirement at all. You can easily define a pre-aggregation that uses dimensions and measures from multiple cubes, as long as the appropriate joins
have been defined.
E.g. here you can see our pre_aggregation
uses two dimensions (”make” and “model”) from the “vehicles” cube:
cubes:
- name: person
...
joins:
- name: vehicles
sql: "{CUBE}.id = {vehicles}.owner_id"
relationship: one_to_many
pre_aggregations:
- name: person_vehicles
dimensions:
- CUBE.full_name
- vehicles.make
- vehicles.model
measures:
- CUBE.count
time_dimension: CUBE.created_at
granularity: day
Indexes
To get the best performance out of your pre-aggregations you will likely want to define indexes too.
Cube recommends "for most queries, there should be at least one index that makes a particular query scan very little amount of data”. You can read all about indexes here (opens in a new tab).