
Level 2 cache: pre-aggregations

In addition to the Level 1 in-memory cache, Embeddable leverages Cube's Level 2 cache through pre-aggregations to enhance query performance and scalability.

What are pre-aggregations?

Pre-aggregations allow you to compute and store aggregations (like sums, averages, and counts, but also a set of dimensions) in advance rather than calculating them on the fly for each query. When a query can be fully answered from a pre-aggregation, results are retrieved directly from the pre-aggregated table instead of scanning raw data, significantly speeding up response times (and reducing cost and load on your databases). This also means that multiple components + charts on your dashboard can retrieve data from just one or a few pre-aggregations (meaning fewer queries).

Pre-aggregations: You can think of a pre-aggregation as a simple temporary table (or materialized view), but one managed and stored automatically by Embeddable rather than in your database. You control how often each pre-aggregation is refreshed using the Caching API.
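
If it helps to picture it, the stored result behaves much like a materialized view. The sketch below is purely illustrative (the table and column names are made up, and in practice Embeddable runs and stores the result for you rather than creating anything in your database):

-- Purely illustrative: a pre-aggregation behaves roughly like this
-- materialized view, except it is created, stored and refreshed by
-- Embeddable rather than living in your database.
CREATE MATERIALIZED VIEW orders_by_month AS
SELECT
  date_trunc('month', created_at) AS month,
  country,
  COUNT(*) AS order_count,
  SUM(amount) AS total_amount
FROM orders
GROUP BY 1, 2;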

Why use pre-aggregations?

Without pre-aggregations, all data loading requests from your Embeddable components will hit your database(s) directly every time.

This has the advantage that users will always see the latest (up-to-date) data directly from the database. However, when many users are viewing your dashboards at the same time, this can quickly mean a high load on your databases, which may affect both response times and financial costs:

  1. for OLTP (row-based) databases like Postgres, MySQL and SQL Server, it can mean that some queries take too long to return, as these databases are optimised for row-based lookups, not columnar aggregates.
  2. for OLAP (columnar) databases like Snowflake, BigQuery and Redshift, it can mean significant financial costs in serving those incoming queries.
  3. in both cases, depending on your data, a query may simply require processing many rows of data in order to answer an incoming data request.

Pre-aggregations allow you to aggregate the data in advance so that incoming queries have less work to do (which in turn means faster results).

As an example, users may want to see their data by month or week. Going through all the data, every time, and grouping it by week and month is expensive and slow. Instead, you can define a pre-aggregation that pre-aggregates the data by month and by week. That way, the incoming queries can use this pre-aggregated data in order to skip the expensive steps and answer the query faster.
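
For example, a pre-aggregation of a count measure by month can be declared in your data model along these lines (the full syntax is explained in the sections below; the measure and time dimension names here are placeholders):

    pre_aggregations:
      - name: count_by_month
        measures:
          - CUBE.count
        time_dimension: CUBE.created_at
        granularity: month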

How do you define a pre-aggregation?

💡

Pre-aggregations can currently only be defined in code, not in the in-platform Data Model Editor. To add pre-aggregations, please export your data models from the Data Model Editor and add them to your code repository instead (learn how here).

To set up a pre-aggregation you need to follow two simple steps:

  1. define any pre-aggregations directly in your model files (see the example below)

  2. define a refresh schedule using the Caching API.

Each pre-aggregation is a list of dimensions and measures that you want to pre-aggregate. E.g. see the pre_aggregation defined at the bottom of this model file:

cubes:
  - name: customers
    sql_table: public.customers
 
    dimensions:
      - name: id
        sql: id
        type: number
        primary_key: true
      - name: country
        sql: country
        type: string
      - name: signed_up_at
        sql: created_at
        type: time
        
    measures:
      - name: count
        type: count
        title: "Number of Customers"
 
    pre_aggregations:
      - name: customer_countries
        dimensions:
          - CUBE.country

This simple pre-aggregation only has a name (required) and a single dimension. This will lead to the following SQL being run (once) against your database:

SELECT
  "customers".country "customers__country"
FROM
  public.customers AS "customers"
GROUP BY
  1
ORDER BY
  1

and the results will be stored as “a pre-aggregation”:

customers__country
Australia
Belgium
Canada
Germany

Refreshing pre-aggregations

You need to use Embeddable’s Caching API to set up a schedule for refreshing your pre-aggregations. Without the Caching API, your pre-aggregation will run only once (upon “publish”) and the data will not be refreshed, i.e. your customers will only ever see the data from the initial pre-aggregation.

If you’ve read Cube’s documentation you will find references to the refreshKey. This is not supported for pre-aggregations in Embeddable (it is only supported for the in-memory cache).

Instead, please use our Caching API to define the schedule for your pre-aggregations.

How do pre-aggregations work?

Now that a pre-aggregation exists, any incoming data request from your components will first check whether there is a relevant pre-aggregation it can use:

  1. If there is, it will use it instead of hitting your database.
  2. If there is not, it will query your database instead, as usual.

The following query, from, say, a dropdown component that wants to show a list of countries, can be fully answered by the pre-aggregation above, so it will use it 👍 (and thus won’t hit your database).

{
    "query": {
        "dimensions": [
            "customers.country"
        ],
        "timeDimensions": [],
        "filters": [],
        "limit": 100
    }
}

The query below, on the other hand, asks for the same data (customers.country) but filters by the signed_up_at dimension. That dimension is not in the pre-aggregation, so there is not enough information in the pre-aggregation to serve this query; the query will therefore skip the pre-aggregation and go directly to your database.

{
    "query": {
        "dimensions": [
            "customers.country"
        ],
        "timeDimensions": [
            {
                "dimension": "customers.signed_up_at",
                "dateRange": "Yesterday"
            }
        ],
        "filters": [],
        "limit": 100
    }
}

Why do queries sometimes not use my pre-aggregations?

Importantly, if you want a query to use pre-aggregations, rather than hit your database, it must match exactly one pre-aggregation (even if you have defined multiple in your models). A query cannot combine the results of multiple pre-aggregations, or combine a pre-aggregation with data from your database. Any given query must be fully answerable from a single pre-aggregation, otherwise the query will always skip pre-aggregations and go straight to your database (you can see detailed docs on the matching algorithm here).

Let’s look at some common examples where people get tripped up:

  1. All query dimensions, measures and filters must be in the pre-aggregation
  2. Time filters must match the requested granularity
  3. If measures aren’t “additive”, this can cause cache-misses

Examples follow below.

1. All query dimensions, measures and filters must be in the pre-aggregation

Given a model with the following “count_by_countries” pre_aggregation:

cubes:
  - name: customers
    sql_table: public.customers

    dimensions:
      - name: id
        sql: id
        type: number
        primary_key: true
      - name: country
        sql: country
        type: string
      - name: signed_up_at
        sql: created_at
        type: time

    measures:
      - name: count
        type: count
        title: "Number of Customers"

    pre_aggregations:
      - name: count_by_countries
        measures:
          - CUBE.count
        dimensions:
          - CUBE.country

the following (or similar) SQL will be run against your database to generate the pre-aggregation:

SELECT
  "customers".country "customers__country",
  count("customers".id) "count"
FROM
  public.customers AS "customers"
GROUP BY
  1
ORDER BY
  2 DESC

which, in turn, will lead to the following pre-aggregation (or similar) being cached:

customers__country    count
United States         102
Australia             64
Germany               48
Canada                30
Belgium               5

That means that any of the following data queries can (and will) be served by this pre-aggregation (rather than going to your database):

  1. Count by country:

    {
        "query": {
            "dimensions": [
                "customers.country"
            ],
            "measures": [
                "customers.count"
            ]
        }
    }

    Cube can use all the columns and rows from the pre-aggregation above to serve this query, so it will use the pre-aggregation and it won't go to your database.

  2. Count customers:

    {
        "query": {
            "dimensions": [],
            "measures": [
                "customers.count"
            ]
        }
    }

    Cube can sum up the “count” column from the above pre-aggregation, so will use the pre-aggregation to serve this query.

  3. Count US customers:

    {
        "query": {
            "measures": [
                "customers.count"
            ],
            "filters": [
                {
                    "member": "customers.country",
                    "operator": "equals",
                    "values": [
                        "United States"
                    ]
                }
            ]
        }
    }

    Cube will take the “count” from the “United States” row from the pre-aggregation.

  4. Countries with more than 30 customers:

    {
        "query": {
            "dimensions": [
                "customers.country"
            ],
            "measures": [],
            "filters": [
                {
                    "member": "customers.count",
                    "operator": "gt",
                    "values": [
                        "30"
                    ]
                }
            ],
            "order": [
                [
                    "customers.country",
                    "asc"
                ]
            ]
        }
    }

    Cube will filter the pre-aggregation by the “count” column and then sort the results by the “customers__country” column.

Any query, however, where one or more of its dimensions, measures or filters are not available in a pre-aggregation will skip the pre-aggregations and hit the database instead.

For example this query:

{
    "query": {
        "dimensions": [
            "customers.id"
            "customers.country"
        ],
        "measures": []
    }
}

is asking for the “id” and “country” dimensions. The “country” dimension is in our pre-aggregation, but the “id” dimension is not. Cube will therefore go to your database instead to serve this query:

SELECT
  "customers".id "customers__id",
  "customers".country "customers__country"
FROM
  public.customers AS "customers"

because Cube can only use a pre-aggregation if it can, by itself, fully serve the given query.
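
If you did want a query like this to be served from a pre-aggregation, you would need one that contains every member the query references, for example (a sketch only; including the primary key makes the pre-aggregation roughly row-level, so whether this is worthwhile depends on your data volume):

    pre_aggregations:
      - name: ids_and_countries
        dimensions:
          - CUBE.id
          - CUBE.country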

2. Time filters must match the requested granularity

You can include a time_dimension in your pre_aggregations (in addition to measures and dimensions) like so:

cubes:
  - name: customers
  
    ...
 
    pre_aggregations:
 
      - name: count_by_week
        measures:
          - CUBE.count
        time_dimension: CUBE.signed_up_at
        granularity: week
 
      - name: count_by_month
        measures:
          - CUBE.count
        time_dimension: CUBE.signed_up_at
        granularity: month

These two pre-aggregations will lead to the following SQL statements (or similar) being run against your database:

SELECT
  date_trunc('week', "customers".signed_up_at) "customers__signed_up_at_week",
  count("customers".id) "customers__count"
FROM
  public.customers AS "customers"
GROUP BY
  1

and

SELECT
  date_trunc('month', "customers".signed_up_at) "customers__signed_up_at_month",
  count("customers".id) "customers__count"
FROM
  public.customers AS "customers"
GROUP BY
  1

respectively. The results will be cached as pre-aggregations.

Notice how the time_dimension in the pre_aggregation also takes a granularity. In order to pre-aggregate by multiple granularities you must define a separate pre-aggregation for each granularity (like we’ve done in the "customers" model above).

For any incoming queries, Cube will try to find a pre-aggregation that can fully serve its dimensions, measures and filters (including timeDimensions).

Examples:

  1. Count by month:

    {
        "query": {
            "measures": [
                "customers.count"
            ],
            "timeDimensions": [
                {
                    "dimension": "customers.signed_up_at",
                    "granularity": "month"
                }
            ]
        }
    }

    This query perfectly matches the results in our “count_by_month” pre-aggregation, so Cube will use it.

  2. Count of customers for last week:

    {
        "query": {
            "measures": [
                "customers.count"
            ],
            "timeDimensions": [
                {
                    "dimension": "customers.signed_up_at",
                    "dateRange": "Last week"
                }
            ]
        }
    }

    Cube can filter our “count_by_week” pre-aggregation to the row from last week in order to serve this query.

  3. Yesterday:

    {
        "query": {
            "measures": [
                "customers.count"
            ],
            "timeDimensions": [
                {
                    "dimension": "customers.signed_up_at",
                    "dateRange": "Yesterday"
                }
            ]
        }
    }

    Neither of our pre-aggregations can answer this query (as there isn't a "day" granularity and you can't derive "day" from "week" or "month"), so Cube will skip pre-aggregations for this query and get it instead from our database.

  4. 2024:

    {
        "query": {
            "measures": [
                "customers.count"
            ],
            "timeDimensions": [
                {
                    "dimension": "customers.signed_up_at",
                    "dateRange": ["2024-01-01", "2024-12-31"]
                }
            ]
        }
    }

    We don’t have a “year” pre-aggregation, but Cube is clever: it can use our “month” pre-aggregation to sum up the counts for each month of 2024, so it will hit “count_by_month” to serve this query.
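
As example 3 above shows, neither pre-aggregation can serve day-level queries. If queries like “Yesterday” matter for your dashboards, you can add a third pre-aggregation at day granularity, e.g.:

    pre_aggregations:
      - name: count_by_day
        measures:
          - CUBE.count
        time_dimension: CUBE.signed_up_at
        granularity: day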

3. If measures aren’t “additive”, this can cause cache-misses

Above we have described multiple situations where Cube will serve an incoming data query, not just by returning a subset of the rows in the pre-aggregation, but by actually aggregating the counts from the rows.

E.g. in the last example above, Cube will calculate the number of customers for 2024 by summing up the counts for the 12 months of 2024. This works because measures of type “count” are “additive”, i.e. if you were to query your database directly for a count of customers who signed up in 2024, it would return exactly the same number as if you asked for the counts for each month of 2024 and summed them together.
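
To make that concrete, when serving the 2024 query from the “count_by_month” pre-aggregation, Cube effectively runs something like the following against the cached table (illustrative only; the actual table name and generated SQL will differ):

SELECT
  SUM("customers__count")
FROM
  count_by_month_pre_aggregation
WHERE
  "customers__signed_up_at_month" >= '2024-01-01'
  AND "customers__signed_up_at_month" < '2025-01-01'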

Additivity is “a property of measures that determines whether measure values, once calculated for a set of dimensions, can be [accurately] further aggregated to calculate measure values for a subset of these dimensions” (source).

The only additive measure types are:

  • count
  • sum
  • min
  • max
  • count_distinct_approx

All other measure types (avg, count_distinct, etc.) are not additive:

  • avg calculates the average of a set of numbers. If January has three numbers: 3, 5 and 4 (which gives an average of 4) and February has just one number: 2 (whose average is obviously 2), then the average of January and February should be (3+5+4+2)/4 = 3.5. But if all you know are the two averages, 4 and 2, there is no way to accurately derive the correct average (you could average 4 and 2 to get 3, but this is incorrect). Thus avg measures are not additive.
  • count_distinct counts, not the total number of something (say, total visitors to a shop), but the unique (or distinct) number of something (say, unique visitors). If there were 5 unique visitors in January and 3 unique visitors in February, it does not mean that there were 8 (5+3) unique visitors across January and February. Some of these visitors may have visited in both January and February, so the actual number may be as low as 5 (but no higher than 8). Thus count_distinct measures are also not additive.

And so, because Cube cares about giving correct numbers back, it is careful with non-additive measures (like avg and count_distinct): it can still use a pre-aggregation containing them if the query can be answered by returning a subset of the pre-aggregation’s rows, but if answering the query would require aggregating those rows together in some way, it will skip the pre-aggregation and go to your database instead.

Workarounds for non-additive measures

You may be wondering: what if I want to use pre-aggregations for non-additive measures like avg and count_distinct?

There are a few workarounds:

  1. avg is, in the end, just a calculation of a total divided by a count. Both a total and a count are additive, so you can separate the calculation into two additive measures and then calculate the average dynamically (notice that average_price below is purposefully not part of the pre-aggregation):

    cubes:
      - name: customers
      
        ...
        
        measures:
          - name: total_price
            sql: price
            type: sum
          - name: count
            type: count
          - name: average_price
            sql: "{total_price} / {count}"
            type: number
     
        pre_aggregations:
          - name: price_and_count
            measures:
              - CUBE.total_price
              - CUBE.count
            
  2. count_distinct is not additive (as explained in the previous section) but Cube has introduced a useful measure type called count_distinct_approx which tries to approximate the number of unique (distinct) values (learn more here). It is useful if exact numbers are not essential and a rough estimate is sufficient.

    Just bear in mind:

    • the more rows from a pre-aggregation that you’re combining, the worse the approximation will be (e.g. if you’re calculating 2024 by aggregating 365 separate days of a count_distinct_approx measure, please don’t expect the result to be very accurate).
    • unfortunately not all database types support count_distinct_approx. Check your specific database (e.g. here you can see that MySQL doesn’t support it).
  3. And finally, you can use non-additive measures in pre-aggregations confidently as long as your queries match the row granularity of your pre-aggregations perfectly. E.g. if your pre-aggregation shows count_distinct by week and your query requests that count_distinct measure for a particular week, Cube will use the pre-aggregation and return a perfectly accurate number.
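
To illustrate the third workaround, here is a sketch of a hypothetical count_distinct measure pre-aggregated by week (the visitor_id column is an assumption, not part of the model above):

    measures:
      - name: unique_visitors
        sql: visitor_id         # hypothetical column
        type: count_distinct

    pre_aggregations:
      - name: unique_visitors_by_week
        measures:
          - CUBE.unique_visitors
        time_dimension: CUBE.signed_up_at
        granularity: week

A query that requests customers.unique_visitors with a granularity of week on signed_up_at matches the pre-aggregation rows exactly, so Cube can serve it accurately from the cache; any query that needs to combine weeks will skip it and hit your database.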

Performing joins across cubes in your pre-aggregations

All the examples so far have shown pre-aggregations using just the dimensions and measures from the cube in which it is defined. This is, however, not a requirement at all. You can easily define a pre-aggregation that uses dimensions and measures from multiple cubes, as long as the appropriate joins have been defined.

E.g. here you can see our pre_aggregation uses two dimensions (“make” and “model”) from the “vehicles” cube:

cubes:
  - name: person
  
    ...
      
    joins:
      - name: vehicles
        sql: "{CUBE}.id = {vehicles}.owner_id"
        relationship: one_to_many
      
    pre_aggregations:
      - name: person_vehicles
        dimensions:
          - CUBE.full_name
          - vehicles.make
          - vehicles.model
        measures:
          - CUBE.count
        time_dimension: CUBE.created_at
        granularity: day
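
For reference, the “vehicles” cube referenced above could look something like this (a hypothetical sketch; the table and column names are assumptions):

cubes:
  - name: vehicles
    sql_table: public.vehicles

    dimensions:
      - name: id
        sql: id
        type: number
        primary_key: true
      - name: owner_id
        sql: owner_id
        type: number
      - name: make
        sql: make
        type: string
      - name: model
        sql: model
        type: string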

Indexes

To get the best performance out of your pre-aggregations you will likely want to define indexes too.

Cube recommends that “for most queries, there should be at least one index that makes a particular query scan very little amount of data”. You can read all about indexes here.
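
As a minimal sketch (following Cube’s index syntax), an index on country could be added to the earlier count_by_countries pre-aggregation like this:

    pre_aggregations:
      - name: count_by_countries
        measures:
          - CUBE.count
        dimensions:
          - CUBE.country
        indexes:
          - name: country_index
            columns:
              - CUBE.country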