Prerequisites for pre-aggregations

Before enabling pre-aggregations, it’s recommended to review the performance of these queries. In many cases, straightforward database optimisations are sufficient, and pre-aggregations may not be required.

This guide highlights the areas that most commonly affect query performance and are worth reviewing before introducing pre-aggregations.

Indexes

Indexes are the single most effective way to improve query performance. They allow the database to locate rows efficiently without scanning the entire table.

What is an index?

Without an index, the database scans every row to find matches:

events table
-------------------------
id | account_id | status
1  | 42         | active
2  | 17         | active
3  | 42         | inactive
4  | 99         | active

WHERE account_id = 42

Things to notice:

Every row is checked.
With an index on account_id, the database uses a lookup structure to jump directly to matching values instead of scanning rows one by one.
Indexes are sorted, which allows the database to navigate efficiently.

Conceptual example

Assume an index on account_id with these values:

12, 17, 42, 42, 58, 73, 99

                [42]
              /      \
         [17]          [73]
        /    \        /    \
     [12]   [42]   [58]   [99]

WHERE account_id = 58

The database follows the sorted structure and reaches 58 in a small number of comparisons.
The database does not scan values sequentially.

Why this matters

Full table scans check every row
Index lookups follow a small number of comparisons
Lookup cost remains predictable as data grows

Which columns should be indexed?

Index columns commonly used in:

Filters

WHERE account_id = 42 AND status = 'active'

Index:

account_id
status (if selective)

Joins

Always join on indexed columns.

JOIN users ON events.user_id = users.id

Index:

events.user_id
users.id

Grouping

GROUP BY account_id, event_date

Index:

account_id
event_date

Composite indexes

If filters commonly appear together:

WHERE account_id = 42 AND created_at >= ?

Create a composite index:

(account_id, created_at)

Column order should match common filter patterns.

Index pitfalls

Too many indexes

Inserts and updates become slower
Storage usage increases

Each index must be maintained as data changes.

Indexing low-value columns

Indexes are effective only when they significantly reduce the number of rows scanned.

Avoid indexing columns with very few distinct values, such as:

Boolean flags
Status columns where most rows share the same value

If a filter matches most rows, an index adds overhead without improving performance.

Breaking index usage

Functions and casts prevent index usage as it changes the indexed value, forcing the database to evaluate every row instead of using the index.

WHERE CAST(user_id AS INTEGER) = 42

WHERE LOWER(email) = 'test@example.com'

Primary key vs explicit indexes

Primary key index
- Created automatically
- Optimised for point lookups
- Rarely helps aggregation queries
Explicit indexes
- Created using CREATE INDEX
- Used for filters, joins, and groupings

Pre-aggregations typically benefit from explicit indexes, not from the primary key.

Partitioning Large Tables

Partitioning becomes useful as tables grow and queries filter by time.

What is partitioning?

Partitioning splits a large table into smaller physical segments, most commonly by time.

Example: partitioning a 1M-row table

Assume an events table with 1,000,000 rows spanning one year.

Without partitioning:

events (1,000,000 rows)
└── All data stored together

A query like:

WHERE created_at >= '2024-12-01'

Must scan a large portion of the table.

With monthly time-based partitioning:

events
├── events_2024_01 (~83k rows)
├── events_2024_02 (~83k rows)
├── ...
├── events_2024_12 (~83k rows)

The same query scans only:

events_2024_12

All other partitions are skipped.
This behaviour is known as partition pruning.

Why this matters

Far fewer rows are scanned
Less data is read from disk
Performance remains predictable as data grows

Partitioning is especially effective when queries consistently filter by time.

Best practices

Partition by time
Use daily or monthly partitions
Automate partition creation and cleanup

Review Query Execution Plans

Indexes and partitioning only help if the database actually uses them.

Query execution plans show how a query is executed and are the most reliable way to understand where time and resources are spent.

Most databases support:

EXPLAIN <query>

Use execution plans to:

Confirm indexes are being used
Verify partition pruning
Identify full table scans
Validate performance improvements

Red flags

Seq Scan on events
Filter: (created_at >= '2024-01-01' AND status = 'active')

A sequential scan means:

The entire table is read
Filters are applied after scanning all rows
Query cost grows linearly as the table grows

On large tables, this usually indicates:

Missing or unusable indexes
Functions preventing index usage
Partitioning not being applied

Sequential scans are expected on small tables. They are problematic on large, frequently queried tables.

Good signs

Index Scan using idx_events_created_at on events
Index Cond: (created_at >= '2024-01-01')

This indicates:

Indexes are being used
Filters are applied early
Fewer rows are scanned
Performance scales predictably

Execution plans should be reviewed before and after changes to confirm impact.

Next Steps

Once you have these optimisations in place to improve query performance, the next step is to use pre-aggregations to further speed up your queries.

Level 1 cache: in-memory Pre-aggregations 101