Commit f8b9964

docs: More info on indexes
1 parent 019837f commit f8b9964

1 file changed

+24
-13
lines changed

docs/docs-new/pages/product/caching/using-pre-aggregations.mdx

Lines changed: 24 additions & 13 deletions
@@ -228,15 +228,22 @@ cube(`orders`, {
228228

229229
### When to use indexes?
230230

231-
Indexes are great when you filter large amounts of data across one or several
232-
dimension columns. You can read more about them
233-
[here][ref-schema-ref-preaggs-index].
231+
When you define a pre-aggregation without any indexes, a default index is created.
232+
For the default index, dimensions come first, time dimensions come second, and measures come last.
233+
At query time, if the default index can't be selected for a merge sort scan, hash aggregation is used instead.
234+
This usually means that the full table needs to be scanned to get query results.
235+
A full scan is usually not a problem if the pre-aggregation table is only several MB in size.
236+
Once the table grows beyond that size, indexes are usually required to achieve optimal performance,
237+
especially if a particular query doesn't use all of the columns from the pre-aggregation.
238+
You can read more about indexes [here][ref-schema-ref-preaggs-index].
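
The default column order described above can be sketched in plain JavaScript. This is only an illustration: `defaultIndexColumns` is a hypothetical helper, not part of Cube's API, and the assumption that dimensions keep their definition order among themselves is ours.

```javascript
// Hypothetical helper (not a Cube API): lists the columns of the default
// index in the order described above: dimensions first, then the time
// dimension, then measures.
function defaultIndexColumns({ dimensions = [], timeDimension = null, measures = [] }) {
  return [
    ...dimensions,
    // Assumption: at most one time dimension per pre-aggregation.
    ...(timeDimension ? [timeDimension] : []),
    ...measures,
  ];
}

// Column names borrowed from the orders example used on this page.
const columns = defaultIndexColumns({
  dimensions: ["product_name", "product_category", "zip_code"],
  timeDimension: "timestamp",
  measures: ["order_total"],
});

console.log(columns.join(", "));
```

A query whose filters and grouping don't line up with the leading columns of this order is the kind of query that may fall back to hash aggregation.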
234239

235240
### Best Practices
236241

237242
To maximize performance, you can introduce an index per type of query so the set
238243
of dimensions used in the query overlap as much as possible with the ones
239-
defined in the index. Measures are traditionally only used in indexes if you
244+
defined in the index.
245+
Because indexes are sorted copies of the data, you don't incur any additional costs on the data warehouse side. However, every index you add multiplies the build time of that pre-aggregation.
246+
Measures are traditionally only used in indexes if you
240247
plan to filter a measured value and the cardinality of the possible values of
241248
the measure is low.
242249

@@ -245,6 +252,10 @@ suboptimal ordering can lead to diminished performance. To improve the
245252
performance of an index the main thing to consider is the order of the columns
246253
defined in it.
247254

255+
The key property of additive rollups is that, for most queries, there's at least one index that lets the query scan very little data, which makes it very fast.
256+
There are, however, exceptions to this rule, such as TopK queries or low selectivity range filters used without high selectivity single value filters.
257+
Those use cases are usually best optimized by remodeling the data and queries.
258+
248259
The rule of thumb for index column order is:
249260

250261
- Single value filters come first
@@ -258,11 +269,11 @@ structure:
258269

259270
| timestamp | product_name | product_category | zip_code | order_total |
260271
| ------------------- | ------------- | ---------------- | -------- | ----------- |
261-
| 2023-01-01 10:00:00 | Plastic Chair | Furniture | 88523 | 2000 |
262272
| 2023-01-01 10:00:00 | Keyboard | Electronics | 88523 | 1000 |
263273
| 2023-01-01 10:00:00 | Mouse | Electronics | 88523 | 800 |
264-
| 2023-01-01 11:00:00 | Plastic Chair | Furniture | 88524 | 3000 |
274+
| 2023-01-01 10:00:00 | Plastic Chair | Furniture | 88523 | 2000 |
265275
| 2023-01-01 11:00:00 | Keyboard | Electronics | 88524 | 2000 |
276+
| 2023-01-01 11:00:00 | Plastic Chair | Furniture | 88524 | 3000 |
266277

267278
The pre-aggregation code would look as follows:
268279

@@ -416,13 +427,13 @@ cubes:
416427
417428
Then the data within `category_productname_zipcode_index` would look like:
418429

419-
| product_category | product_name | zip_code | timestamp | order_total |
420-
| ---------------- | ------------- | -------- | ------------------- | ----------- |
421-
| Furniture | Plastic Chair | 88523 | 2023-01-01 10:00:00 | 2000 |
422-
| Electronics | Keyboard | 88523 | 2023-01-01 10:00:00 | 1000 |
423-
| Electronics | Mouse | 88523 | 2023-01-01 10:00:00 | 800 |
424-
| Furniture | Plastic Chair | 88524 | 2023-01-01 11:00:00 | 3000 |
425-
| Electronics | Keyboard | 88524 | 2023-01-01 11:00:00 | 2000 |
430+
| product_category | zip_code | product_name | timestamp | order_total |
431+
| ---------------- | -------- | ------------- | ------------------- | ----------- |
432+
| Electronics | 88523 | Keyboard | 2023-01-01 10:00:00 | 1000 |
433+
| Electronics | 88523 | Mouse | 2023-01-01 10:00:00 | 800 |
434+
| Electronics | 88524 | Keyboard | 2023-01-01 11:00:00 | 2000 |
435+
| Furniture | 88523 | Plastic Chair | 2023-01-01 10:00:00 | 2000 |
436+
| Furniture | 88524 | Plastic Chair | 2023-01-01 11:00:00 | 3000 |
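
To make the idea of an index as a sorted copy more concrete, here is a minimal simulation in plain JavaScript. It sorts the sample rows by `product_category`, `zip_code`, and `product_name`, then answers a single value filter on `product_category` by reading one contiguous range. The helpers `compareByColumns`, `lowerBound`, and `scanCategory` are hypothetical names for this sketch and say nothing about how Cube Store is implemented internally.

```javascript
// Sample rows from the example pre-aggregation table.
const rows = [
  { timestamp: "2023-01-01 10:00:00", product_name: "Keyboard", product_category: "Electronics", zip_code: 88523, order_total: 1000 },
  { timestamp: "2023-01-01 10:00:00", product_name: "Mouse", product_category: "Electronics", zip_code: 88523, order_total: 800 },
  { timestamp: "2023-01-01 10:00:00", product_name: "Plastic Chair", product_category: "Furniture", zip_code: 88523, order_total: 2000 },
  { timestamp: "2023-01-01 11:00:00", product_name: "Keyboard", product_category: "Electronics", zip_code: 88524, order_total: 2000 },
  { timestamp: "2023-01-01 11:00:00", product_name: "Plastic Chair", product_category: "Furniture", zip_code: 88524, order_total: 3000 },
];

// Index column order: single value filter first, then the GROUP BY column.
const indexColumns = ["product_category", "zip_code", "product_name"];

// Builds a comparator that orders rows column by column.
function compareByColumns(columns) {
  return (a, b) => {
    for (const column of columns) {
      if (a[column] < b[column]) return -1;
      if (a[column] > b[column]) return 1;
    }
    return 0;
  };
}

// The "index" is a sorted copy of the rows; the original array is untouched.
const indexRows = [...rows].sort(compareByColumns(indexColumns));

// Binary search for the first row whose column value is >= the given value.
function lowerBound(sorted, column, value) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid][column] < value) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Reads only the contiguous run of rows that match the category.
function scanCategory(sorted, category) {
  const matched = [];
  for (
    let i = lowerBound(sorted, "product_category", category);
    i < sorted.length && sorted[i].product_category === category;
    i++
  ) {
    matched.push(sorted[i]);
  }
  return matched;
}

console.log(indexRows.map((r) => `${r.product_category} ${r.zip_code} ${r.product_name}`));
console.log(scanCategory(indexRows, "Furniture").length);
```

Because matching rows sit next to each other in the sorted copy, the filter touches only those rows instead of the full table.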
426437

427438
`product_category` column comes first as it's a single value filter. Then
428439
`zip_code` as it's `GROUP BY` column. `product_name` comes last as it's a