@@ -10,7 +10,7 @@ configuration options to consider. Please make sure to also check [the
 Pre-Aggregations reference in the data modeling
 section][ref-schema-ref-preaggs].

-## Refresh Strategy
+## Refresh strategy

 Refresh strategy can be customized by setting the
 [`refresh_key`][ref-schema-ref-preaggs-refresh-key] property for the
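
For context, a minimal sketch of a `refresh_key` that combines `every` and `sql`, assuming a hypothetical `orders` cube with a `count` measure, a `status` dimension, and an `updated_at` column (these names are assumptions, not taken from the configuration above):

```yaml
cubes:
  - name: orders
    sql_table: orders

    measures:
      - name: count
        type: count

    dimensions:
      - name: status
        sql: status
        type: string

    pre_aggregations:
      - name: orders_by_status
        measures:
          - CUBE.count
        dimensions:
          - CUBE.status
        refresh_key:
          # Run the check query below every hour; rebuild the rollup
          # only if its result changes
          every: 1 hour
          sql: SELECT MAX(updated_at) FROM orders
```

If only `every` were set, the pre-aggregation would simply be rebuilt on that interval; adding `sql` makes the rebuild conditional on the check query returning new results.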
@@ -144,7 +144,7 @@ When `every` and `sql` are used together, Cube will run the query from the `sql`
 property on an interval defined by the `every` property. If the query returns
 new results, then the pre-aggregation will be refreshed.

-## Rollup Only Mode
+## Rollup-only mode

 To make Cube _only_ serve requests from pre-aggregations, the
 [`CUBEJS_ROLLUP_ONLY`][ref-config-env-rolluponly] environment variable can be
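
Since this is a regular environment variable, rollup-only mode is typically enabled with a single line in the deployment configuration, for example in a `.env` file (the boolean value shown is an assumption of the usual format):

```bash
# Serve queries exclusively from pre-aggregations in Cube Store
CUBEJS_ROLLUP_ONLY=true
```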
@@ -240,47 +240,66 @@ Alternatively, if you want to explicitly introduce key partitioning, you can use
 Each orchestrator ID can use a different pre-aggregation schema, so you may define those based on the partitioning key you want to introduce.
 This technique, together with multi-router Cube Store approach, allows you to achieve linear scaling on the partitioning key of your choice.

-## Using Indexes
+## Using indexes
+
+[Indexes][ref-ref-indexes] are sorted copies of pre-aggregation data.
+
+**When you define a pre-aggregation without any explicit indexes, the default
+index is created.** In this index, dimensions come first, time dimensions come
+second, and measures come last.
+
+When you define additional indexes, you don't incur any additional costs on
+the data warehouse side. However, the pre-aggregation build time for a
+particular pre-aggregation increases with each index.

 ### When to use indexes?

-When you define pre-aggregation without any indexes, the default index will be created.
-For the default index, dimensions come first, time dimensions come second, and measures come last.
-At query time, if the default index can't be selected for merge sort scan, then hash aggregation would be used.
-It usually means that the full table needs to be scanned to get query results.
-And it's usually no big deal if the pre-aggregation table is only several MB in size.
-Once you go over, indexes are usually required to achieve optimal performance.
-Especially if not all columns from pre-aggregation are used in a particular query.
-You can read more about indexes [here][ref-schema-ref-preaggs-index].
-
-### Best Practices
-
-To maximize performance, you can introduce an index per type of query so the set
-of dimensions used in the query overlap as much as possible with the ones
-defined in the index.
-As indexes are sorted copies of the data, you don't incur any additional costs on the data warehouse side, however, you multiply your build time for a given pre-aggregation with every index added.
-Measures are traditionally only used in indexes if you
-plan to filter a measured value and the cardinality of the possible values of
-the measure is low.
-
-The order in which columns are specified in the index is **very** important;
+At query time, if the default index can't be selected for a merge sort scan,
+then a less performant hash aggregation would be used. It usually means that
+the full table needs to be scanned to get query results.
+
+It usually doesn't make much difference if the pre-aggregation table is only
+several MBs in size. However, for larger pre-aggregations, indexes are usually
+required to achieve optimal performance, especially if not all dimensions from
+a pre-aggregation are used in a particular query.
+
+### Best practices
+
+Most pre-aggregations represent [additive][ref-additivity] rollups. For such
+rollups, **the rule of thumb is that, for most queries, there should be
+at least one index that makes a particular query scan a very small amount of
+data,** which makes it very fast. (There are exceptions to this rule, like
+top-k queries or queries with only low selectivity range filters. Optimization
+for these use cases usually involves remodeling data and queries.)
+
+To maximize performance, you can introduce an index per query type so
+that the set of dimensions used in a query overlaps as much as possible with
+the set of dimensions in the index. Measures are usually only used in indexes
+if you plan to filter on a measure value and the cardinality of the possible
+values of the measure is low.
+
+The order in which dimensions are specified in the index is **very** important;
 suboptimal ordering can lead to diminished performance. To improve the
-performance of an index the main thing to consider is the order of the columns
-defined in it.
+performance of an index, the main thing to consider is its order of dimensions.
+The rule of thumb for dimension order is as follows:

-The key property of additive rollups is that for most queries, there's at least one index that makes a particular query scan very little amount of data which makes it very fast.
-There however exceptions to this rule like TopK queries, use of low selectivity range filters without high selectivity single value filters, etc.
-Optimization of those use cases usually should be handled by remodeling data and queries.
+- Dimensions used in high selectivity, single-value filters come first.
+- Dimensions used in `GROUP BY` come second.
+- Everything else used in the query comes in the end, including dimensions
+  used in low selectivity, multiple-value filters.

-The rule of thumb for index column order is:
+It might sound counter-intuitive to have dimensions used in `GROUP BY` before
+dimensions used in multiple-value filters. However, Cube Store always performs
+scans on sorted data, and if `GROUP BY` matches index ordering, merge
+sort-based algorithms are used for querying, which are usually much faster
+than hash-based `GROUP BY` when index ordering doesn't match the query.

-- Single value filters come first
-- `GROUP BY` columns come second
-- Everything else used in the query comes afterward
+If in doubt, always [use `EXPLAIN` and `EXPLAIN ANALYZE`](#explain-queries)
+to figure out the final query plan.

-**Example:**
+#### Example

-Suppose you have a pre-aggregation that has millions of rows with the following
+Suppose you have a pre-aggregation that has millions of rows and the following
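
To make the dimension-ordering rule above concrete, here is a hedged sketch of an index that follows it, assuming a hypothetical `orders` cube where queries filter `product_category` to a single value, group by `zip_code`, and filter `product_name` with multiple values:

```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: orders_by_product
        measures:
          - CUBE.order_total
        dimensions:
          - CUBE.product_category
          - CUBE.product_name
          - CUBE.zip_code
        indexes:
          # Single-value filter dimension first, GROUP BY dimension second,
          # multiple-value filter dimension last
          - name: category_zip_product_index
            columns:
              - CUBE.product_category
              - CUBE.zip_code
              - CUBE.product_name
```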
@@ ... @@
-`product_category` column comes first as it's a single value filter. Then
-`zip_code` as it's `GROUP BY` column. `product_name` comes last as it's a
-multiple value filter.
+### Aggregating indexes
+
+Aggregating indexes can be defined as well. Such indexes contain **only**
+dimensions and pre-aggregated measures from the pre-aggregation definition.
+
+Queries with the following characteristics can target aggregating indexes:

-It might sound counter-intuitive to have `GROUP BY` columns before filter ones,
-however Cube Store always performs scans on sorted data, and if `GROUP BY`
-matches index ordering, merge sort-based algorithms are used for querying, which
-are usually much faster than hash-based group by in case of index ordering
-doesn't match the query. If in doubt, always use `EXPLAIN` and `EXPLAIN ANALYZE`
-in Cube Store to figure out the final query plan.
+- They cannot make use of any `filters` other than for dimensions that are
+  included in that index.
+- **All** dimensions used in the query must be defined in the aggregating
+  index.

-### Aggregated indexes
+Queries that do not have the characteristics above can still make use of
+regular indexes so that their performance can still be optimized.

-Aggregated indexes can be defined as well. You can read more about them
-[here][ref-schema-ref-preaggs-index].
+**In other words, an aggregating index is a rollup of data in a rollup table.**
+Data needs to be downloaded from the upstream data source as many times as
+you have pre-aggregations. Compared to having multiple pre-aggregations,
+having a single pre-aggregation with multiple aggregating indexes gives you
+pretty much the same performance on the Cube Store side at a fraction of the
+cost on the data warehouse side.

-Example:
+Aggregating indexes are defined by using the [`type` option][ref-ref-index-type]
+in the index definition:

 <CodeTabs>
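
As a sketch of what such a definition generally looks like (the cube and member names are assumptions, reusing the hypothetical `orders` cube from above), an index is marked as aggregating with the `type: aggregate` option:

```yaml
cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: orders_by_product
        measures:
          - CUBE.order_total
        dimensions:
          - CUBE.product_category
          - CUBE.product_name
          - CUBE.zip_code
        indexes:
          # Aggregating index: stores only zip_code plus the pre-aggregated
          # order_total measure
          - name: zip_code_index
            columns:
              - CUBE.zip_code
            type: aggregate
```

A query that groups by `zip_code` and only aggregates `order_total` could then be served entirely from this index, which matches the shape of the `zip_code_index` data shown below.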
@@ -512,20 +550,20 @@ cubes:
 </CodeTabs>

-And the data for `zip_code_index` would look like the following:
+The data for `zip_code_index` would look as follows:

 | zip_code | order_total |
 | -------- | ----------- |
 | 88523    | 3800        |
 | 88524    | 5000        |

-## Inspecting Pre-Aggregations
+## Inspecting pre-aggregations

 Cube Store partially supports the MySQL protocol. This allows you to execute
 simple queries using a familiar SQL syntax. You can connect using the MySQL CLI
 client, for example:

-```bash{promptUser: user}
+```bash
 mysql -h <CUBESTORE_IP> --user=cubestore -pcubestore
 ```
@@ -558,7 +596,7 @@ SELECT * FROM information_schema.tables;
 These pre-aggregations are stored as Parquet files under the `.cubestore/`
 folder in the project root during development.

-### EXPLAIN queries
+### `EXPLAIN` queries

 Cube Store's MySQL protocol also supports `EXPLAIN` and `EXPLAIN ANALYZE`
 queries both of which are useful for determining how much processing a query
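
For illustration, an `EXPLAIN` statement can be issued from the same MySQL client session against a pre-aggregation table. The table name below is hypothetical; actual names can be listed with `SELECT * FROM information_schema.tables` first:

```sql
-- Hypothetical pre-aggregation table name; look up real names in
-- information_schema.tables before running this
EXPLAIN SELECT zip_code, SUM(order_total)
FROM dev_pre_aggregations.orders_by_product
GROUP BY zip_code;
```

Swapping `EXPLAIN` for `EXPLAIN ANALYZE` additionally executes the query, which is useful for seeing how the plan behaves in practice.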
@@ -610,7 +648,7 @@ Sometimes, there can be exceptions to this rule.
 For example, a total count query run on top of the index will perform `HashAggregate` strategy on top of `MergeSort` nodes even if all required indexes are in place.
 This query would be optimal as well.

-## Pre-Aggregations Storage
+## Pre-aggregations storage

 The default pre-aggregations storage in Cube is its own purpose-built storage
 layer: Cube Store.
@@ -800,7 +838,7 @@ With all of the above set up, making a query such as the following will now use