Commit cb9698e

docs: Updates to multi-cluster deployments, deployment warm-up, pre-aggregation index suggestion, auto-suspension intervals (#7716)
* docs: Production multi-cluster deployments (update)
* docs: Deployment warm-up
* docs: Pre-aggregation index suggestion
* docs: Update auto-suspension intervals
1 parent 2b7cc30 commit cb9698e

File tree

10 files changed

+467
-216
lines changed

docs/pages/product/caching.mdx

Lines changed: 34 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -31,7 +31,7 @@ We do not recommend changing the default **in-memory** caching configuration
3131
unless it is necessary. To speed up query performance, consider using
3232
**pre-aggregations**.
3333

34-
## Pre-Aggregations
34+
## Pre-aggregations
3535

3636
Pre-aggregations are a layer of aggregated data built and refreshed by Cube.
3737
It can dramatically improve the query performance and provide a higher
@@ -106,7 +106,7 @@ cube(`orders`, {
106106

107107
</CodeTabs>
108108

109-
## In-memory Cache
109+
## In-memory cache
110110

111111
Cube caches the results of executed queries using in-memory cache. The cache key
112112
is a generated SQL statement with any existing query-dependent pre-aggregations.
@@ -253,6 +253,34 @@ timestamp, and the time spent to build the pre-aggregation. You can also inspect
253253
every pre-aggregation's details: the list of queries it serves and all its
254254
versions.
255255

256+
### Cache type
257+
258+
Any query that is fulfilled by Cube will use one of the following cache types:
259+
260+
- **[Pre-aggregations](#pre-aggregations) in Cube Store.** This is the most
261+
advantageous and performant option.
262+
- **Pre-aggregations in Cube Store with a suboptimal query plan.** This cache
263+
type indicates that queries still benefit from pre-aggregations in Cube Store
264+
but it's possible to get a performance boost by [using indexes][ref-indexes].
265+
- **Pre-aggregations in the data source.** This cache type indicates that
266+
queries don't benefit from pre-aggregations in Cube Store and it's possible
267+
to get a massive performance boost by using Cube Store as [pre-aggregation
268+
storage][ref-storage].
269+
- **[In-memory cache.](#in-memory-cache)** This cache type indicates that
270+
queries don't benefit from pre-aggregations at all. Queries directly hit the
271+
upstream data source and in-memory cache is used to speed up the execution of
272+
identical queries that arrive within a short period of time.
273+
- **No cache.** This cache type indicates queries that directly hit the
274+
upstream data source and have the worst performance possible.
275+
276+
In [Query History][ref-query-history] and throughout Cube Cloud, colored bolt
277+
icons are used to indicate the cache type. Also, [Performance
278+
Insights][ref-perf-insights] provide an overview of API requests by specific
279+
cache types.
280+
281+
<Screenshot src="https://ucarecdn.com/cd63c899-3f0d-444d-9476-7d60001ff113/"/>
282+
283+
256284
[link-cube-cloud]: https://cube.dev/cloud
257285
[ref-config-preagg-schema]:
258286
/reference/configuration/config#preaggregationsschema
@@ -264,3 +292,7 @@ versions.
264292
[ref-schema-ref-cube-refresh-key]:
265293
/reference/data-model/cube#refresh_key
266294
[ref-schema-ref-preaggs]: /reference/data-model/pre-aggregations
295+
[ref-query-history]: /product/workspace/query-history#inspecting-api-queries
296+
[ref-perf-insights]: /product/workspace/performance#cache-type
297+
[ref-indexes]: /product/caching/using-pre-aggregations#using-indexes
298+
[ref-storage]: /product/caching/using-pre-aggregations#pre-aggregations-storage

docs/pages/product/caching/using-pre-aggregations.mdx

Lines changed: 105 additions & 64 deletions
@@ -10,7 +10,7 @@ configuration options to consider. Please make sure to also check [the
1010
Pre-Aggregations reference in the data modeling
1111
section][ref-schema-ref-preaggs].
1212

13-
## Refresh Strategy
13+
## Refresh strategy
1414

1515
Refresh strategy can be customized by setting the
1616
[`refresh_key`][ref-schema-ref-preaggs-refresh-key] property for the
@@ -144,7 +144,7 @@ When `every` and `sql` are used together, Cube will run the query from the `sql`
144144
property on an interval defined by the `every` property. If the query returns
145145
new results, then the pre-aggregation will be refreshed.
146146

147-
## Rollup Only Mode
147+
## Rollup-only mode
148148

149149
To make Cube _only_ serve requests from pre-aggregations, the
150150
[`CUBEJS_ROLLUP_ONLY`][ref-config-env-rolluponly] environment variable can be
@@ -240,47 +240,66 @@ Alternatively, if you want to explicitly introduce key partitioning, you can use
240240
Each orchestrator ID can use a different pre-aggregation schema, so you may define those based on the partitioning key you want to introduce.
241241
This technique, together with multi-router Cube Store approach, allows you to achieve linear scaling on the partitioning key of your choice.
242242

243-
## Using Indexes
243+
## Using indexes
244+
245+
[Indexes][ref-ref-indexes] are sorted copies of pre-aggregation data.
246+
247+
**When you define a pre-aggregation without any explicit indexes, the default
248+
index is created.** In this index, dimensions come first, time dimensions come
249+
second, and measures come last.
250+
251+
When you define additional indexes, you don't incur any additional costs on
252+
the data warehouse side. However, the pre-aggregation build time for a
253+
particular pre-aggregation increases with each index.
244254

245255
### When to use indexes?
246256

247-
When you define pre-aggregation without any indexes, the default index will be created.
248-
For the default index, dimensions come first, time dimensions come second, and measures come last.
249-
At query time, if the default index can't be selected for merge sort scan, then hash aggregation would be used.
250-
It usually means that the full table needs to be scanned to get query results.
251-
And it's usually no big deal if the pre-aggregation table is only several MB in size.
252-
Once you go over, indexes are usually required to achieve optimal performance.
253-
Especially if not all columns from pre-aggregation are used in a particular query.
254-
You can read more about indexes [here][ref-schema-ref-preaggs-index].
255-
256-
### Best Practices
257-
258-
To maximize performance, you can introduce an index per type of query so the set
259-
of dimensions used in the query overlap as much as possible with the ones
260-
defined in the index.
261-
As indexes are sorted copies of the data, you don't incur any additional costs on the data warehouse side, however, you multiply your build time for a given pre-aggregation with every index added.
262-
Measures are traditionally only used in indexes if you
263-
plan to filter a measured value and the cardinality of the possible values of
264-
the measure is low.
265-
266-
The order in which columns are specified in the index is **very** important;
257+
At query time, if the default index can't be selected for a merge sort scan,
258+
then a less performant hash aggregation would be used. It usually means that
259+
the full table needs to be scanned to get query results.
260+
261+
It usually doesn't make much difference if the pre-aggregation table is only
262+
several MBs in size. However, for larger pre-aggregations, indexes are usually
263+
required to achieve optimal performance, especially if not all dimensions from
264+
a pre-aggregation are used in a particular query.
265+
266+
### Best practices
267+
268+
Most pre-aggregations represent [additive][ref-additivity] rollups. For such
269+
rollups, **the rule of thumb is that, for most queries, there should be
270+
at least one index that makes a particular query scan a very small amount of
271+
data,** which makes it very fast. (There are exceptions to this rule like
272+
top-k queries or queries with only low selectivity range filters. Optimization
273+
for these use cases usually involves remodeling data and queries.)
274+
275+
To maximize performance, you can introduce an index per each query type so
276+
that the set of dimensions used in a query overlaps as much as possible with
277+
the set of dimensions in the index. Measures are usually only used in indexes
278+
if you plan to filter on a measure value and the cardinality of the possible
279+
values of the measure is low.
280+
281+
The order in which dimensions are specified in the index is **very** important;
267282
suboptimal ordering can lead to diminished performance. To improve the
268-
performance of an index the main thing to consider is the order of the columns
269-
defined in it.
283+
performance of an index, the main thing to consider is its order of dimensions.
284+
The rule of thumb for dimension order is as follows:
270285

271-
The key property of additive rollups is that for most queries, there's at least one index that makes a particular query scan very little amount of data which makes it very fast.
272-
There however exceptions to this rule like TopK queries, use of low selectivity range filters without high selectivity single value filters, etc.
273-
Optimization of those use cases usually should be handled by remodeling data and queries.
286+
- Dimensions used in high selectivity, single-value filters come first.
287+
- Dimensions used in `GROUP BY` come second.
288+
- Everything else used in the query comes at the end, including dimensions
289+
used in low selectivity, multiple-value filters.
274290

275-
The rule of thumb for index column order is:
291+
It might sound counter-intuitive to have dimensions used in `GROUP BY` before
292+
dimensions used in multiple-value filters. However, Cube Store always performs
293+
scans on sorted data, and if `GROUP BY` matches index ordering, merge
294+
sort-based algorithms are used for querying, which are usually much faster
295+
than hash-based `GROUP BY` in case index ordering doesn't match the query.
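
The difference between the two strategies can be illustrated with a minimal sketch. It is not Cube Store's implementation: `groupSumSorted` and `groupSumHashed` are hypothetical helpers, and the row values are made up. When rows arrive sorted by the `GROUP BY` key, each group can be emitted as soon as the key changes, so only one running total is held at a time; without that ordering, a hash table with an entry per group is needed.

```javascript
// Illustration only: hypothetical helpers, not Cube Store internals.

// Streaming aggregation over rows that are already sorted by `key`.
// A group is emitted as soon as the key changes.
function* groupSumSorted(rows, key, value) {
  let currentKey;
  let sum = 0;
  let started = false;
  for (const row of rows) {
    if (started && row[key] !== currentKey) {
      yield { [key]: currentKey, [value]: sum };
      sum = 0;
    }
    currentKey = row[key];
    sum += row[value];
    started = true;
  }
  if (started) yield { [key]: currentKey, [value]: sum };
}

// Hash-based aggregation: works for any row order but keeps every group
// in memory until the whole input has been read.
function groupSumHashed(rows, key, value) {
  const totals = new Map();
  for (const row of rows) {
    totals.set(row[key], (totals.get(row[key]) ?? 0) + row[value]);
  }
  return [...totals].map(([k, sum]) => ({ [key]: k, [value]: sum }));
}

// Made-up rows, sorted by zip_code.
const sortedRows = [
  { zip_code: "88523", order_total: 1 },
  { zip_code: "88523", order_total: 2 },
  { zip_code: "88524", order_total: 5 },
];

const streamed = [...groupSumSorted(sortedRows, "zip_code", "order_total")];
const hashed = groupSumHashed(sortedRows, "zip_code", "order_total");
console.log(streamed, hashed);
```

Both functions return the same groups for sorted input; the sketch only shows why a matching index order lets the cheaper streaming approach be used.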
276296

277-
- Single value filters come first
278-
- `GROUP BY` columns come second
279-
- Everything else used in the query comes afterward
297+
If in doubt, always [use `EXPLAIN` and `EXPLAIN ANALYZE`](#explain-queries)
298+
to figure out the final query plan.
280299

281-
**Example:**
300+
#### Example
282301

283-
Suppose you have a pre-aggregation that has millions of rows with the following
302+
Suppose you have a pre-aggregation that has millions of rows and the following
284303
structure:
285304

286305
| timestamp | product_name | product_category | zip_code | order_total |
@@ -291,7 +310,7 @@ structure:
291310
| 2023-01-01 11:00:00 | Keyboard | Electronics | 88524 | 2000 |
292311
| 2023-01-01 11:00:00 | Plastic Chair | Furniture | 88524 | 3000 |
293312

294-
The pre-aggregation code would look as follows:
313+
The pre-aggregation definition looks as follows:
295314

296315
<CodeTabs>
297316

@@ -352,10 +371,10 @@ cubes:
352371
353372
</CodeTabs>
354373
355-
You run the following query on a regular basis, with the only difference between
356-
queries being the filter values:
374+
You run the following query on a regular basis, with the only difference
375+
between queries being the filter values:
357376
358-
```JSON
377+
```json
359378
{
360379
"measures": [
361380
"orders.order_total"
@@ -397,15 +416,15 @@ queries being the filter values:
397416
}
398417
```
399418

400-
After running this on a dataset with millions of records you find that it's
401-
taking a long time to run, so you decide to add an index to target this specific
402-
query. Taking into account the best practices mentioned previously you should
403-
define an index as follows:
419+
After running this query on a dataset with millions of records, you find that
420+
it's taking too long to run, so you decide to add an index to target this
421+
specific query. Taking into account the best practices, you should define an
422+
index as follows:
404423

405424
<CodeTabs>
406425

407426
```javascript
408-
cube("orders", {
427+
cube(`orders`, {
409428
// ...
410429

411430
pre_aggregations: {
@@ -414,7 +433,11 @@ cube("orders", {
414433

415434
indexes: {
416435
category_productname_zipcode_index: {
417-
columns: [product_category, zip_code, product_name],
436+
columns: [
437+
product_category,
438+
zip_code,
439+
product_name
440+
],
418441
},
419442
},
420443
},
@@ -441,7 +464,15 @@ cubes:
441464
442465
</CodeTabs>
443466
444-
Then the data within `category_productname_zipcode_index` would look like:
467+
Here's why:
468+
469+
- The `product_category` dimension comes first as it's used in a single-value
470+
filter.
471+
- The `zip_code` dimension comes second as it's used in `GROUP BY`.
472+
- The `product_name` dimension comes last as it's used in a multiple-value
473+
filter.
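
The ordering rule of thumb can be written down as a small helper. This is a sketch under assumptions: `suggestIndexColumns` is a hypothetical function, not part of Cube, and the caller is expected to classify each dimension by how the query uses it.

```javascript
// Hypothetical helper, not a Cube API. It applies the rule of thumb:
// single-value filter dimensions, then GROUP BY dimensions, then the rest.
function suggestIndexColumns({ singleValueFilters = [], groupBy = [], other = [] }) {
  const seen = new Set();
  const ordered = [];
  for (const column of [...singleValueFilters, ...groupBy, ...other]) {
    // Keep the first (highest-priority) position if a column repeats.
    if (!seen.has(column)) {
      seen.add(column);
      ordered.push(column);
    }
  }
  return ordered;
}

// Classification of the dimensions used by the example query.
const columns = suggestIndexColumns({
  singleValueFilters: ["product_category"],
  groupBy: ["zip_code"],
  other: ["product_name"], // used in a multiple-value filter
});
console.log(columns);
```

For the example query, the helper returns the same column order as the index definition. The resulting query plan is still worth confirming with `EXPLAIN`.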
474+
475+
The data within `category_productname_zipcode_index` would look as follows:
445476

446477
| product_category | zip_code | product_name | timestamp | order_total |
447478
| ---------------- | -------- | ------------- | ------------------- | ----------- |
@@ -451,23 +482,30 @@ Then the data within `category_productname_zipcode_index` would look like:
451482
| Electronics | 88524 | Keyboard | 2023-01-01 11:00:00 | 2000 |
452483
| Furniture | 88524 | Plastic Chair | 2023-01-01 11:00:00 | 3000 |
453484

454-
`product_category` column comes first as it's a single value filter. Then
455-
`zip_code` as it's `GROUP BY` column. `product_name` comes last as it's a
456-
multiple value filter.
485+
### Aggregating indexes
486+
487+
Aggregating indexes can be defined as well. Such indexes contain **only**
488+
dimensions and pre-aggregated measures from the pre-aggregation definition.
489+
490+
Queries with the following characteristics can target aggregating indexes:
457491

458-
It might sound counter-intuitive to have `GROUP BY` columns before filter ones,
459-
however Cube Store always performs scans on sorted data, and if `GROUP BY`
460-
matches index ordering, merge sort-based algorithms are used for querying, which
461-
are usually much faster than hash-based group by in case of index ordering
462-
doesn't match the query. If in doubt, always use `EXPLAIN` and `EXPLAIN ANALYZE`
463-
in Cube Store to figure out the final query plan.
492+
- They cannot make use of any `filters` other than for dimensions that are
493+
included in that index.
494+
- **All** dimensions used in the query must be defined in the aggregating
495+
index.
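
The two conditions can be expressed as a simple check. This is a simplified sketch, not Cube's query-matching logic: `canTargetAggregatingIndex` is a hypothetical function, member names are assumed to be fully qualified on both sides, and time dimensions and nested `and`/`or` filters are ignored.

```javascript
// Hypothetical check, not Cube's actual matching logic.
// Returns true when every dimension and every filter member of the query
// is one of the dimensions included in the aggregating index.
function canTargetAggregatingIndex(query, indexDimensions) {
  const inIndex = new Set(indexDimensions);
  const dimensions = query.dimensions ?? [];
  const filterMembers = (query.filters ?? []).map((filter) => filter.member);
  return [...dimensions, ...filterMembers].every((member) => inIndex.has(member));
}

// Assumed index contents for illustration.
const indexDimensions = ["orders.zip_code"];

const byZipCode = {
  measures: ["orders.order_total"],
  dimensions: ["orders.zip_code"],
};

const filteredByProduct = {
  measures: ["orders.order_total"],
  dimensions: ["orders.zip_code"],
  filters: [
    { member: "orders.product_name", operator: "equals", values: ["Keyboard"] },
  ],
};

console.log(canTargetAggregatingIndex(byZipCode, indexDimensions));
console.log(canTargetAggregatingIndex(filteredByProduct, indexDimensions));
```

The second query filters on a dimension that is not part of the index, so under these assumptions it would have to rely on a regular index instead.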
464496

465-
### Aggregated indexes
497+
Queries that do not have the characteristics above can still make use of
498+
regular indexes so that their performance can still be optimized.
466499

467-
Aggregated indexes can be defined as well. You can read more about them
468-
[here][ref-schema-ref-preaggs-index].
500+
**In other words, an aggregating index is a rollup of data in a rollup table.**
501+
Data needs to be downloaded from the upstream data source once for every
502+
pre-aggregation you have. Compared to having multiple pre-aggregations,
503+
having a single pre-aggregation with multiple aggregating indexes gives you
504+
roughly the same performance on the Cube Store side at a fraction of the
505+
cost on the data warehouse side.
469506

470-
Example:
507+
Aggregating indexes are defined by using the [`type` option][ref-ref-index-type]
508+
in the index definition:
471509

472510
<CodeTabs>
473511

@@ -512,20 +550,20 @@ cubes:
512550
513551
</CodeTabs>
514552
515-
And the data for `zip_code_index` would look like the following:
553+
The data for `zip_code_index` would look as follows:
516554

517555
| zip_code | order_total |
518556
| -------- | ----------- |
519557
| 88523 | 3800 |
520558
| 88524 | 5000 |
521559

522-
## Inspecting Pre-Aggregations
560+
## Inspecting pre-aggregations
523561

524562
Cube Store partially supports the MySQL protocol. This allows you to execute
525563
simple queries using a familiar SQL syntax. You can connect using the MySQL CLI
526564
client, for example:
527565

528-
```bash{promptUser: user}
566+
```bash
529567
mysql -h <CUBESTORE_IP> --user=cubestore -pcubestore
530568
```
531569

@@ -558,7 +596,7 @@ SELECT * FROM information_schema.tables;
558596
These pre-aggregations are stored as Parquet files under the `.cubestore/`
559597
folder in the project root during development.
560598

561-
### EXPLAIN queries
599+
### `EXPLAIN` queries
562600

563601
Cube Store's MySQL protocol also supports `EXPLAIN` and `EXPLAIN ANALYZE`
564602
queries both of which are useful for determining how much processing a query
@@ -610,7 +648,7 @@ Sometimes, there can be exceptions to this rule.
610648
For example, a total count query run on top of the index will perform `HashAggregate` strategy on top of `MergeSort` nodes even if all required indexes are in place.
611649
This query would be optimal as well.
612650

613-
## Pre-Aggregations Storage
651+
## Pre-aggregations storage
614652

615653
The default pre-aggregations storage in Cube is its own purpose-built storage
616654
layer: Cube Store.
@@ -800,7 +838,7 @@ With all of the above set up, making a query such as the following will now use
800838
}
801839
```
802840

803-
## Pre-Aggregation Build Strategies
841+
## Pre-aggregation build strategies
804842

805843
<InfoBox>
806844

@@ -953,3 +991,6 @@ streaming engine.
953991
[self-batching]: #batching
954992
[self-export-bucket]: #export-bucket
955993
[wiki-partitioning]: https://en.wikipedia.org/wiki/Partition_(database)
994+
[ref-ref-indexes]: /reference/data-model/pre-aggregations#indexes
995+
[ref-additivity]: /product/caching/getting-started-pre-aggregations#additivity
996+
[ref-ref-index-type]: /reference/data-model/pre-aggregations#type-1

docs/pages/product/deployment/cloud.mdx

Lines changed: 3 additions & 1 deletion
@@ -35,7 +35,8 @@ In Cube Cloud, you can:
3535
API endpoints for the source code in the main branch, any other branch,
3636
or any user-specific [development mode][ref-dev-mode] branch.
3737
* Assign a [custom domain][ref-domains] to API endpoints of any deployment.
38-
* Review [performance insights][ref-performance] and fine-tune deployments for better [scalability][ref-scalability].
38+
* Review [performance insights][ref-performance], use the [deployment warm-up][ref-warmup],
39+
and fine-tune deployments for better [scalability][ref-scalability].
3940
* Set up account-wide [budgets][ref-budgets] to control resource consumption
4041
and use [auto-suspension][ref-auto-sus] to reduce resource consumption of
4142
non-production deployments.
@@ -51,6 +52,7 @@ In Cube Cloud, you can:
5152
[ref-cd]: /product/deployment/cloud/continuous-deployment
5253
[ref-dev-mode]: /product/workspace/dev-mode
5354
[ref-domains]: /product/deployment/cloud/custom-domains
55+
[ref-warmup]: /product/deployment/cloud/warm-up
5456
[ref-auto-sus]: /product/deployment/cloud/auto-suspension
5557
[ref-budgets]: /product/workspace/budgets
5658
[ref-performance]: /product/workspace/performance

docs/pages/product/deployment/cloud/_meta.js

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ module.exports = {
33
"deployment-types": "Deployment types",
44
"continuous-deployment": "Continuous deployment",
55
"custom-domains": "Custom domains",
6+
"warm-up": "Deployment warm-up",
67
"auto-suspension": "Auto-suspension",
78
"scalability": "Scalability",
89
"pricing": "Pricing",
