
Commit db83dd9

Adding additional terms, applying tooltips to core concepts, converting to mdx
1 parent f1371bb commit db83dd9

File tree

21 files changed, +246 −179 lines changed


docs/best-practices/partitioning_keys.mdx

Lines changed: 8 additions & 8 deletions

@@ -12,12 +12,12 @@ import partitions from '@site/static/images/bestpractices/partitions.png';
import merges_with_partitions from '@site/static/images/bestpractices/merges_with_partitions.png';

:::note A data management technique
-Partitioning is primarily a data management technique and not a query optimization tool, and while it can improve performance in specific workloads, it should not be the first mechanism used to accelerate queries; the partitioning key must be chosen carefully, with a clear understanding of its implications, and only applied when it aligns with data life cycle needs or well-understood access patterns.
+Partitioning is primarily a data management technique and not a query optimization tool, and while it can improve performance in specific workloads, it should not be the first mechanism used to accelerate queries; the ^^partitioning key^^ must be chosen carefully, with a clear understanding of its implications, and only applied when it aligns with data life cycle needs or well-understood access patterns.
:::

In ClickHouse, partitioning organizes data into logical segments based on a specified key. This is defined using the `PARTITION BY` clause at table creation time and is commonly used to group rows by time intervals, categories, or other business-relevant dimensions. Each unique value of the partitioning expression forms its own physical partition on disk, and ClickHouse stores data in separate ^^parts^^ for each of these values. Partitioning improves data management, simplifies retention policies, and can help with certain query patterns.

-For example, consider the following UK price paid dataset table with a partitioning key of `toStartOfMonth(date)`.
+For example, consider the following UK price paid dataset table with a ^^partitioning key^^ of `toStartOfMonth(date)`.

```sql
CREATE TABLE uk.uk_price_paid_simple_partitioned
@@ -46,22 +46,22 @@ With partitioning enabled, ClickHouse only [merges](/merges) data ^^parts^^ with

## Applications of partitioning {#applications-of-partitioning}

-Partitioning is a powerful tool for managing large datasets in ClickHouse, especially in observability and analytics use cases. It enables efficient data life cycle operations by allowing entire partitions, often aligned with time or business logic, to be dropped, moved, or archived in a single metadata operation. This is significantly faster and less resource-intensive than row-level delete or copy operations. Partitioning also integrates cleanly with ClickHouse features like TTL and tiered storage, making it possible to implement retention policies or hot/cold storage strategies without custom orchestration. For example, recent data can be kept on fast SSD-backed storage, while older partitions are automatically moved to cheaper object storage.
+Partitioning is a powerful tool for managing large datasets in ClickHouse, especially in observability and analytics use cases. It enables efficient data life cycle operations by allowing entire partitions, often aligned with time or business logic, to be dropped, moved, or archived in a single metadata operation. This is significantly faster and less resource-intensive than row-level delete or copy operations. Partitioning also integrates cleanly with ClickHouse features like ^^TTL^^ and tiered storage, making it possible to implement retention policies or hot/cold storage strategies without custom orchestration. For example, recent data can be kept on fast SSD-backed storage, while older partitions are automatically moved to cheaper object storage.

While partitioning can improve query performance for some workloads, it can also negatively impact response time.

-If the partitioning key is not in the primary key and you are filtering by it, users may see an improvement in query performance with partitioning. See [here](/partitions#query-optimization) for an example.
+If the ^^partitioning key^^ is not in the ^^primary key^^ and you are filtering by it, users may see an improvement in query performance with partitioning. See [here](/partitions#query-optimization) for an example.

Conversely, if queries need to query across partitions, performance may be negatively impacted due to a higher number of total ^^parts^^. For this reason, users should understand their access patterns before considering partitioning as a query optimization technique.

In summary, users should primarily think of partitioning as a data management technique. For an example of managing data, see ["Managing Data"](/observability/managing-data) from the observability use-case guide and ["What are table partitions used for?"](/partitions#data-management) from Core Concepts - Table partitions.

-## Choose a low cardinality partitioning key {#choose-a-low-cardinality-partitioning-key}
+## Choose a low cardinality ^^partitioning key^^ {#choose-a-low-cardinality-partitioning-key}

Importantly, a higher number of ^^parts^^ will negatively affect query performance. ClickHouse will therefore respond to inserts with a [“too many parts”](/knowledgebase/exception-too-many-parts) error if the number of ^^parts^^ exceeds specified limits either in [total](/operations/settings/merge-tree-settings#max_parts_in_total) or [per partition](/operations/settings/merge-tree-settings#parts_to_throw_insert).

-Choosing the right **cardinality** for the partitioning key is critical. A high-cardinality partitioning key - where the number of distinct partition values is large - can lead to a proliferation of data ^^parts^^. Since ClickHouse does not merge ^^parts^^ across partitions, too many partitions will result in too many unmerged ^^parts^^, eventually triggering the “Too many ^^parts^^” error. [Merges are essential](/merges) for reducing storage fragmentation and optimizing query speed, but with high-cardinality partitions, that merge potential is lost.
+Choosing the right **cardinality** for the ^^partitioning key^^ is critical. A high-cardinality ^^partitioning key^^ - where the number of distinct partition values is large - can lead to a proliferation of data ^^parts^^. Since ClickHouse does not merge ^^parts^^ across partitions, too many partitions will result in too many unmerged ^^parts^^, eventually triggering the “Too many ^^parts^^” error. [Merges are essential](/merges) for reducing storage fragmentation and optimizing query speed, but with high-cardinality partitions, that merge potential is lost.

-By contrast, a **low-cardinality partitioning key**—with fewer than 100 - 1,000 distinct values - is usually optimal. It enables efficient part merging, keeps metadata overhead low, and avoids excessive object creation in storage. In addition, ClickHouse automatically builds MinMax indexes on partition columns, which can significantly speed up queries that filter on those columns. For example, filtering by month when the table is partitioned by `toStartOfMonth(date)` allows the engine to skip irrelevant partitions and their ^^parts^^ entirely.
+By contrast, a **low-cardinality ^^partitioning key^^**—with fewer than 100 - 1,000 distinct values - is usually optimal. It enables efficient part merging, keeps metadata overhead low, and avoids excessive object creation in storage. In addition, ClickHouse automatically builds MinMax indexes on partition columns, which can significantly speed up queries that filter on those columns. For example, filtering by month when the table is partitioned by `toStartOfMonth(date)` allows the engine to skip irrelevant partitions and their ^^parts^^ entirely.

-While partitioning can improve performance in some query patterns, it's primarily a data management feature. In many cases, querying across all partitions can be slower than using a non-partitioned table due to increased data fragmentation and more ^^parts^^ being scanned. Use partitioning judiciously, and always ensure that the chosen key is low-cardinality and aligns with your data life cycle policies (e.g., retention via TTL). If you're unsure whether partitioning is necessary, you may want to start without it and optimize later based on observed access patterns.
+While partitioning can improve performance in some query patterns, it's primarily a data management feature. In many cases, querying across all partitions can be slower than using a non-partitioned table due to increased data fragmentation and more ^^parts^^ being scanned. Use partitioning judiciously, and always ensure that the chosen key is low-cardinality and aligns with your data life cycle policies (e.g., retention via ^^TTL^^). If you're unsure whether partitioning is necessary, you may want to start without it and optimize later based on observed access patterns.
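The page changed above is built around the UK price paid example partitioned by `toStartOfMonth(date)`. As a rough sketch of the pattern it describes (the column list and `ORDER BY` below are illustrative assumptions, not the exact table definition from the docs), a monthly-partitioned table and the kind of single-partition management operation the text refers to could look like this:

```sql
-- Illustrative sketch: columns and ORDER BY are assumptions, not the
-- exact definition used in the docs example.
CREATE TABLE uk.uk_price_paid_simple_partitioned
(
    date   Date,
    town   LowCardinality(String),
    street LowCardinality(String),
    price  UInt32
)
ENGINE = MergeTree
PARTITION BY toStartOfMonth(date)  -- low-cardinality key: one partition per month
ORDER BY (town, street);

-- Data management rather than query tuning: removing a whole month is a
-- single metadata operation instead of a row-level delete.
ALTER TABLE uk.uk_price_paid_simple_partitioned DROP PARTITION '2023-12-01';
```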

docs/concepts/glossary.md

Lines changed: 75 additions & 3 deletions
@@ -11,28 +11,100 @@ slug: /concepts/glossary

## Atomicity {#atomicity}

-Atomicity ensures that a transaction (a series of database operations) is treated as a single, indivisible unit. This means that either all operations within the transaction occur, or none do. An example of an atomic transaction is transferring money from one bank account to another. If either step of the transfer fails, the transaction fails, and the money stays in the first account. Atomicity ensures no money is lost or created.
+Atomicity ensures that a transaction (a series of database operations) is treated as a single, indivisible unit. This means that either all operations within the transaction occur, or none do. An example of an atomic transaction is transferring money from one bank account to another. If either step of the transfer fails, the transaction fails, and the money stays in the first account. Atomicity ensures no money is lost or created.
+
+## Block {#block}
+
+A block is a logical unit for organizing data processing and storage. Each block contains columnar data which is processed together to enhance performance during query execution. By processing data in blocks, ClickHouse utilizes CPU cores efficiently by minimizing cache misses and facilitating vectorized execution. ClickHouse uses various compression algorithms, such as LZ4, ZSTD, and Delta, to compress data in blocks.

## Cluster {#cluster}

A collection of nodes (servers) that work together to store and process data.

## CMEK {#cmek}

-Customer-managed encryption keys (CMEK) allow customers to use their key-management service (KMS) key to encrypt the ClickHouse disk data key and protect their data at rest.
+Customer-managed encryption keys (CMEK) allow customers to use their key-management service (KMS) key to encrypt the ClickHouse disk data key and protect their data at rest.

## Dictionary {#dictionary}

A dictionary is a mapping of key-value pairs that is useful for various types of reference lists. It is a powerful feature that allows for the efficient use of dictionaries in queries, which is often more efficient than using a `JOIN` with reference tables.

+## Distributed table {#distributed-table}
+
+A distributed table in ClickHouse is a special type of table that does not store data itself but provides a unified view for distributed query processing across multiple servers in a cluster.
+
+## Granule {#granule}
+
+A granule is a batch of rows in an uncompressed block. When reading data, ClickHouse accesses granules, but not individual rows, which enables faster data processing in analytical workloads. A granule contains 8192 rows by default. The primary index contains one entry per granule.
+
+## Incremental materialized view {#incremental-materialized-view}
+
+An incremental materialized view in ClickHouse is a type of materialized view that processes and aggregates data at insert time. When new data is inserted into the source table, the materialized view executes a predefined SQL aggregation query only on the newly inserted blocks and writes the aggregated results to a target table.
+
+## Lightweight update {#lightweight-update}
+
+A lightweight update in ClickHouse is an experimental feature that allows you to update rows in a table using standard SQL UPDATE syntax, but instead of rewriting entire columns or data parts (as with traditional mutations), it creates "patch parts" containing only the updated columns and rows. These updates are immediately visible in SELECT queries through patch application, but the physical data is only updated during subsequent merges.
+
+## Materialized view {#materialized-view}
+
+A materialized view in ClickHouse is a mechanism that automatically runs a query on data as it is inserted into a source table, storing the transformed or aggregated results in a separate target table for faster querying.
+
+## MergeTree {#mergetree}
+
+A MergeTree in ClickHouse is a table engine designed for high data ingest rates and large data volumes. It is the core storage engine in ClickHouse, providing features such as columnar storage, custom partitioning, sparse primary indexes, and support for background data merges.
+
+## Mutation {#mutation}
+
+A mutation in ClickHouse refers to an operation that modifies or deletes existing data in a table, typically using commands like ALTER TABLE ... UPDATE or ALTER TABLE ... DELETE. Mutations are implemented as asynchronous background processes that rewrite entire data parts affected by the change, rather than modifying rows in place.
+
+## On-the-fly mutation {#on-the-fly-mutation}
+
+On-the-fly mutations in ClickHouse are a mechanism that allows updates or deletes to be visible in subsequent SELECT queries immediately after the mutation is submitted, without waiting for the background mutation process to finish.
+
## Parts {#parts}

A physical file on a disk that stores a portion of the table's data. This is different from a partition, which is a logical division of a table's data that is created using a partition key.

+## Partitioning key {#partitioning-key}
+
+A partitioning key in ClickHouse is a SQL expression defined in the PARTITION BY clause when creating a table. It determines how data is logically grouped into partitions on disk. Each unique value of the partitioning key forms its own physical partition, allowing for efficient data management operations such as dropping, moving, or archiving entire partitions.
+
+## Primary key {#primary-key}
+
+In ClickHouse, a primary key determines the order in which data is stored on disk and is used to build a sparse index that speeds up query filtering. Unlike traditional databases, the primary key in ClickHouse does not enforce uniqueness—multiple rows can have the same primary key value.
+
+## Projection {#projection}
+
+A projection in ClickHouse is a hidden, automatically maintained table that stores data in a different order or with precomputed aggregations to speed up queries, especially those filtering on columns not in the main primary key.
+
+## Refreshable materialized view {#refreshable-materialized-view}
+
+A refreshable materialized view is a type of materialized view that periodically re-executes its query over the full dataset and stores the result in a target table. Unlike incremental materialized views, refreshable materialized views are updated on a schedule and can support complex queries, including JOINs and UNIONs, without restrictions.
+
## Replica {#replica}

A copy of the data stored in a ClickHouse database. You can have any number of replicas of the same data for redundancy and reliability. Replicas are used in conjunction with the ReplicatedMergeTree table engine, which enables ClickHouse to keep multiple copies of data in sync across different servers.

## Shard {#shard}

-A subset of data. ClickHouse always has at least one shard for your data. If you do not split the data across multiple servers, your data will be stored in one shard. Sharding data across multiple servers can be used to divide the load if you exceed the capacity of a single server.
+A subset of data. ClickHouse always has at least one shard for your data. If you do not split the data across multiple servers, your data will be stored in one shard. Sharding data across multiple servers can be used to divide the load if you exceed the capacity of a single server.
+
+## Skipping index {#skipping-index}
+
+Skipping indices are used to store small amounts of metadata at the level of multiple consecutive granules, which allows ClickHouse to avoid scanning irrelevant rows. Skipping indices provide a lightweight alternative to projections.
+
+## Sorting key {#sorting-key}
+
+In ClickHouse, a sorting key defines the physical order of rows on disk. If you do not specify a primary key, ClickHouse uses the sorting key as the primary key. If you specify both, the primary key must be a prefix of the sorting key.
+
+## Sparse index {#sparse-index}
+
+A type of indexing in which the primary index contains one entry for a group of rows, rather than a single row. The entry that corresponds to a group of rows is referred to as a mark. With sparse indexes, ClickHouse first identifies groups of rows that potentially match the query and then processes them separately to find a match. Because of this, the primary index is small enough to be loaded into memory.
+
+## Table engine {#table-engine}
+
+Table engines in ClickHouse determine how data is written, stored, and accessed. MergeTree is the most common table engine, and allows quick insertion of large amounts of data which get processed in the background.
+
+## TTL {#ttl}
+
+Time To Live (TTL) is a ClickHouse feature that automatically moves, deletes, or rolls up columns or rows after a certain time period. This allows you to manage storage more efficiently because you can delete, move, or archive the data that you no longer need to access frequently.
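Several of the entries added above (table engine, sorting key, partitioning key, TTL, incremental materialized view) come together in an ordinary table definition. A minimal sketch, with invented table, column, and view names, of how those pieces might be combined:

```sql
-- Illustrative sketch only: names are invented to show how several of
-- the glossary terms above fit together.
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    message    String
)
ENGINE = MergeTree                      -- table engine
PARTITION BY toYYYYMM(event_time)       -- low-cardinality partitioning key
ORDER BY (user_id, event_time)          -- sorting key, also used as the primary key here
TTL event_time + INTERVAL 90 DAY;       -- rows expire after 90 days

-- Target table plus an incremental materialized view: each inserted
-- block is aggregated at insert time and written to the target.
CREATE TABLE events_per_day
(
    day    Date,
    events UInt64
)
ENGINE = SummingMergeTree
ORDER BY day;

CREATE MATERIALIZED VIEW events_per_day_mv TO events_per_day AS
SELECT toDate(event_time) AS day, count() AS events
FROM events
GROUP BY day;
```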
