
Commit 7e66e1e

Merge pull request #3418 from ClickHouse/link_to_architecture_sections
Links to corresponding architecture deep dive sections.
2 parents bc52f5e + b77d217 · commit 7e66e1e

2 files changed: 18 additions and 3 deletions

docs/concepts/why-clickhouse-is-so-fast.md

Lines changed: 16 additions & 1 deletion
@@ -21,12 +21,18 @@ To avoid that too many parts accumulate, ClickHouse runs a [merge](/merges) oper
This approach has several advantages: All data processing can be [offloaded to background part merges](/concepts/why-clickhouse-is-so-fast#storage-layer-merge-time-computation), keeping data writes lightweight and highly efficient. Individual inserts are "local" in the sense that they do not need to update global, i.e. per-table data structures. As a result, multiple simultaneous inserts need no mutual synchronization or synchronization with existing table data, and thus inserts can be performed almost at the speed of disk I/O.

+the holistic performance optimization section of the VLDB paper.
+
+🤿 Deep dive into this in the [On-Disk Format](/docs/academic_overview#3-1-on-disk-format) section of the web version of our VLDB 2024 paper.
+
## Storage Layer: Concurrent inserts and selects are isolated {#storage-layer-concurrent-inserts-and-selects-are-isolated}

<iframe width="768" height="432" src="https://www.youtube.com/embed/dvGlPh2bJFo?si=F3MSALPpe0gAoq5k" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

Inserts are fully isolated from SELECT queries, and merging inserted data parts happens in the background without affecting concurrent queries.

+🤿 Deep dive into this in the [Storage Layer](/docs/academic_overview#3-storage-layer) section of the web version of our VLDB 2024 paper.
+
## Storage Layer: Merge-time computation {#storage-layer-merge-time-computation}

<iframe width="768" height="432" src="https://www.youtube.com/embed/_w3zQg695c0?si=g0Wa_Petn-LcmC-6" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
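As an illustrative aside (not part of the diff): the paragraph on lightweight inserts above describes the MergeTree write path. A minimal sketch in ClickHouse SQL, using a hypothetical `hits` table — each INSERT creates a new, self-contained part, and background merges combine parts later:

```sql
-- Minimal sketch (hypothetical table and data) of the insert path described above.
CREATE TABLE hits
(
    user_id UInt64,
    url     String,
    ts      DateTime
)
ENGINE = MergeTree
ORDER BY (user_id, ts);

-- Each INSERT writes a new, self-contained part; no global per-table structure
-- is updated, so concurrent inserts need no synchronization with each other
-- or with running SELECTs.
INSERT INTO hits VALUES (1, 'https://clickhouse.com', now());
INSERT INTO hits VALUES (2, 'https://example.org', now());

-- Parts created so far (background merges will combine them over time):
SELECT name, rows FROM system.parts WHERE table = 'hits' AND active;
```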
@@ -45,6 +51,8 @@ On the one hand, user queries may become significantly faster, sometimes by 1000
On the other hand, the majority of the runtime of merges is consumed by loading the input parts and saving the output part. The additional effort to transform the data during merge does usually not impact the runtime of merges too much. All of this magic is completely transparent and does not affect the result of queries (besides their performance).

+🤿 Deep dive into this in the [Merge-time Data Transformation](/docs/academic_overview#3-3-merge-time-data-transformation) section of the web version of our VLDB 2024 paper.
+
## Storage Layer: Data pruning {#storage-layer-data-pruning}

<iframe width="768" height="432" src="https://www.youtube.com/embed/UJpVAx7o1aY?si=w-AfhBcRIO-e3Ysj" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
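A hedged aside to the merge-time computation paragraph above: one concrete example (table and data hypothetical) is a SummingMergeTree, where rows sharing the same sorting key are folded together during background merges rather than at insert time.

```sql
-- Sketch of merge-time computation: rows with the same (day, page) key
-- are summed when parts are merged in the background.
CREATE TABLE daily_clicks
(
    day    Date,
    page   String,
    clicks UInt64
)
ENGINE = SummingMergeTree
ORDER BY (day, page);

-- Raw rows are written cheaply; the summation work happens during merges.
INSERT INTO daily_clicks VALUES ('2024-08-01', '/home', 3);
INSERT INTO daily_clicks VALUES ('2024-08-01', '/home', 5);

-- Until all parts are merged, queries should still aggregate explicitly:
SELECT day, page, sum(clicks) FROM daily_clicks GROUP BY day, page;
```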
@@ -59,6 +67,8 @@ In practice, many queries are repetitive, i.e., run unchanged or only with sligh
All three techniques aim to skip as many rows during full-column reads as possible because the fastest way to read data is to not read it at all.

+🤿 Deep dive into this in the [Data Pruning](/docs/academic_overview#3-2-data-pruning) section of the web version of our VLDB 2024 paper.
+
## Storage Layer: Data compression {#storage-layer-data-compression}

<iframe width="768" height="432" src="https://www.youtube.com/embed/MH10E3rVvnM?si=duWmS_OatCLx-akH" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
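To make the data-pruning paragraph above concrete, a small hedged sketch (table and query hypothetical): with a sorting key starting with `site_id`, the sparse primary index lets ClickHouse skip granules whose key range cannot match the filter.

```sql
-- Sketch of primary-key pruning with a hypothetical events table.
CREATE TABLE events
(
    site_id UInt32,
    ts      DateTime,
    body    String
)
ENGINE = MergeTree
ORDER BY (site_id, ts);

-- Only granules whose key range can contain site_id = 42 are read.
SELECT count() FROM events WHERE site_id = 42;

-- EXPLAIN with indexes = 1 shows which parts/granules the primary index selected.
EXPLAIN indexes = 1
SELECT count() FROM events WHERE site_id = 42;
```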
@@ -71,6 +81,8 @@ Users can [specify](https://clickhouse.com/blog/optimize-clickhouse-codecs-compr
Data compression not only reduces the storage size of the database tables, but in many cases, it also improves query performance as local disks and network I/O are often constrained by low throughput.

+🤿 Deep dive into this in the [On-Disk Format](/docs/academic_overview#3-1-on-disk-format) section of the web version of our VLDB 2024 paper.
+
## State-of-the-art query processing layer {#state-of-the-art-query-processing-layer}

Finally, ClickHouse uses a vectorized query processing layer that parallelizes query execution as much as possible to utilize all resources for maximum speed and efficiency.
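Relating to the compression paragraph above, a hedged example of per-column codecs (column names and codec choices are illustrative): codecs can be chained, e.g. delta coding followed by general-purpose compression.

```sql
-- Sketch of per-column compression codecs on a hypothetical metrics table.
CREATE TABLE metrics
(
    ts    DateTime CODEC(Delta, ZSTD),   -- delta-code timestamps, then compress with ZSTD
    value Float64  CODEC(Gorilla),       -- specialized codec for floating-point data
    note  String   CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY ts;

-- Compressed vs. uncompressed size per column can be inspected afterwards:
SELECT name,
       formatReadableSize(data_compressed_bytes)   AS compressed,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE table = 'metrics';
```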
@@ -81,6 +93,8 @@ Modern systems have dozens of CPU cores. To utilize all cores, ClickHouse unfold
If a single node becomes too small to hold the table data, further nodes can be added to form a cluster. Tables can be split ("sharded") and distributed across the nodes. ClickHouse will run queries on all nodes that store table data and thereby scale "horizontally" with the number of available nodes.

+🤿 Deep dive into this in the [Query Processing Layer](/academic_overview#4-query-processing-layer) section of the web version of our VLDB 2024 paper.
+
## Meticulous attention to detail {#meticulous-attention-to-detail}

> **"ClickHouse is a freak system - you guys have 20 versions of a hash table. You guys have all these amazing things where most systems will have one hash table** **** **ClickHouse has this amazing performance because it has all these specialized components"** [Andy Pavlo, Database Professor at CMU](https://www.youtube.com/watch?v=Vy2t_wZx4Is&t=3579s)
@@ -109,14 +123,15 @@ The [hash table implementation in ClickHouse](https://clickhouse.com/blog/hash-t
Algorithms that rely on data characteristics often perform better than their generic counterparts. If the data characteristics are not known in advance, the system can try various implementations and choose the one that works best at runtime. For an example, see the [article on how LZ4 decompression is implemented in ClickHouse](https://habr.com/en/company/yandex/blog/457612/).

+🤿 Deep dive into this in the [Holistic Performance Optimization](/academic_overview#4-4-holistic-performance-optimization) section of the web version of our VLDB 2024 paper.

## VLDB 2024 paper {#vldb-2024-paper}

In August 2024, we had our first research paper accepted and published at VLDB.
VLDB in an international conference on very large databases, and is widely regarded as one of the leading conferences in the field of data management.
Among the hundreds of submissions, VLDB generally has an acceptance rate of ~20%.

-You can read a [PDF of the paper](https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf), which gives a concise description of ClickHouse's most interesting architectural and system design components that make it so fast.
+You can read a [PDF of the paper](https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf) or our [web version](/docs/academic_overview) of it, which gives a concise description of ClickHouse's most interesting architectural and system design components that make it so fast.

Alexey Milovidov, our CTO and the creator of ClickHouse, presented the paper (slides [here](https://raw.githubusercontent.com/ClickHouse/clickhouse-presentations/master/2024-vldb/VLDB_2024_presentation.pdf)), followed by a Q&A (that quickly ran out of time!).
You can catch the recorded presentation here:

docs/managing-data/core-concepts/academic_overview.md

Lines changed: 2 additions & 2 deletions
@@ -88,7 +88,7 @@ Figure 3: Inserts and merges for MergeTree*-engine tables.
Compared to LSM trees [\[58\]](#page-13-7) and their implementation in various databases [\[13,](#page-12-6) [26,](#page-12-7) [56\]](#page-13-8), ClickHouse treats all parts as equal instead of arranging them in a hierarchy. As a result, merges are no longer limited to parts in the same level. Since this also forgoes the implicit chronological ordering of parts, alternative mechanisms for updates and deletes not based on tombstones are required (see Section [3.4)](#page-4-0). ClickHouse writes inserts directly to disk while other LSM-treebased stores typically use write-ahead logging (see Section [3.7)](#page-5-1).

-A part corresponds to a directory on disk, containing one fle for each column. As an optimization, the columns of a small part (smaller than 10 MB by default) are stored consecutively in a single fle to increase the spatial locality for reads and writes. The rows of a part are further logically divided into groups of 8192 records, called granules. A granule represents the smallest indivisible data unit processed by the scan and index lookup operators in ClickHouse. Reads and writes of on-disk data are, however, not performed at the granule level but at the granularity of blocks, which combine multiple neighboring granules within a column. New blocks are formed based on a configurable byte size per block (by default 1 MB), i.e., the number of granules in a block is variable and depends on the column's data type and distribution. Blocks are furthermore compressed to reduce their size and I/O costs. By default, ClickHouse employs LZ4 [\[75\]](#page-13-9) as a general-purpose compression algorithm, but users can also specify specialized codecs like Gorilla [\[63\]](#page-13-10) or FPC [\[12\]](#page-12-8) for floating-point data. Compression algorithms can also be chained. For example, it is possible to first reduce logical redundancy in numeric values using delta coding [\[23\]](#page-12-9), then perform heavy-weight compression, and finally encrypt the data using an AES codec. Blocks are decompressed on-the-fy when they are loaded from disk into memory. To enable fast random access to individual granules despite compression, ClickHouse additionally stores for each column a mapping that associates every granule id with the offset of its containing compressed block in the column fle and the offset of the granule in the uncompressed block.
+A part corresponds to a directory on disk, containing one file for each column. As an optimization, the columns of a small part (smaller than 10 MB by default) are stored consecutively in a single file to increase the spatial locality for reads and writes. The rows of a part are further logically divided into groups of 8192 records, called granules. A granule represents the smallest indivisible data unit processed by the scan and index lookup operators in ClickHouse. Reads and writes of on-disk data are, however, not performed at the granule level but at the granularity of blocks, which combine multiple neighboring granules within a column. New blocks are formed based on a configurable byte size per block (by default 1 MB), i.e., the number of granules in a block is variable and depends on the column's data type and distribution. Blocks are furthermore compressed to reduce their size and I/O costs. By default, ClickHouse employs LZ4 [\[75\]](#page-13-9) as a general-purpose compression algorithm, but users can also specify specialized codecs like Gorilla [\[63\]](#page-13-10) or FPC [\[12\]](#page-12-8) for floating-point data. Compression algorithms can also be chained. For example, it is possible to first reduce logical redundancy in numeric values using delta coding [\[23\]](#page-12-9), then perform heavy-weight compression, and finally encrypt the data using an AES codec. Blocks are decompressed on-the-fy when they are loaded from disk into memory. To enable fast random access to individual granules despite compression, ClickHouse additionally stores for each column a mapping that associates every granule id with the offset of its containing compressed block in the column file and the offset of the granule in the uncompressed block.

Columns can further be dictionary-encoded [\[2,](#page-12-10) [77,](#page-13-11) [81\]](#page-13-12) or made nullable using two special wrapper data types: LowCardinality(T) replaces the original column values by integer ids and thus significantly reduces the storage overhead for data with few unique values. Nullable(T) adds an internal bitmap to column T, representing whether column values are NULL or not.
@@ -336,7 +336,7 @@ The most similar databases to ClickHouse, in terms of goals and design principle
Snowfake [\[22\]](#page-12-37) is a popular proprietary cloud data warehouse based on a shared-disk architecture. Its approach of dividing tables into micro-partitions is similar to the concept of parts in ClickHouse. Snowfake uses hybrid PAX pages [\[3\]](#page-12-41) for persistence, whereas ClickHouse's storage format is strictly columnar. Snowfake also emphasizes local caching and data pruning using automatically created lightweight indexes [\[31,](#page-12-13) [51\]](#page-13-14) as a source for good performance. Similar to primary keys in ClickHouse, users may optionally create clustered indexes to co-locate data with the same values.

-Photon [\[5\]](#page-12-39) and Velox [\[62\]](#page-13-32) are query execution engines designed to be used as components in complex data management systems. Both systems are passed query plans as input, which are then executed on the local node over Parquet (Photon) or Arrow (Velox) files [\[46\]](#page-13-34). ClickHouse is able to consume and generate data in these generic formats but prefers its native fle format for storage. While Velox and Photon do not optimize the query plan (Velox performs basic expression optimizations), they utilize runtime adaptivity techniques, such as dynamically switching compute kernels depending on the data characteristics. Similarly, plan operators in ClickHouse
+Photon [\[5\]](#page-12-39) and Velox [\[62\]](#page-13-32) are query execution engines designed to be used as components in complex data management systems. Both systems are passed query plans as input, which are then executed on the local node over Parquet (Photon) or Arrow (Velox) files [\[46\]](#page-13-34). ClickHouse is able to consume and generate data in these generic formats but prefers its native file format for storage. While Velox and Photon do not optimize the query plan (Velox performs basic expression optimizations), they utilize runtime adaptivity techniques, such as dynamically switching compute kernels depending on the data characteristics. Similarly, plan operators in ClickHouse

can create other operators at runtime, primarily to switch to external aggregation or join operators, based on the query memory consumption. The Photon paper notes that code-generating designs [\[38,](#page-12-22) [41,](#page-12-42) [53\]](#page-13-0) are harder to develop and debug than interpreted vectorized designs [\[11\]](#page-12-0). The (experimental) support for code generation in Velox builds and links a shared library produced from runtime-generated C++ code, whereas ClickHouse interacts directly with LLVM's on-request compilation API.