docs/concepts/why-clickhouse-is-so-fast.md (+16 −1: 16 additions, 1 deletion)
@@ -21,12 +21,18 @@ To avoid that too many parts accumulate, ClickHouse runs a [merge](/merges) oper
This approach has several advantages: All data processing can be [offloaded to background part merges](/concepts/why-clickhouse-is-so-fast#storage-layer-merge-time-computation), keeping data writes lightweight and highly efficient. Individual inserts are "local" in the sense that they do not need to update global, i.e. per-table data structures. As a result, multiple simultaneous inserts need no mutual synchronization or synchronization with existing table data, and thus inserts can be performed almost at the speed of disk I/O.

+ the holistic performance optimization section of the VLDB paper.
+
+ 🤿 Deep dive into this in the [On-Disk Format](/docs/academic_overview#3-1-on-disk-format) section of the web version of our VLDB 2024 paper.
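For illustration (the table and column names are hypothetical), each INSERT below creates its own part; `system.parts` shows these parts until a background merge combines them:

```sql
-- Hypothetical table; each INSERT creates its own immutable part on disk.
CREATE TABLE hits
(
    event_time DateTime,
    user_id    UInt64,
    url        String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

INSERT INTO hits VALUES (now(), 1, 'https://example.com/a');
INSERT INTO hits VALUES (now(), 2, 'https://example.com/b');

-- Each insert is visible as a separate active part until background merges combine them.
SELECT name, rows
FROM system.parts
WHERE database = currentDatabase() AND table = 'hits' AND active;

-- Merges run automatically in the background; OPTIMIZE merely forces one for demonstration.
OPTIMIZE TABLE hits FINAL;
```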
## Storage Layer: Concurrent inserts and selects are isolated {#storage-layer-concurrent-inserts-and-selects-are-isolated}
<iframewidth="768"height="432"src="https://www.youtube.com/embed/dvGlPh2bJFo?si=F3MSALPpe0gAoq5k"title="YouTube video player"frameborder="0"allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"referrerpolicy="strict-origin-when-cross-origin"allowfullscreen></iframe>
Inserts are fully isolated from SELECT queries, and merging inserted data parts happens in the background without affecting concurrent queries.
+
+ 🤿 Deep dive into this in the [Storage Layer](/docs/academic_overview#3-storage-layer) section of the web version of our VLDB 2024 paper.

<iframe width="768" height="432" src="https://www.youtube.com/embed/_w3zQg695c0?si=g0Wa_Petn-LcmC-6" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
@@ -45,6 +51,8 @@ On the one hand, user queries may become significantly faster, sometimes by 1000
On the other hand, the majority of the runtime of merges is consumed by loading the input parts and saving the output part. The additional effort to transform the data during a merge usually does not impact the runtime of merges too much. All of this magic is completely transparent and does not affect the result of queries (besides their performance).
+
+ 🤿 Deep dive into this in the [Merge-time Data Transformation](/docs/academic_overview#3-3-merge-time-data-transformation) section of the web version of our VLDB 2024 paper.
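For illustration, a minimal sketch of merge-time computation using a `SummingMergeTree` table (the schema is hypothetical); the per-key sums are computed when parts are merged rather than at insert time:

```sql
-- Hypothetical table: rows with the same sorting key are summed at merge time.
CREATE TABLE daily_clicks
(
    day    Date,
    page   String,
    clicks UInt64
)
ENGINE = SummingMergeTree
ORDER BY (day, page);

INSERT INTO daily_clicks VALUES ('2024-08-01', '/home', 3);
INSERT INTO daily_clicks VALUES ('2024-08-01', '/home', 5);

-- Force a merge for demonstration; the two rows collapse into one row with clicks = 8.
OPTIMIZE TABLE daily_clicks FINAL;

-- Still aggregate in the query, since not-yet-merged parts may hold partial sums.
SELECT day, page, sum(clicks) AS total_clicks
FROM daily_clicks
GROUP BY day, page;
```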
## Storage Layer: Data pruning {#storage-layer-data-pruning}
<iframewidth="768"height="432"src="https://www.youtube.com/embed/UJpVAx7o1aY?si=w-AfhBcRIO-e3Ysj"title="YouTube video player"frameborder="0"allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"referrerpolicy="strict-origin-when-cross-origin"allowfullscreen></iframe>
@@ -59,6 +67,8 @@ In practice, many queries are repetitive, i.e., run unchanged or only with sligh
All three techniques aim to skip as many rows during full-column reads as possible because the fastest way to read data is to not read it at all.
+
+ 🤿 Deep dive into this in the [Data Pruning](/docs/academic_overview#3-2-data-pruning) section of the web version of our VLDB 2024 paper.
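For illustration (hypothetical table and filter), a query that filters on a prefix of the primary key lets ClickHouse skip every granule whose key range cannot match; `EXPLAIN indexes = 1` shows how many parts and granules remain after pruning:

```sql
-- Hypothetical table sorted by its primary key (user_id, event_time).
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- The sparse primary index keeps one entry per granule, so granules whose
-- key range cannot contain user_id = 42 are never read.
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE user_id = 42;
```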
## Storage Layer: Data compression {#storage-layer-data-compression}
<iframewidth="768"height="432"src="https://www.youtube.com/embed/MH10E3rVvnM?si=duWmS_OatCLx-akH"title="YouTube video player"frameborder="0"allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"referrerpolicy="strict-origin-when-cross-origin"allowfullscreen></iframe>
@@ -71,6 +81,8 @@ Users can [specify](https://clickhouse.com/blog/optimize-clickhouse-codecs-compr
Data compression not only reduces the storage size of the database tables, but in many cases, it also improves query performance as local disks and network I/O are often constrained by low throughput.
+
+ 🤿 Deep dive into this in the [On-Disk Format](/docs/academic_overview#3-1-on-disk-format) section of the web version of our VLDB 2024 paper.
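For illustration (the table and columns are hypothetical), codecs can be specified per column to match the data; columns without an explicit `CODEC` fall back to the default compression:

```sql
-- Hypothetical table with per-column codecs chosen to match the data.
CREATE TABLE sensor_readings
(
    ts     DateTime CODEC(Delta, ZSTD),  -- delta-encode timestamps, then compress
    sensor String,
    value  Float64  CODEC(Gorilla),      -- specialized codec for floating-point series
    note   String   CODEC(ZSTD(3))       -- general-purpose compression, higher level
)
ENGINE = MergeTree
ORDER BY (sensor, ts);
```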
Finally, ClickHouse uses a vectorized query processing layer that parallelizes query execution as much as possible to utilize all resources for maximum speed and efficiency.
@@ -81,6 +93,8 @@ Modern systems have dozens of CPU cores. To utilize all cores, ClickHouse unfold
If a single node becomes too small to hold the table data, further nodes can be added to form a cluster. Tables can be split ("sharded") and distributed across the nodes. ClickHouse will run queries on all nodes that store table data and thereby scale "horizontally" with the number of available nodes.
+
+ 🤿 Deep dive into this in the [Query Processing Layer](/academic_overview#4-query-processing-layer) section of the web version of our VLDB 2024 paper.
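For illustration, a sketch of both scaling directions (the cluster name `my_cluster` and the local table `hits_local` are hypothetical and must already exist in the server's cluster configuration):

```sql
-- Per-node parallelism: limit the number of query processing threads
-- (defaults to the number of CPU cores).
SET max_threads = 16;

-- Horizontal scaling: a Distributed table fans the query out to every shard
-- of the cluster and combines the per-shard results on the initiating node.
CREATE TABLE hits_distributed AS hits_local
ENGINE = Distributed('my_cluster', currentDatabase(), 'hits_local', rand());

SELECT count() FROM hits_distributed;
```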
## Meticulous attention to detail {#meticulous-attention-to-detail}
> **"ClickHouse is a freak system - you guys have 20 versions of a hash table. You guys have all these amazing things where most systems will have one hash table****…****ClickHouse has this amazing performance because it has all these specialized components"**[Andy Pavlo, Database Professor at CMU](https://www.youtube.com/watch?v=Vy2t_wZx4Is&t=3579s)
@@ -109,14 +123,15 @@ The [hash table implementation in ClickHouse](https://clickhouse.com/blog/hash-t
Algorithms that rely on data characteristics often perform better than their generic counterparts. If the data characteristics are not known in advance, the system can try various implementations and choose the one that works best at runtime. For an example, see the [article on how LZ4 decompression is implemented in ClickHouse](https://habr.com/en/company/yandex/blog/457612/).
+
+ 🤿 Deep dive into this in the [Holistic Performance Optimization](/academic_overview#4-4-holistic-performance-optimization) section of the web version of our VLDB 2024 paper.
## VLDB 2024 paper {#vldb-2024-paper}
In August 2024, we had our first research paper accepted and published at VLDB.
VLDB is an international conference on very large databases, and is widely regarded as one of the leading conferences in the field of data management.
Among the hundreds of submissions, VLDB generally has an acceptance rate of ~20%.
- You can read a [PDF of the paper](https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf), which gives a concise description of ClickHouse's most interesting architectural and system design components that make it so fast.
+ You can read a [PDF of the paper](https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf) or our [web version](/docs/academic_overview) of it, which gives a concise description of ClickHouse's most interesting architectural and system design components that make it so fast.
Alexey Milovidov, our CTO and the creator of ClickHouse, presented the paper (slides [here](https://raw.githubusercontent.com/ClickHouse/clickhouse-presentations/master/2024-vldb/VLDB_2024_presentation.pdf)), followed by a Q&A (that quickly ran out of time!).
docs/managing-data/core-concepts/academic_overview.md (+2 −2: 2 additions, 2 deletions)
@@ -88,7 +88,7 @@ Figure 3: Inserts and merges for MergeTree*-engine tables.
Compared to LSM trees [\[58\]](#page-13-7) and their implementation in various databases [\[13,](#page-12-6)[26,](#page-12-7)[56\]](#page-13-8), ClickHouse treats all parts as equal instead of arranging them in a hierarchy. As a result, merges are no longer limited to parts in the same level. Since this also forgoes the implicit chronological ordering of parts, alternative mechanisms for updates and deletes not based on tombstones are required (see Section [3.4](#page-4-0)). ClickHouse writes inserts directly to disk while other LSM-tree-based stores typically use write-ahead logging (see Section [3.7](#page-5-1)).
- A part corresponds to a directory on disk, containing one fle for each column. As an optimization, the columns of a small part (smaller than 10 MB by default) are stored consecutively in a single fle to increase the spatial locality for reads and writes. The rows of a part are further logically divided into groups of 8192 records, called granules. A granule represents the smallest indivisible data unit processed by the scan and index lookup operators in ClickHouse. Reads and writes of on-disk data are, however, not performed at the granule level but at the granularity of blocks, which combine multiple neighboring granules within a column. New blocks are formed based on a configurable byte size per block (by default 1 MB), i.e., the number of granules in a block is variable and depends on the column's data type and distribution. Blocks are furthermore compressed to reduce their size and I/O costs. By default, ClickHouse employs LZ4 [\[75\]](#page-13-9) as a general-purpose compression algorithm, but users can also specify specialized codecs like Gorilla [\[63\]](#page-13-10) or FPC [\[12\]](#page-12-8) for floating-point data. Compression algorithms can also be chained. For example, it is possible to first reduce logical redundancy in numeric values using delta coding [\[23\]](#page-12-9), then perform heavy-weight compression, and finally encrypt the data using an AES codec. Blocks are decompressed on-the-fy when they are loaded from disk into memory. To enable fast random access to individual granules despite compression, ClickHouse additionally stores for each column a mapping that associates every granule id with the offset of its containing compressed block in the column fle and the offset of the granule in the uncompressed block.
+ A part corresponds to a directory on disk, containing one file for each column. As an optimization, the columns of a small part (smaller than 10 MB by default) are stored consecutively in a single file to increase the spatial locality for reads and writes. The rows of a part are further logically divided into groups of 8192 records, called granules. A granule represents the smallest indivisible data unit processed by the scan and index lookup operators in ClickHouse. Reads and writes of on-disk data are, however, not performed at the granule level but at the granularity of blocks, which combine multiple neighboring granules within a column. New blocks are formed based on a configurable byte size per block (by default 1 MB), i.e., the number of granules in a block is variable and depends on the column's data type and distribution. Blocks are furthermore compressed to reduce their size and I/O costs. By default, ClickHouse employs LZ4 [\[75\]](#page-13-9) as a general-purpose compression algorithm, but users can also specify specialized codecs like Gorilla [\[63\]](#page-13-10) or FPC [\[12\]](#page-12-8) for floating-point data. Compression algorithms can also be chained. For example, it is possible to first reduce logical redundancy in numeric values using delta coding [\[23\]](#page-12-9), then perform heavy-weight compression, and finally encrypt the data using an AES codec. Blocks are decompressed on-the-fly when they are loaded from disk into memory. To enable fast random access to individual granules despite compression, ClickHouse additionally stores for each column a mapping that associates every granule id with the offset of its containing compressed block in the column file and the offset of the granule in the uncompressed block.
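For illustration, a hypothetical table that spells out the defaults mentioned above (8192-row granules, roughly 1 MB compressed blocks, compact parts below 10 MB) and chains a delta codec with general-purpose compression; an encryption codec such as `AES_128_GCM_SIV` could additionally be appended if keys are configured on the server:

```sql
-- Hypothetical table making the defaults from the text explicit.
CREATE TABLE metrics
(
    ts    DateTime CODEC(Delta, ZSTD),  -- chained codecs: delta coding, then heavy-weight compression
    name  String,
    value Float64
)
ENGINE = MergeTree
ORDER BY (name, ts)
SETTINGS index_granularity = 8192,           -- rows per granule
         max_compress_block_size = 1048576,  -- ~1 MB of uncompressed data per compressed block
         min_bytes_for_wide_part = 10485760; -- parts below 10 MB use the single-file compact format

-- Inspect how inserted rows were laid out into parts and granules (marks).
SELECT name, part_type, rows, marks
FROM system.parts
WHERE database = currentDatabase() AND table = 'metrics' AND active;
```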
Columns can further be dictionary-encoded [\[2,](#page-12-10)[77,](#page-13-11)[81\]](#page-13-12) or made nullable using two special wrapper data types: LowCardinality(T) replaces the original column values by integer ids and thus significantly reduces the storage overhead for data with few unique values. Nullable(T) adds an internal bitmap to column T, representing whether column values are NULL or not.
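For illustration, a hypothetical table using both wrapper types:

```sql
-- Hypothetical table: LowCardinality replaces values by integer ids in a dictionary,
-- Nullable adds an internal bitmap marking which values are NULL.
CREATE TABLE page_views
(
    url        LowCardinality(String),
    referrer   Nullable(String),
    view_count UInt64
)
ENGINE = MergeTree
ORDER BY url;
```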
@@ -336,7 +336,7 @@ The most similar databases to ClickHouse, in terms of goals and design principle
Snowflake [\[22\]](#page-12-37) is a popular proprietary cloud data warehouse based on a shared-disk architecture. Its approach of dividing tables into micro-partitions is similar to the concept of parts in ClickHouse. Snowflake uses hybrid PAX pages [\[3\]](#page-12-41) for persistence, whereas ClickHouse's storage format is strictly columnar. Snowflake also emphasizes local caching and data pruning using automatically created lightweight indexes [\[31,](#page-12-13)[51\]](#page-13-14) as a source for good performance. Similar to primary keys in ClickHouse, users may optionally create clustered indexes to co-locate data with the same values.
- Photon [\[5\]](#page-12-39) and Velox [\[62\]](#page-13-32) are query execution engines designed to be used as components in complex data management systems. Both systems are passed query plans as input, which are then executed on the local node over Parquet (Photon) or Arrow (Velox) files [\[46\]](#page-13-34). ClickHouse is able to consume and generate data in these generic formats but prefers its native fle format for storage. While Velox and Photon do not optimize the query plan (Velox performs basic expression optimizations), they utilize runtime adaptivity techniques, such as dynamically switching compute kernels depending on the data characteristics. Similarly, plan operators in ClickHouse
+ Photon [\[5\]](#page-12-39) and Velox [\[62\]](#page-13-32) are query execution engines designed to be used as components in complex data management systems. Both systems are passed query plans as input, which are then executed on the local node over Parquet (Photon) or Arrow (Velox) files [\[46\]](#page-13-34). ClickHouse is able to consume and generate data in these generic formats but prefers its native file format for storage. While Velox and Photon do not optimize the query plan (Velox performs basic expression optimizations), they utilize runtime adaptivity techniques, such as dynamically switching compute kernels depending on the data characteristics. Similarly, plan operators in ClickHouse
can create other operators at runtime, primarily to switch to external aggregation or join operators, based on the query memory consumption. The Photon paper notes that code-generating designs [\[38,](#page-12-22)[41,](#page-12-42)[53\]](#page-13-0) are harder to develop and debug than interpreted vectorized designs [\[11\]](#page-12-0). The (experimental) support for code generation in Velox builds and links a shared library produced from runtime-generated C++ code, whereas ClickHouse interacts directly with LLVM's on-request compilation API.