Commit e315e42

Merge pull request #3525 from ClickHouse/supported_badges
reduce sparse image sizes
2 parents ce07fa8 + 31e8f65 commit e315e42

1 file changed: +17 -17 lines changed

docs/guides/best-practices/sparse-primary-indexes.md

Lines changed: 17 additions & 17 deletions
@@ -331,7 +331,7 @@ ClickHouse is a <a href="https://clickhouse.com/docs/introduction/distinctive-fe
 - then by `URL`,
 - and lastly by `EventTime`:

-<Image img={sparsePrimaryIndexes01} size="lg" alt="Sparse Primary Indices 01" background="white"/>
+<Image img={sparsePrimaryIndexes01} size="md" alt="Sparse Primary Indices 01" background="white"/>

 `UserID.bin`, `URL.bin`, and `EventTime.bin` are the data files on disk where the values of the `UserID`, `URL`, and `EventTime` columns are stored.

@@ -355,7 +355,7 @@ Column values are not physically stored inside granules: granules are just a log
 The following diagram shows how the (column values of) 8.87 million rows of our table
 are organized into 1083 granules, as a result of the table's DDL statement containing the setting `index_granularity` (set to its default value of 8192).

-<Image img={sparsePrimaryIndexes02} size="lg" alt="Sparse Primary Indices 02" background="white"/>
+<Image img={sparsePrimaryIndexes02} size="md" alt="Sparse Primary Indices 02" background="white"/>

 The first (based on physical order on disk) 8192 rows (their column values) logically belong to granule 0, then the next 8192 rows (their column values) belong to granule 1 and so on.
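Reviewer note (not part of the commit): the granule numbers in the context lines of this hunk can be sanity-checked with ceiling division. The sketch below is illustrative only. The function names are hypothetical, the row count of 8,870,000 is a rounded stand-in for the "8.87 million rows" mentioned in the text, and it assumes a fixed (non-adaptive) `index_granularity` of 8192.

```python
def granule_count(total_rows: int, index_granularity: int = 8192) -> int:
    """Number of granules needed for total_rows (ceiling division)."""
    return -(-total_rows // index_granularity)


def granule_of_row(row_position: int, index_granularity: int = 8192) -> int:
    """Granule that the row at a 0-based on-disk position logically belongs to."""
    return row_position // index_granularity


# Rounded row count taken from the text ("8.87 million rows"); an assumption.
print(granule_count(8_870_000))                    # 1083
print(granule_of_row(8191), granule_of_row(8192))  # 0 1
```

The first 8192 rows (positions 0 to 8191) map to granule 0 and position 8192 starts granule 1, matching the description in the context line above.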

@@ -390,7 +390,7 @@ For example

 In total the index has 1083 entries for our table with 8.87 million rows and 1083 granules:

-<Image img={sparsePrimaryIndexes03b} size="lg" alt="Sparse Primary Indices 03b" background="white"/>
+<Image img={sparsePrimaryIndexes03b} size="md" alt="Sparse Primary Indices 03b" background="white"/>

 :::note
 - For tables with [adaptive index granularity](/whats-new/changelog/2019.md/#experimental-features-1), there is also one "final" additional mark stored in the primary index that records the values of the primary key columns of the last table row, but because we disabled adaptive index granularity (in order to simplify the discussions in this guide, as well as make the diagrams and results reproducible), the index of our example table doesn't include this final mark.
@@ -615,7 +615,7 @@ We discuss that second stage in more detail in the following section.

 The following diagram illustrates a part of the primary index file for our table.

-<Image img={sparsePrimaryIndexes04} size="lg" alt="Sparse Primary Indices 04" background="white"/>
+<Image img={sparsePrimaryIndexes04} size="md" alt="Sparse Primary Indices 04" background="white"/>

 As discussed above, via a binary search over the index’s 1083 UserID marks, mark 176 was identified. Its corresponding granule 176 can therefore possibly contain rows with a UserID column value of 749.927.693.
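Reviewer note (not part of the commit): the binary search over index marks mentioned in this hunk's context can be sketched as follows. This is a simplified illustration, not ClickHouse's actual implementation. The function name and the mark values are hypothetical; the only assumption carried over from the guide is that each mark holds the key value of the first row of its granule and that marks are sorted in ascending order.

```python
from bisect import bisect_left, bisect_right


def candidate_granules(mark_values: list[int], target: int) -> range:
    """Granules that may contain `target`, given the first-row key of each granule.

    A granule can hold the target if its first key is <= target and the next
    granule's first key is >= target, so the granule just before the first
    mark that is >= target must be included as well.
    """
    first = max(bisect_left(mark_values, target) - 1, 0)
    end = bisect_right(mark_values, target)  # exclusive upper bound
    return range(first, end)


# Hypothetical mark values, not taken from the example table.
marks = [10, 20, 30, 40]
print(list(candidate_granules(marks, 25)))  # [1]
print(list(candidate_granules(marks, 20)))  # [0, 1]
print(list(candidate_granules(marks, 5)))   # []
```

Because the index is sparse, the result only says which granules possibly contain matching rows; the rows of those granules still have to be read and filtered.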

@@ -637,7 +637,7 @@ In ClickHouse the physical locations of all granules for our table are stored in

 The following diagram shows the three mark files `UserID.mrk`, `URL.mrk`, and `EventTime.mrk` that store the physical locations of the granules for the table’s `UserID`, `URL`, and `EventTime` columns.

-<Image img={sparsePrimaryIndexes05} size="lg" alt="Sparse Primary Indices 05" background="white"/>
+<Image img={sparsePrimaryIndexes05} size="md" alt="Sparse Primary Indices 05" background="white"/>

 We have discussed how the primary index is a flat uncompressed array file (primary.idx), containing index marks that are numbered starting at 0.

@@ -688,7 +688,7 @@ The indirection provided by mark files avoids storing, directly within the prima

 The following diagram and the text below illustrate how for our example query ClickHouse locates granule 176 in the UserID.bin data file.

-<Image img={sparsePrimaryIndexes06} size="lg" alt="Sparse Primary Indices 06" background="white"/>
+<Image img={sparsePrimaryIndexes06} size="md" alt="Sparse Primary Indices 06" background="white"/>

 We discussed earlier in this guide that ClickHouse selected the primary index mark 176 and therefore granule 176 as possibly containing matching rows for our query.

@@ -802,7 +802,7 @@ We have marked the key column values for the first table rows for each granule i

 Suppose UserID had low cardinality. In this case it would be likely that the same UserID value is spread over multiple table rows and granules and therefore index marks. For index marks with the same UserID, the URL values for the index marks are sorted in ascending order (because the table rows are ordered first by UserID and then by URL). This allows efficient filtering as described below:

-<Image img={sparsePrimaryIndexes07} size="lg" alt="Sparse Primary Indices 06" background="white"/>
+<Image img={sparsePrimaryIndexes07} size="md" alt="Sparse Primary Indices 06" background="white"/>

 There are three different scenarios for the granule selection process for our abstract sample data in the diagram above:

@@ -816,7 +816,7 @@ There are three different scenarios for the granule selection process for our ab

 When the UserID has high cardinality then it is unlikely that the same UserID value is spread over multiple table rows and granules. This means the URL values for the index marks are not monotonically increasing:

-<Image img={sparsePrimaryIndexes08} size="lg" alt="Sparse Primary Indices 06" background="white"/>
+<Image img={sparsePrimaryIndexes08} size="md" alt="Sparse Primary Indices 06" background="white"/>

 As we can see in the diagram above, all shown marks whose URL values are smaller than W3 are getting selected for streaming its associated granule's rows into the ClickHouse engine.

@@ -850,7 +850,7 @@ ALTER TABLE hits_UserID_URL MATERIALIZE INDEX url_skipping_index;
 ```
 ClickHouse now created an additional index that is storing - per group of 4 consecutive [granules](#data-is-organized-into-granules-for-parallel-data-processing) (note the `GRANULARITY 4` clause in the `ALTER TABLE` statement above) - the minimum and maximum URL value:

-<Image img={sparsePrimaryIndexes13a} size="lg" alt="Sparse Primary Indices 13a" background="white"/>
+<Image img={sparsePrimaryIndexes13a} size="md" alt="Sparse Primary Indices 13a" background="white"/>

 The first index entry (‘mark 0’ in the diagram above) is storing the minimum and maximum URL values for the [rows belonging to the first 4 granules of our table](#data-is-organized-into-granules-for-parallel-data-processing).

@@ -945,11 +945,11 @@ OPTIMIZE TABLE hits_URL_UserID FINAL;

 Because we switched the order of the columns in the primary key, the inserted rows are now stored on disk in a different lexicographical order (compared to our [original table](#a-table-with-a-primary-key)) and therefore also the 1083 granules of that table are containing different values than before:

-<Image img={sparsePrimaryIndexes10} size="lg" alt="Sparse Primary Indices 10" background="white"/>
+<Image img={sparsePrimaryIndexes10} size="md" alt="Sparse Primary Indices 10" background="white"/>

 This is the resulting primary key:

-<Image img={sparsePrimaryIndexes11} size="lg" alt="Sparse Primary Indices 11" background="white"/>
+<Image img={sparsePrimaryIndexes11} size="md" alt="Sparse Primary Indices 11" background="white"/>

 That can now be used to significantly speed up the execution of our example query filtering on the URL column in order to calculate the top 10 users that most frequently clicked on the URL "http://public_search":
 ```sql
@@ -1097,7 +1097,7 @@ Ok.
 - if new rows are inserted into the source table hits_UserID_URL, then that rows are automatically also inserted into the implicitly created table
 - Effectively the implicitly created table has the same row order and primary index as the [secondary table that we created explicitly](/guides/best-practices/sparse-primary-indexes#option-1-secondary-tables):

-<Image img={sparsePrimaryIndexes12b1} size="lg" alt="Sparse Primary Indices 12b1" background="white"/>
+<Image img={sparsePrimaryIndexes12b1} size="md" alt="Sparse Primary Indices 12b1" background="white"/>

 ClickHouse is storing the [column data files](#data-is-stored-on-disk-ordered-by-primary-key-columns) (*.bin), the [mark files](#mark-files-are-used-for-locating-granules) (*.mrk2) and the [primary index](#the-primary-index-has-one-entry-per-granule) (primary.idx) of the implicitly created table in a special folder withing the ClickHouse server's data directory:

@@ -1181,7 +1181,7 @@ ALTER TABLE hits_UserID_URL
 - please note that projections do not make queries that use ORDER BY more efficient, even if the ORDER BY matches the projection's ORDER BY statement (see https://github.com/ClickHouse/ClickHouse/issues/47333)
 - Effectively the implicitly created hidden table has the same row order and primary index as the [secondary table that we created explicitly](/guides/best-practices/sparse-primary-indexes#option-1-secondary-tables):

-<Image img={sparsePrimaryIndexes12c1} size="lg" alt="Sparse Primary Indices 12c1" background="white"/>
+<Image img={sparsePrimaryIndexes12c1} size="md" alt="Sparse Primary Indices 12c1" background="white"/>

 ClickHouse is storing the [column data files](#data-is-stored-on-disk-ordered-by-primary-key-columns) (*.bin), the [mark files](#mark-files-are-used-for-locating-granules) (*.mrk2) and the [primary index](#the-primary-index-has-one-entry-per-granule) (primary.idx) of the hidden table in a special folder (marked in orange in the screenshot below) next to the source table's data files, mark files, and primary index files:

@@ -1449,7 +1449,7 @@ In the following we illustrate why it's beneficial for the compression ratio of

 The diagram below sketches the on-disk order of rows for a primary key where the key columns are ordered by cardinality in ascending order:

-<Image img={sparsePrimaryIndexes14a} size="lg" alt="Sparse Primary Indices 14a" background="white"/>
+<Image img={sparsePrimaryIndexes14a} size="md" alt="Sparse Primary Indices 14a" background="white"/>

 We discussed that [the table's row data is stored on disk ordered by primary key columns](#data-is-stored-on-disk-ordered-by-primary-key-columns).

@@ -1461,7 +1461,7 @@ and locality (the more similar the data is, the better the compression ratio is)

 In contrast to the diagram above, the diagram below sketches the on-disk order of rows for a primary key where the key columns are ordered by cardinality in descending order:

-<Image img={sparsePrimaryIndexes14b} size="lg" alt="Sparse Primary Indices 14b" background="white"/>
+<Image img={sparsePrimaryIndexes14b} size="md" alt="Sparse Primary Indices 14b" background="white"/>

 Now the table's rows are first ordered by their `ch` value, and rows that have the same `ch` value are ordered by their `cl` value.
 But because the first key column `ch` has high cardinality, it is unlikely that there are rows with the same `ch` value. And because of that is is also unlikely that `cl` values are ordered (locally - for rows with the same `ch` value).
@@ -1504,7 +1504,7 @@ The following diagram shows
 - the insert order of rows when the content changes (for example because of keystrokes typing the text into the text-area) and
 - the on-disk order of the data from the inserted rows when the `PRIMARY KEY (hash)` is used:

-<Image img={sparsePrimaryIndexes15a} size="lg" alt="Sparse Primary Indices 15a" background="white"/>
+<Image img={sparsePrimaryIndexes15a} size="md" alt="Sparse Primary Indices 15a" background="white"/>

 Because the `hash` column is used as the primary key column
 - specific rows can be retrieved [very quickly](#the-primary-index-is-used-for-selecting-granules), but
@@ -1519,7 +1519,7 @@ The following diagram shows
 - the insert order of rows when the content changes (for example because of keystrokes typing the text into the text-area) and
 - the on-disk order of the data from the inserted rows when the compound `PRIMARY KEY (fingerprint, hash)` is used:

-<Image img={sparsePrimaryIndexes15b} size="lg" alt="Sparse Primary Indices 15b" background="white"/>
+<Image img={sparsePrimaryIndexes15b} size="md" alt="Sparse Primary Indices 15b" background="white"/>

 Now the rows on disk are first ordered by `fingerprint`, and for rows with the same fingerprint value, their `hash` value determines the final order.
