Commit 780d3d2

Merge pull request #4027 from mneedham/merge-table-function-guide
merge table engine guide
2 parents ca91e6e + 0aef720 commit 780d3d2

3 files changed, +228 -0 lines changed

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
---
slug: /guides/developer/merge-table-function
sidebar_label: 'Merge table function'
title: 'Merge table function'
description: 'Query multiple tables at the same time.'
---

The [merge table function](https://clickhouse.com/docs/sql-reference/table-functions/merge) lets us query multiple tables in parallel.
It does this by creating a temporary [Merge](https://clickhouse.com/docs/engines/table-engines/special/merge) table, whose structure is derived by taking the union of the underlying tables' columns and deriving common types for them.

<iframe width="768" height="432" src="https://www.youtube.com/embed/b4YfRhD9SSI?si=MuoDwDWeikAV5ttk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

## Setup tables {#setup-tables}

We're going to learn how to use this function with help from [Jeff Sackmann's tennis dataset](https://github.com/JeffSackmann/tennis_atp).
We're going to process CSV files that contain matches going back to the 1960s, but we'll create a slightly different schema for each decade.
We'll also add a couple of extra columns for the 1990s decade.

The import statements are shown below:

```sql
-- 1960s: keep score as a plain String and read winner_seed as Nullable(String)
CREATE OR REPLACE TABLE atp_matches_1960s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, score
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1968..1969}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
         schema_inference_hints='winner_seed Nullable(String), loser_seed Nullable(UInt8)';

-- 1970s: split score into an Array(String); both seeds are Nullable(UInt8)
CREATE OR REPLACE TABLE atp_matches_1970s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1970..1979}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
         schema_inference_hints='winner_seed Nullable(UInt8), loser_seed Nullable(UInt8)';

-- 1980s: same as the 1970s, but both seeds are widened to Nullable(UInt16)
CREATE OR REPLACE TABLE atp_matches_1980s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1980..1989}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
         schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16)';

-- 1990s: surface becomes an Enum, and walkover/retirement flags are derived from score
CREATE OR REPLACE TABLE atp_matches_1990s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score,
       toBool(arrayExists(x -> position(x, 'W/O') > 0, score))::Nullable(bool) AS walkover,
       toBool(arrayExists(x -> position(x, 'RET') > 0, score))::Nullable(bool) AS retirement
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1990..1999}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
         schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16), surface Enum(\'Hard\', \'Grass\', \'Clay\', \'Carpet\')';
```
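
As a quick sanity check after the import, we could confirm that each decade's table was populated. This is a minimal sketch rather than part of the original guide: it assumes the tables live in the current database and use a MergeTree-family engine, for which `system.tables` reports `total_rows`.

```sql
-- List the imported tables together with their row counts.
SELECT name, total_rows
FROM system.tables
WHERE database = currentDatabase() AND name LIKE 'atp_matches%'
ORDER BY name;
```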

## Schema of multiple tables {#schema-multiple-tables}

We can run the following query to list the columns in each table along with their types side by side, so that it's easier to see the differences.

```sql
SELECT * EXCEPT(position) FROM (
    SELECT position, name,
        any(if(table = 'atp_matches_1960s', type, null)) AS 1960s,
        any(if(table = 'atp_matches_1970s', type, null)) AS 1970s,
        any(if(table = 'atp_matches_1980s', type, null)) AS 1980s,
        any(if(table = 'atp_matches_1990s', type, null)) AS 1990s
    FROM system.columns
    WHERE database = currentDatabase() AND table LIKE 'atp_matches%'
    GROUP BY ALL
    ORDER BY position ASC
)
SETTINGS output_format_pretty_max_value_width=25;
```

```text
┌─name────────┬─1960s────────────┬─1970s───────────┬─1980s────────────┬─1990s─────────────────────┐
│ tourney_id  │ String           │ String          │ String           │ String                    │
│ surface     │ String           │ String          │ String           │ Enum8('Hard' = 1, 'Grass'⋯│
│ winner_name │ String           │ String          │ String           │ String                    │
│ loser_name  │ String           │ String          │ String           │ String                    │
│ winner_seed │ Nullable(String) │ Nullable(UInt8) │ Nullable(UInt16) │ Nullable(UInt16)          │
│ loser_seed  │ Nullable(UInt8)  │ Nullable(UInt8) │ Nullable(UInt16) │ Nullable(UInt16)          │
│ score       │ String           │ Array(String)   │ Array(String)    │ Array(String)             │
│ walkover    │ ᴺᵁᴸᴸ             │ ᴺᵁᴸᴸ            │ ᴺᵁᴸᴸ             │ Nullable(Bool)            │
│ retirement  │ ᴺᵁᴸᴸ             │ ᴺᵁᴸᴸ            │ ᴺᵁᴸᴸ             │ Nullable(Bool)            │
└─────────────┴──────────────────┴─────────────────┴──────────────────┴───────────────────────────┘
```

Let's go through the differences:

* 1970s changes the type of `winner_seed` from `Nullable(String)` to `Nullable(UInt8)` and `score` from `String` to `Array(String)`.
* 1980s changes `winner_seed` and `loser_seed` from `Nullable(UInt8)` to `Nullable(UInt16)`.
* 1990s changes `surface` from `String` to `Enum('Hard', 'Grass', 'Clay', 'Carpet')` and adds the `walkover` and `retirement` columns.
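
To see the single structure that `merge` derives from these four schemas, one option is to describe the table function directly. This is a sketch rather than output from the original guide, so no result is shown. The queries in this guide call `variantType` on `winner_seed` and `score`, which suggests those conflicting columns are exposed with a `Variant` type.

```sql
-- Show the column names and types of the temporary Merge table.
DESCRIBE TABLE merge('atp_matches*');
```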

## Querying multiple tables with merge {#querying-multiple-tables}

Let's write a query to find the matches that John McEnroe won against someone who was seeded #1:

```sql
SELECT loser_name, score
FROM merge('atp_matches*')
WHERE winner_name = 'John McEnroe'
AND loser_seed = 1;
```

```text
┌─loser_name────┬─score───────────────────────────┐
│ Bjorn Borg    │ ['6-3','6-4']                   │
│ Bjorn Borg    │ ['7-6','6-1','6-7','5-7','6-4'] │
│ Bjorn Borg    │ ['7-6','6-4']                   │
│ Bjorn Borg    │ ['4-6','7-6','7-6','6-4']       │
│ Jimmy Connors │ ['6-1','6-3']                   │
│ Ivan Lendl    │ ['6-2','4-6','6-3','6-7','7-6'] │
│ Ivan Lendl    │ ['6-3','3-6','6-3','7-6']       │
│ Ivan Lendl    │ ['6-1','6-3']                   │
│ Stefan Edberg │ ['6-2','6-3']                   │
│ Stefan Edberg │ ['7-6','6-2']                   │
│ Stefan Edberg │ ['6-2','6-2']                   │
│ Jakob Hlasek  │ ['6-3','7-6']                   │
└───────────────┴─────────────────────────────────┘
```
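
The argument to `merge` is a regular expression matched against table names, not a glob, so a narrower pattern restricts the query to a subset of the tables. As a sketch (the anchored pattern below is an illustration, not part of the original guide), the same query can be limited to the 1980s and 1990s tables:

```sql
-- Only atp_matches_1980s and atp_matches_1990s match this pattern.
SELECT loser_name, score
FROM merge('^atp_matches_19[89]0s$')
WHERE winner_name = 'John McEnroe'
AND loser_seed = 1;
```

Both of these tables store `loser_seed` as `Nullable(UInt16)` and `score` as `Array(String)`, so this subset avoids the type differences introduced by the earlier decades.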

Next, let's say we want to filter those matches to find the ones where McEnroe was seeded #3 or lower (that is, had a seed number of 3 or higher).
This is a bit trickier because `winner_seed` uses different types across the various tables:

```sql
SELECT loser_name, score, winner_seed
FROM merge('atp_matches*')
WHERE winner_name = 'John McEnroe'
AND loser_seed = 1
AND multiIf(
  variantType(winner_seed) = 'UInt8', variantElement(winner_seed, 'UInt8') >= 3,
  variantType(winner_seed) = 'UInt16', variantElement(winner_seed, 'UInt16') >= 3,
  variantElement(winner_seed, 'String')::UInt16 >= 3
);
```

We use the [`variantType`](/docs/sql-reference/functions/other-functions#varianttype) function to check the type of `winner_seed` for each row and then [`variantElement`](/docs/sql-reference/functions/other-functions#variantelement) to extract the underlying value.
When the type is `String`, we cast to a number and then do the comparison.
The result of running the query is shown below:

```text
┌─loser_name────┬─score─────────┬─winner_seed─┐
│ Bjorn Borg    │ ['6-3','6-4'] │ 3           │
│ Stefan Edberg │ ['6-2','6-3'] │ 6           │
│ Stefan Edberg │ ['7-6','6-2'] │ 4           │
│ Stefan Edberg │ ['6-2','6-2'] │ 7           │
└───────────────┴───────────────┴─────────────┘
```

## Which table do rows come from when using merge? {#which-table-merge}

What if we want to know which table rows come from?
We can use the `_table` virtual column to do this, as shown in the following query:

```sql
SELECT _table, loser_name, score, winner_seed
FROM merge('atp_matches*')
WHERE winner_name = 'John McEnroe'
AND loser_seed = 1
AND multiIf(
  variantType(winner_seed) = 'UInt8', variantElement(winner_seed, 'UInt8') >= 3,
  variantType(winner_seed) = 'UInt16', variantElement(winner_seed, 'UInt16') >= 3,
  variantElement(winner_seed, 'String')::UInt16 >= 3
);
```

```text
┌─_table────────────┬─loser_name────┬─score─────────┬─winner_seed─┐
│ atp_matches_1970s │ Bjorn Borg    │ ['6-3','6-4'] │ 3           │
│ atp_matches_1980s │ Stefan Edberg │ ['6-2','6-3'] │ 6           │
│ atp_matches_1980s │ Stefan Edberg │ ['7-6','6-2'] │ 4           │
│ atp_matches_1980s │ Stefan Edberg │ ['6-2','6-2'] │ 7           │
└───────────────────┴───────────────┴───────────────┴─────────────┘
```
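
Because `_table` identifies the source table, it can also be combined with `variantType` to check which underlying type each table contributes to `winner_seed`. The query below is a sketch along those lines, not something taken from the original guide, so its output is not shown. Rows where the seed is `NULL` should be reported with the type name `None`.

```sql
-- Count rows per source table and per underlying type of winner_seed.
SELECT _table, variantType(winner_seed) AS seed_type, count() AS matches
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table, seed_type;
```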

We could also use this virtual column as part of a query to count the values for the `walkover` column:

```sql
SELECT _table, walkover, count()
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table;
```

```text
┌─_table────────────┬─walkover─┬─count()─┐
│ atp_matches_1960s │ ᴺᵁᴸᴸ     │    7542 │
│ atp_matches_1970s │ ᴺᵁᴸᴸ     │   39165 │
│ atp_matches_1980s │ ᴺᵁᴸᴸ     │   36233 │
│ atp_matches_1990s │ true     │     128 │
│ atp_matches_1990s │ false    │   37022 │
└───────────────────┴──────────┴─────────┘
```

We can see that the `walkover` column is `NULL` for everything except `atp_matches_1990s`.
We'll need to update our query to check whether the `score` column contains the string `W/O` if the `walkover` column is `NULL`:

```sql
SELECT _table,
   multiIf(
     walkover IS NOT NULL,
     walkover,
     variantType(score) = 'Array(String)',
     toBool(arrayExists(
        x -> position(x, 'W/O') > 0,
        variantElement(score, 'Array(String)')
     )),
     variantElement(score, 'String') LIKE '%W/O%'
   ),
   count()
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table;
```

If the underlying type of `score` is `Array(String)` we have to go over the array and look for `W/O`, whereas if it has a type of `String` we can just search for `W/O` in the string.

```text
┌─_table────────────┬─multiIf(isNo⋯, '%W/O%'))─┬─count()─┐
│ atp_matches_1960s │ true                     │     242 │
│ atp_matches_1960s │ false                    │    7300 │
│ atp_matches_1970s │ true                     │     422 │
│ atp_matches_1970s │ false                    │   38743 │
│ atp_matches_1980s │ true                     │      92 │
│ atp_matches_1980s │ false                    │   36141 │
│ atp_matches_1990s │ true                     │     128 │
│ atp_matches_1990s │ false                    │   37022 │
└───────────────────┴──────────────────────────┴─────────┘
```
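
The generated column header above is truncated and hard to read. One possible refinement, shown as a sketch and not taken from the original guide, is to give the computed expression an alias; the names `is_walkover` and `matches` are hypothetical. The logic is otherwise unchanged.

```sql
SELECT _table,
   multiIf(
     -- Prefer the stored flag where the table has one (1990s only).
     walkover IS NOT NULL,
     walkover,
     -- Otherwise look for 'W/O' in the array form of score...
     variantType(score) = 'Array(String)',
     toBool(arrayExists(
        x -> position(x, 'W/O') > 0,
        variantElement(score, 'Array(String)')
     )),
     -- ...or in the plain String form used by the 1960s table.
     variantElement(score, 'String') LIKE '%W/O%'
   ) AS is_walkover,
   count() AS matches
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table, is_walkover;
```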

scripts/aspell-ignore/en/aspell-dict.txt

Lines changed: 1 addition & 0 deletions
@@ -3558,3 +3558,4 @@ columnstore
 TiDB
 resync
 resynchronization
+Sackmann's

sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -1184,6 +1184,7 @@ const sidebars = {
 link: { type: "doc", id: "guides/developer/index" },
 items: [
 "guides/developer/dynamic-column-selection",
+"guides/developer/merge-table-function",
 "guides/developer/alternative-query-languages",
 "guides/developer/cascading-materialized-views",
 "guides/developer/debugging-memory-issues",
