Merge pull request #4027 from mneedham/merge-table-function-guide

Blargian · web-flow · commit 780d3d2f7184 · 2025-07-04T13:39:07.000+02:00
merge table engine guide
diff --git a/docs/guides/developer/merge-table-function.md b/docs/guides/developer/merge-table-function.md
@@ -0,0 +1,226 @@
+---
+slug: /guides/developer/merge-table-function
+sidebar_label: 'Merge table function'
+title: 'Merge table function'
+description: 'Query multiple tables at the same time.'
+---
+
+The [merge table function](https://clickhouse.com/docs/sql-reference/table-functions/merge) lets us query multiple tables in parallel.
+It does this by creating a temporary [Merge](https://clickhouse.com/docs/engines/table-engines/special/merge) table whose structure is derived by taking the union of the underlying tables' columns and finding a common type for each column.
+
+<iframe width="768" height="432" src="https://www.youtube.com/embed/b4YfRhD9SSI?si=MuoDwDWeikAV5ttk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
+
+## Set up tables {#setup-tables}
+
+We're going to learn how to use this function with help from [Jeff Sackmann's tennis dataset](https://github.com/JeffSackmann/tennis_atp).
+We're going to process CSV files that contain matches going back to the 1960s, but we'll create a slightly different schema for each decade.
+We'll also add a couple of extra columns for the 1990s decade.
+
+The import statements are shown below:
+
+```sql
+CREATE OR REPLACE TABLE atp_matches_1960s ORDER BY tourney_id AS
+SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, score
+FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1968..1969}.csv')
+SETTINGS schema_inference_make_columns_nullable=0, 
+         schema_inference_hints='winner_seed Nullable(String), loser_seed Nullable(UInt8)';
+
+CREATE OR REPLACE TABLE atp_matches_1970s ORDER BY tourney_id AS 
+SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score
+FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1970..1979}.csv')
+SETTINGS schema_inference_make_columns_nullable=0, 
+         schema_inference_hints='winner_seed Nullable(UInt8), loser_seed Nullable(UInt8)';
+
+CREATE OR REPLACE TABLE atp_matches_1980s ORDER BY tourney_id AS
+SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score
+FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1980..1989}.csv')
+SETTINGS schema_inference_make_columns_nullable=0,
+         schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16)';
+
+CREATE OR REPLACE TABLE atp_matches_1990s ORDER BY tourney_id AS
+SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score,
+       toBool(arrayExists(x -> position(x, 'W/O') > 0, score))::Nullable(bool) AS walkover,
+       toBool(arrayExists(x -> position(x, 'RET') > 0, score))::Nullable(bool) AS retirement
+FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1990..1999}.csv')
+SETTINGS schema_inference_make_columns_nullable=0,
+         schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16), surface Enum(\'Hard\', \'Grass\', \'Clay\', \'Carpet\')';
+```
+
+## Schema of multiple tables {#schema-multiple-tables}
+ 
+We can run the following query to list the columns in each table along with their types side by side, so that it's easier to see the differences.
+
+```sql
+SELECT * EXCEPT(position) FROM (
+    SELECT position, name,
+       any(if(table = 'atp_matches_1960s', type, null)) AS 1960s,
+       any(if(table = 'atp_matches_1970s', type, null)) AS 1970s,
+       any(if(table = 'atp_matches_1980s', type, null)) AS 1980s,
+       any(if(table = 'atp_matches_1990s', type, null)) AS 1990s
+    FROM system.columns
+    WHERE database = currentDatabase() AND table LIKE 'atp_matches%'
+    GROUP BY ALL
+    ORDER BY position ASC
+)
+SETTINGS output_format_pretty_max_value_width=25;
+```
+
+```text
+┌─name────────┬─1960s────────────┬─1970s───────────┬─1980s────────────┬─1990s─────────────────────┐
+│ tourney_id  │ String           │ String          │ String           │ String                    │
+│ surface     │ String           │ String          │ String           │ Enum8('Hard' = 1, 'Grass'⋯│
+│ winner_name │ String           │ String          │ String           │ String                    │
+│ loser_name  │ String           │ String          │ String           │ String                    │
+│ winner_seed │ Nullable(String) │ Nullable(UInt8) │ Nullable(UInt16) │ Nullable(UInt16)          │
+│ loser_seed  │ Nullable(UInt8)  │ Nullable(UInt8) │ Nullable(UInt16) │ Nullable(UInt16)          │
+│ score       │ String           │ Array(String)   │ Array(String)    │ Array(String)             │
+│ walkover    │ ᴺᵁᴸᴸ             │ ᴺᵁᴸᴸ            │ ᴺᵁᴸᴸ             │ Nullable(Bool)            │
+│ retirement  │ ᴺᵁᴸᴸ             │ ᴺᵁᴸᴸ            │ ᴺᵁᴸᴸ             │ Nullable(Bool)            │
+└─────────────┴──────────────────┴─────────────────┴──────────────────┴───────────────────────────┘
+```
+
+Let's go through the differences:
+
+* 1970s changes the type of `winner_seed` from `Nullable(String)` to `Nullable(UInt8)` and `score` from `String` to `Array(String)`.
+* 1980s changes `winner_seed` and `loser_seed` from `Nullable(UInt8)` to `Nullable(UInt16)`.
+* 1990s changes `surface` from `String` to `Enum('Hard', 'Grass', 'Clay', 'Carpet')` and adds the `walkover` and `retirement` columns.
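+
+To check what `merge` derives from these four schemas, one option is to describe the table function directly. This is a minimal sketch that assumes the tables above live in the current database. The exact types reported may vary between ClickHouse versions, so no output is shown here.
+
+```sql
+-- Show the unified structure that merge() derives for the matching tables.
+-- The columns whose types differ between tables (winner_seed, loser_seed,
+-- surface and score) are the ones worth inspecting first.
+DESCRIBE TABLE merge('atp_matches*');
+```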
+
+## Querying multiple tables with merge {#querying-multiple-tables}
+
+Let's write a query to find the matches that John McEnroe won against someone who was seeded #1:
+
+```sql
+SELECT loser_name, score
+FROM merge('atp_matches*')
+WHERE winner_name = 'John McEnroe'
+AND loser_seed = 1;
+```
+
+```text
+┌─loser_name────┬─score───────────────────────────┐
+│ Bjorn Borg    │ ['6-3','6-4']                   │
+│ Bjorn Borg    │ ['7-6','6-1','6-7','5-7','6-4'] │
+│ Bjorn Borg    │ ['7-6','6-4']                   │
+│ Bjorn Borg    │ ['4-6','7-6','7-6','6-4']       │
+│ Jimmy Connors │ ['6-1','6-3']                   │
+│ Ivan Lendl    │ ['6-2','4-6','6-3','6-7','7-6'] │
+│ Ivan Lendl    │ ['6-3','3-6','6-3','7-6']       │
+│ Ivan Lendl    │ ['6-1','6-3']                   │
+│ Stefan Edberg │ ['6-2','6-3']                   │
+│ Stefan Edberg │ ['7-6','6-2']                   │
+│ Stefan Edberg │ ['6-2','6-2']                   │
+│ Jakob Hlasek  │ ['6-3','7-6']                   │
+└───────────────┴─────────────────────────────────┘
+```
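+
+Note that the argument to `merge` is a regular expression rather than a glob, so `atp_matches*` matches any table whose name contains `atp_matche` followed by zero or more `s` characters. If other tables in the database could match by accident, a stricter pattern may be safer. The following sketch assumes the table names follow the `atp_matches_<decade>s` convention used above:
+
+```sql
+-- Anchor the pattern so that only tables named like atp_matches_1970s match.
+SELECT loser_name, score
+FROM merge('^atp_matches_[0-9]+s$')
+WHERE winner_name = 'John McEnroe'
+AND loser_seed = 1;
+```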
+
+Next, let's say we want to filter those matches to find the ones where McEnroe was seeded #3 or lower, that is, where his seed number was 3 or greater.
+This is a bit trickier because `winner_seed` uses different types across the various tables:
+
+```sql
+SELECT loser_name, score, winner_seed
+FROM merge('atp_matches*')
+WHERE winner_name = 'John McEnroe'
+AND loser_seed = 1
+AND multiIf(
+  variantType(winner_seed) = 'UInt8', variantElement(winner_seed, 'UInt8') >= 3,
+  variantType(winner_seed) = 'UInt16', variantElement(winner_seed, 'UInt16') >= 3,
+  variantElement(winner_seed, 'String')::UInt16 >= 3
+);
+```
+
+We use the [`variantType`](/docs/sql-reference/functions/other-functions#varianttype) function to check the type of `winner_seed` for each row and then [`variantElement`](/docs/sql-reference/functions/other-functions#variantelement) to extract the underlying value.
+When the type is `String`, we cast to a number and then do the comparison.
+The result of running the query is shown below:
+
+```text
+┌─loser_name────┬─score─────────┬─winner_seed─┐
+│ Bjorn Borg    │ ['6-3','6-4'] │ 3           │
+│ Stefan Edberg │ ['6-2','6-3'] │ 6           │
+│ Stefan Edberg │ ['7-6','6-2'] │ 4           │
+│ Stefan Edberg │ ['6-2','6-2'] │ 7           │
+└───────────────┴───────────────┴─────────────┘
+```
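+
+To see why each branch of the `multiIf` is needed, it can help to count how many rows carry each underlying type. The sketch below uses the same `variantType` function as the query above; `seed_type` is just an illustrative alias. Rows where `winner_seed` is `NULL` are reported by `variantType` as `None`.
+
+```sql
+-- Count rows per underlying type of the winner_seed column.
+SELECT variantType(winner_seed) AS seed_type, count()
+FROM merge('atp_matches*')
+GROUP BY ALL
+ORDER BY seed_type;
+```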
+
+## Which table do rows come from when using merge? {#which-table-merge}
+
+What if we want to know which table rows come from?
+We can use the `_table` virtual column to do this, as shown in the following query:
+
+```sql
+SELECT _table, loser_name, score, winner_seed
+FROM merge('atp_matches*')
+WHERE winner_name = 'John McEnroe'
+AND loser_seed = 1
+AND multiIf(
+  variantType(winner_seed) = 'UInt8', variantElement(winner_seed, 'UInt8') >= 3,
+  variantType(winner_seed) = 'UInt16', variantElement(winner_seed, 'UInt16') >= 3,
+  variantElement(winner_seed, 'String')::UInt16 >= 3
+);
+```
+
+```text
+┌─_table────────────┬─loser_name────┬─score─────────┬─winner_seed─┐
+│ atp_matches_1970s │ Bjorn Borg    │ ['6-3','6-4'] │ 3           │
+│ atp_matches_1980s │ Stefan Edberg │ ['6-2','6-3'] │ 6           │
+│ atp_matches_1980s │ Stefan Edberg │ ['7-6','6-2'] │ 4           │
+│ atp_matches_1980s │ Stefan Edberg │ ['6-2','6-2'] │ 7           │
+└───────────────────┴───────────────┴───────────────┴─────────────┘
+```
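+
+The `_table` virtual column can also appear in the `WHERE` clause to restrict which underlying tables are considered. As a sketch, the following query looks only at the 1980s table while still going through `merge`:
+
+```sql
+-- Keep the merge() pattern broad, but only return rows from one table.
+SELECT _table, loser_name, score
+FROM merge('atp_matches*')
+WHERE _table = 'atp_matches_1980s'
+AND winner_name = 'John McEnroe'
+AND loser_seed = 1;
+```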
+
+We could also use this virtual column as part of a query to count the values for the `walkover` column:
+
+
+```sql
+SELECT _table, walkover, count()
+FROM merge('atp_matches*')
+GROUP BY ALL
+ORDER BY _table;
+```
+
+```text
+┌─_table────────────┬─walkover─┬─count()─┐
+│ atp_matches_1960s │ ᴺᵁᴸᴸ     │    7542 │
+│ atp_matches_1970s │ ᴺᵁᴸᴸ     │   39165 │
+│ atp_matches_1980s │ ᴺᵁᴸᴸ     │   36233 │
+│ atp_matches_1990s │ true     │     128 │
+│ atp_matches_1990s │ false    │   37022 │
+└───────────────────┴──────────┴─────────┘
+```
+
+We can see that the `walkover` column is `NULL` for everything except `atp_matches_1990s`.
+We'll need to update our query to check whether the `score` column contains the string `W/O` if the `walkover` column is `NULL`:
+
+
+```sql
+SELECT _table,
+   multiIf(
+     walkover IS NOT NULL,
+     walkover,
+     variantType(score) = 'Array(String)',
+     toBool(arrayExists(
+        x -> position(x, 'W/O') > 0,
+        variantElement(score, 'Array(String)')
+     )),
+     variantElement(score, 'String') LIKE '%W/O%'
+   ),
+   count()
+FROM merge('atp_matches*')
+GROUP BY ALL
+ORDER BY _table;
+```
+
+If the underlying type of `score` is `Array(String)`, we have to go over the array and look for `W/O`, whereas if its type is `String`, we can just search for `W/O` in the string.
+
+
+```text
+┌─_table────────────┬─multiIf(isNo⋯, '%W/O%'))─┬─count()─┐
+│ atp_matches_1960s │ true                     │     242 │
+│ atp_matches_1960s │ false                    │    7300 │
+│ atp_matches_1970s │ true                     │     422 │
+│ atp_matches_1970s │ false                    │   38743 │
+│ atp_matches_1980s │ true                     │      92 │
+│ atp_matches_1980s │ false                    │   36141 │
+│ atp_matches_1990s │ true                     │     128 │
+│ atp_matches_1990s │ false                    │   37022 │
+└───────────────────┴──────────────────────────┴─────────┘
+```
diff --git a/scripts/aspell-ignore/en/aspell-dict.txt b/scripts/aspell-ignore/en/aspell-dict.txt
@@ -3558,3 +3558,4 @@ columnstore
 TiDB
 resync
 resynchronization
+Sackmann's
diff --git a/sidebars.js b/sidebars.js
@@ -1184,6 +1184,7 @@ const sidebars = {
       link: { type: "doc", id: "guides/developer/index" },
       items: [
         "guides/developer/dynamic-column-selection",
+        "guides/developer/merge-table-function",
         "guides/developer/alternative-query-languages",
         "guides/developer/cascading-materialized-views",
         "guides/developer/debugging-memory-issues",