| 1 | +--- |
| 2 | +slug: /guides/developer/merge-table-function |
| 3 | +sidebar_label: 'Merge table function' |
| 4 | +title: 'Merge table function' |
| 5 | +description: 'Query multiple tables at the same time.' |
| 6 | +--- |
| 7 | + |
| 8 | +The [merge table function](https://clickhouse.com/docs/sql-reference/table-functions/merge) lets us query multiple tables in parallel. |
| 9 | +It does this by creating a temporary [Merge](https://clickhouse.com/docs/engines/table-engines/special/merge) table whose structure is derived by taking the union of the underlying tables' columns and deriving common types. |
| 10 | + |
| 11 | +<iframe width="768" height="432" src="https://www.youtube.com/embed/b4YfRhD9SSI?si=MuoDwDWeikAV5ttk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe> |
| 12 | + |
| 13 | +## Set up tables {#setup-tables} |
| 14 | + |
| 15 | +We're going to learn how to use this function with help from [Jeff Sackmann's tennis dataset](https://github.com/JeffSackmann/tennis_atp). |
| 16 | +We're going to process CSV files that contain matches going back to the 1960s, but we'll create a slightly different schema for each decade. |
| 17 | +We'll also add a couple of extra columns for the 1990s decade. |
| 18 | + |
| 19 | +The import statements are shown below: |
| 20 | + |
| 21 | +```sql |
| 22 | +CREATE OR REPLACE TABLE atp_matches_1960s ORDER BY tourney_id AS |
| 23 | +SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, score |
| 24 | +FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1968..1969}.csv') |
| 25 | +SETTINGS schema_inference_make_columns_nullable=0, |
| 26 | + schema_inference_hints='winner_seed Nullable(String), loser_seed Nullable(UInt8)'; |
| 27 | + |
| 28 | +CREATE OR REPLACE TABLE atp_matches_1970s ORDER BY tourney_id AS |
| 29 | +SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score |
| 30 | +FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1970..1979}.csv') |
| 31 | +SETTINGS schema_inference_make_columns_nullable=0, |
| 32 | + schema_inference_hints='winner_seed Nullable(UInt8), loser_seed Nullable(UInt8)'; |
| 33 | + |
| 34 | +CREATE OR REPLACE TABLE atp_matches_1980s ORDER BY tourney_id AS |
| 35 | +SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score |
| 36 | +FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1980..1989}.csv') |
| 37 | +SETTINGS schema_inference_make_columns_nullable=0, |
| 38 | + schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16)'; |
| 39 | + |
| 40 | +CREATE OR REPLACE TABLE atp_matches_1990s ORDER BY tourney_id AS |
| 41 | +SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score, |
| 42 | + toBool(arrayExists(x -> position(x, 'W/O') > 0, score))::Nullable(bool) AS walkover, |
| 43 | + toBool(arrayExists(x -> position(x, 'RET') > 0, score))::Nullable(bool) AS retirement |
| 44 | +FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1990..1999}.csv') |
| 45 | +SETTINGS schema_inference_make_columns_nullable=0, |
| 46 | + schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16), surface Enum(\'Hard\', \'Grass\', \'Clay\', \'Carpet\')'; |
| 47 | +``` |
| 48 | + |
| 49 | +## Schema of multiple tables {#schema-multiple-tables} |
| 50 | + |
| 51 | +We can run the following query to list the columns in each table along with their types side by side, so that it's easier to see the differences. |
| 52 | + |
| 53 | +```sql |
| 54 | +SELECT * EXCEPT(position) FROM ( |
| 55 | + SELECT position, name, |
| 56 | + any(if(table = 'atp_matches_1960s', type, null)) AS 1960s, |
| 57 | + any(if(table = 'atp_matches_1970s', type, null)) AS 1970s, |
| 58 | + any(if(table = 'atp_matches_1980s', type, null)) AS 1980s, |
| 59 | + any(if(table = 'atp_matches_1990s', type, null)) AS 1990s |
| 60 | + FROM system.columns |
| 61 | + WHERE database = currentDatabase() AND table LIKE 'atp_matches%' |
| 62 | + GROUP BY ALL |
| 63 | + ORDER BY position ASC |
| 64 | +) |
| 65 | +SETTINGS output_format_pretty_max_value_width=25; |
| 66 | +``` |
| 67 | + |
| 68 | +```text |
| 69 | +┌─name────────┬─1960s────────────┬─1970s───────────┬─1980s────────────┬─1990s─────────────────────┐ |
| 70 | +│ tourney_id │ String │ String │ String │ String │ |
| 71 | +│ surface │ String │ String │ String │ Enum8('Hard' = 1, 'Grass'⋯│ |
| 72 | +│ winner_name │ String │ String │ String │ String │ |
| 73 | +│ loser_name │ String │ String │ String │ String │ |
| 74 | +│ winner_seed │ Nullable(String) │ Nullable(UInt8) │ Nullable(UInt16) │ Nullable(UInt16) │ |
| 75 | +│ loser_seed │ Nullable(UInt8) │ Nullable(UInt8) │ Nullable(UInt16) │ Nullable(UInt16) │ |
| 76 | +│ score │ String │ Array(String) │ Array(String) │ Array(String) │ |
| 77 | +│ walkover │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ Nullable(Bool) │ |
| 78 | +│ retirement │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ Nullable(Bool) │ |
| 79 | +└─────────────┴──────────────────┴─────────────────┴──────────────────┴───────────────────────────┘ |
| 80 | +``` |
| 81 | + |
| 82 | +Let's go through the differences: |
| 83 | + |
| 84 | +* 1970s changes the type of `winner_seed` from `Nullable(String)` to `Nullable(UInt8)` and `score` from `String` to `Array(String)`. |
| 85 | +* 1980s changes `winner_seed` and `loser_seed` from `Nullable(UInt8)` to `Nullable(UInt16)`. |
| 86 | +* 1990s changes `surface` from `String` to `Enum('Hard', 'Grass', 'Clay', 'Carpet')` and adds the `walkover` and `retirement` columns. |
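
Before querying across these tables, it can help to look at the structure that `merge` derives for the combined table. The statement below is a minimal sketch that assumes the four tables above exist in the current database; the exact types reported may depend on your ClickHouse version, so no output is shown here.

```sql
-- Show the column names and types of the temporary Merge table
-- that merge() builds over every table matching the pattern.
DESCRIBE TABLE merge('atp_matches*');
```

Columns that have different types in the underlying tables, such as `winner_seed` and `score`, are the ones to watch for in the queries that follow.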
| 87 | + |
| 88 | +## Querying multiple tables with merge {#querying-multiple-tables} |
| 89 | + |
| 90 | +Let's write a query to find the matches that John McEnroe won against someone who was seeded #1: |
| 91 | + |
| 92 | +```sql |
| 93 | +SELECT loser_name, score |
| 94 | +FROM merge('atp_matches*') |
| 95 | +WHERE winner_name = 'John McEnroe' |
| 96 | +AND loser_seed = 1; |
| 97 | +``` |
| 98 | + |
| 99 | +```text |
| 100 | +┌─loser_name────┬─score───────────────────────────┐ |
| 101 | +│ Bjorn Borg │ ['6-3','6-4'] │ |
| 102 | +│ Bjorn Borg │ ['7-6','6-1','6-7','5-7','6-4'] │ |
| 103 | +│ Bjorn Borg │ ['7-6','6-4'] │ |
| 104 | +│ Bjorn Borg │ ['4-6','7-6','7-6','6-4'] │ |
| 105 | +│ Jimmy Connors │ ['6-1','6-3'] │ |
| 106 | +│ Ivan Lendl │ ['6-2','4-6','6-3','6-7','7-6'] │ |
| 107 | +│ Ivan Lendl │ ['6-3','3-6','6-3','7-6'] │ |
| 108 | +│ Ivan Lendl │ ['6-1','6-3'] │ |
| 109 | +│ Stefan Edberg │ ['6-2','6-3'] │ |
| 110 | +│ Stefan Edberg │ ['7-6','6-2'] │ |
| 111 | +│ Stefan Edberg │ ['6-2','6-2'] │ |
| 112 | +│ Jakob Hlasek │ ['6-3','7-6'] │ |
| 113 | +└───────────────┴─────────────────────────────────┘ |
| 114 | +``` |
| 115 | + |
| 116 | +Next, let's say we want to filter those matches to find the ones where McEnroe was seeded #3 or lower (that is, a seed number of 3 or greater). |
| 117 | +This is a bit trickier because `winner_seed` uses different types across the various tables: |
| 118 | + |
| 119 | +```sql |
| 120 | +SELECT loser_name, score, winner_seed |
| 121 | +FROM merge('atp_matches*') |
| 122 | +WHERE winner_name = 'John McEnroe' |
| 123 | +AND loser_seed = 1 |
| 124 | +AND multiIf( |
| 125 | + variantType(winner_seed) = 'UInt8', variantElement(winner_seed, 'UInt8') >= 3, |
| 126 | + variantType(winner_seed) = 'UInt16', variantElement(winner_seed, 'UInt16') >= 3, |
| 127 | + variantElement(winner_seed, 'String')::UInt16 >= 3 |
| 128 | +); |
| 129 | +``` |
| 130 | + |
| 131 | +We use the [`variantType`](/docs/sql-reference/functions/other-functions#varianttype) function to check the type of `winner_seed` for each row and then [`variantElement`](/docs/sql-reference/functions/other-functions#variantelement) to extract the underlying value. |
| 132 | +When the type is `String`, we cast to a number and then do the comparison. |
| 133 | +The result of running the query is shown below: |
| 134 | + |
| 135 | +```text |
| 136 | +┌─loser_name────┬─score─────────┬─winner_seed─┐ |
| 137 | +│ Bjorn Borg │ ['6-3','6-4'] │ 3 │ |
| 138 | +│ Stefan Edberg │ ['6-2','6-3'] │ 6 │ |
| 139 | +│ Stefan Edberg │ ['7-6','6-2'] │ 4 │ |
| 140 | +│ Stefan Edberg │ ['6-2','6-2'] │ 7 │ |
| 141 | +└───────────────┴───────────────┴─────────────┘ |
| 142 | +``` |
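
To check which underlying types actually occur in `winner_seed` before writing a `multiIf` like the one above, we could group by the result of `variantType`. This is a sketch rather than part of the original walkthrough; the aliases `seed_type` and `matches` are illustrative names, and rows where the seed is `NULL` are expected to be reported under the type name `None`.

```sql
-- Count rows per underlying type of the winner_seed column.
SELECT variantType(winner_seed) AS seed_type, count() AS matches
FROM merge('atp_matches*')
GROUP BY seed_type
ORDER BY matches DESC;
```

Each type name returned here, other than `None`, should be handled by a branch of the `multiIf`.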
| 143 | + |
| 144 | +## Which table do rows come from when using merge? {#which-table-merge} |
| 145 | + |
| 146 | +What if we want to know which table rows come from? |
| 147 | +We can use the `_table` virtual column to do this, as shown in the following query: |
| 148 | + |
| 149 | +```sql |
| 150 | +SELECT _table, loser_name, score, winner_seed |
| 151 | +FROM merge('atp_matches*') |
| 152 | +WHERE winner_name = 'John McEnroe' |
| 153 | +AND loser_seed = 1 |
| 154 | +AND multiIf( |
| 155 | + variantType(winner_seed) = 'UInt8', variantElement(winner_seed, 'UInt8') >= 3, |
| 156 | + variantType(winner_seed) = 'UInt16', variantElement(winner_seed, 'UInt16') >= 3, |
| 157 | + variantElement(winner_seed, 'String')::UInt16 >= 3 |
| 158 | +); |
| 159 | +``` |
| 160 | + |
| 161 | +```text |
| 162 | +┌─_table────────────┬─loser_name────┬─score─────────┬─winner_seed─┐ |
| 163 | +│ atp_matches_1970s │ Bjorn Borg │ ['6-3','6-4'] │ 3 │ |
| 164 | +│ atp_matches_1980s │ Stefan Edberg │ ['6-2','6-3'] │ 6 │ |
| 165 | +│ atp_matches_1980s │ Stefan Edberg │ ['7-6','6-2'] │ 4 │ |
| 166 | +│ atp_matches_1980s │ Stefan Edberg │ ['6-2','6-2'] │ 7 │ |
| 167 | +└───────────────────┴───────────────┴───────────────┴─────────────┘ |
| 168 | +``` |
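
The `_table` virtual column can also be used in a `WHERE` clause to restrict a query to a subset of the matched tables. The following sketch, which assumes the same four tables and uses an illustrative `matches` alias, counts rows in only the 1980s and 1990s tables:

```sql
-- Restrict the merged query to two of the underlying tables.
SELECT _table, count() AS matches
FROM merge('atp_matches*')
WHERE _table IN ('atp_matches_1980s', 'atp_matches_1990s')
GROUP BY _table
ORDER BY _table;
```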
| 169 | + |
| 170 | +We could also use this virtual column as part of a query to count the values for the `walkover` column: |
| 171 | + |
| 172 | + |
| 173 | +```sql |
| 174 | +SELECT _table, walkover, count() |
| 175 | +FROM merge('atp_matches*') |
| 176 | +GROUP BY ALL |
| 177 | +ORDER BY _table; |
| 178 | +``` |
| 179 | + |
| 180 | +```text |
| 181 | +┌─_table────────────┬─walkover─┬─count()─┐ |
| 182 | +│ atp_matches_1960s │ ᴺᵁᴸᴸ │ 7542 │ |
| 183 | +│ atp_matches_1970s │ ᴺᵁᴸᴸ │ 39165 │ |
| 184 | +│ atp_matches_1980s │ ᴺᵁᴸᴸ │ 36233 │ |
| 185 | +│ atp_matches_1990s │ true │ 128 │ |
| 186 | +│ atp_matches_1990s │ false │ 37022 │ |
| 187 | +└───────────────────┴──────────┴─────────┘ |
| 188 | +``` |
| 189 | + |
| 190 | +We can see that the `walkover` column is `NULL` for everything except `atp_matches_1990s`. |
| 191 | +We'll need to update our query to check whether the `score` column contains the string `W/O` if the `walkover` column is `NULL`: |
| 192 | + |
| 193 | + |
| 194 | +```sql |
| 195 | +SELECT _table, |
| 196 | + multiIf( |
| 197 | + walkover IS NOT NULL, |
| 198 | + walkover, |
| 199 | + variantType(score) = 'Array(String)', |
| 200 | + toBool(arrayExists( |
| 201 | + x -> position(x, 'W/O') > 0, |
| 202 | + variantElement(score, 'Array(String)') |
| 203 | + )), |
| 204 | + variantElement(score, 'String') LIKE '%W/O%' |
| 205 | + ), |
| 206 | + count() |
| 207 | +FROM merge('atp_matches*') |
| 208 | +GROUP BY ALL |
| 209 | +ORDER BY _table; |
| 210 | +``` |
| 211 | + |
| 212 | +If the underlying type of `score` is `Array(String)`, we have to iterate over the array looking for `W/O`, whereas if it is a `String`, we can search for `W/O` in the string directly. |
| 213 | + |
| 214 | + |
| 215 | +```text |
| 216 | +┌─_table────────────┬─multiIf(isNo⋯, '%W/O%'))─┬─count()─┐ |
| 217 | +│ atp_matches_1960s │ true │ 242 │ |
| 218 | +│ atp_matches_1960s │ false │ 7300 │ |
| 219 | +│ atp_matches_1970s │ true │ 422 │ |
| 220 | +│ atp_matches_1970s │ false │ 38743 │ |
| 221 | +│ atp_matches_1980s │ true │ 92 │ |
| 222 | +│ atp_matches_1980s │ false │ 36141 │ |
| 223 | +│ atp_matches_1990s │ true │ 128 │ |
| 224 | +│ atp_matches_1990s │ false │ 37022 │ |
| 225 | +└───────────────────┴──────────────────────────┴─────────┘ |
| 226 | +``` |
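
The generated column header in the output above is hard to read. One option is to give the `multiIf` expression an alias. The sketch below is the same query with a hypothetical alias, `is_walkover`, which is not part of the original example:

```sql
SELECT _table,
    multiIf(
        -- Use the precomputed flag where the table has one.
        walkover IS NOT NULL,
        walkover,
        -- Otherwise look for 'W/O' in the array form of score.
        variantType(score) = 'Array(String)',
        toBool(arrayExists(
            x -> position(x, 'W/O') > 0,
            variantElement(score, 'Array(String)')
        )),
        -- Fall back to the plain string form of score.
        variantElement(score, 'String') LIKE '%W/O%'
    ) AS is_walkover,
    count()
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table, is_walkover;
```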