---
description: 'Hybrid unions multiple data sources behind per-layer predicates so queries behave like a single table while data is migrated or tiered.'
slug: /engines/table-engines/special/tiered-distributed
title: 'Hybrid Table Engine'
sidebar_label: 'Hybrid'
sidebar_position: 11
---

# Hybrid table engine

`Hybrid` builds on top of the [Distributed](./distributed.md) table engine. It lets you expose several data sources as one logical table and assign every source its own predicate.
The engine rewrites incoming queries so that each layer receives the original query plus its predicate. This keeps all of the Distributed optimisations (remote aggregation, `optimize_skip_unused_shards`,
global JOIN pushdown, and so on) while you duplicate or migrate data across clusters, storage types, or formats.

It keeps the same execution pipeline as `engine=Distributed` but can read from multiple underlying sources simultaneously—similar to `engine=Merge`—while still pushing logic down to each source.
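
To make the rewrite concrete, here is a simplified sketch of what a two-layer table conceptually executes. The table name, layer sources, and predicates below are illustrative, and the engine performs this rewrite internally rather than issuing separate statements:

```sql
-- Query issued against the Hybrid table
SELECT count() FROM hybrid_table WHERE user_id = 42;

-- What each layer conceptually receives: the original query with
-- that layer's predicate ANDed into the WHERE clause
SELECT count() FROM layer_1_source WHERE (user_id = 42) AND (date >= '2025-01-01');
SELECT count() FROM layer_2_source WHERE (user_id = 42) AND (date < '2025-01-01');
```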

Typical use cases include:

- Zero-downtime migrations where "old" and "new" replicas temporarily overlap.
- Tiered storage, for example fresh data on a local cluster and historical data in S3.
- Gradual roll-outs where only a subset of rows should be served from a new backend.

By giving mutually exclusive predicates to the layers (for example, `date < watermark` and `date >= watermark`), you ensure that each row is read from exactly one source.
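
For example, a hypothetical migration split at a fixed watermark could be declared like this (the cluster addresses, table names, and watermark date are illustrative, and `events_local` is assumed to exist so its schema can be cloned):

```sql
-- Each row satisfies exactly one predicate, so it is read from exactly one layer
CREATE TABLE events ENGINE = Hybrid(
    remote('new-cluster:9000', currentDatabase(), 'events_local'), date >= '2025-01-01',
    remote('old-cluster:9000', currentDatabase(), 'events_local'), date < '2025-01-01'
) AS events_local;
```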

## Engine definition

```sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name
(
    column1 type1,
    column2 type2,
    ...
)
ENGINE = Hybrid(table_function_1, predicate_1 [, table_function_2, predicate_2 ...])
```

You must pass at least two arguments – the first table function and its predicate. Additional sources are appended as `table_function, predicate` pairs. The first table function is also used for `INSERT` statements.

### Arguments and behaviour

- `table_function_n` must be a valid table function (for example `remote`, `remoteSecure`, `cluster`, `clusterAllReplicas`, `s3Cluster`) or a fully qualified table name (`database.table`). The first argument must be a table function—such as `remote` or `cluster`—because it instantiates the underlying `Distributed` storage.
- `predicate_n` must be an expression that can be evaluated on the table columns. The engine adds it to the layer's query with an additional `AND`, so expressions like `event_date >= '2025-09-01'` or `id BETWEEN 10 AND 15` are typical.
- The query planner picks the same processing stage for every layer as it does for the base `Distributed` plan, so remote aggregation, ORDER BY pushdown, `optimize_skip_unused_shards`, and the legacy/analyzer execution modes behave the same way.
- `INSERT` statements are forwarded to the first table function only. If you need multi-destination writes, use explicit `INSERT` statements into the respective sources, as sketched after this list.
- Align schemas across the layers. ClickHouse builds a common header; if the physical types differ you may need to add casts on one side or in the query, just as you would when reading from heterogeneous replicas.
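
For multi-destination writes, a minimal sketch is to address each underlying source directly. The cluster addresses, table names, and the `staging_batch` table holding the rows are assumptions for illustration:

```sql
-- INSERTs into the Hybrid table itself reach only the first layer,
-- so write to each source explicitly when both must receive the rows.
INSERT INTO FUNCTION remote('new-cluster:9000', currentDatabase(), 'events_local')
SELECT * FROM staging_batch;

INSERT INTO FUNCTION remote('old-cluster:9000', currentDatabase(), 'events_local')
SELECT * FROM staging_batch;
```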

## Example: local cluster plus S3 historical tier

The following commands illustrate a two-layer layout. Hot data stays on a local ClickHouse cluster, while historical rows come from public S3 Parquet files.

```sql
-- Local MergeTree table that keeps current data
CREATE OR REPLACE TABLE btc_blocks_local
(
    `hash` FixedString(64),
    `version` Int64,
    `mediantime` DateTime64(9),
    `nonce` Int64,
    `bits` FixedString(8),
    `difficulty` Float64,
    `chainwork` FixedString(64),
    `size` Int64,
    `weight` Int64,
    `coinbase_param` String,
    `number` Int64,
    `transaction_count` Int64,
    `merkle_root` FixedString(64),
    `stripped_size` Int64,
    `timestamp` DateTime64(9),
    `date` Date
)
ENGINE = MergeTree
ORDER BY (timestamp)
PARTITION BY toYYYYMM(date);

-- Hybrid table that unions the local shard with historical data in S3
CREATE OR REPLACE TABLE btc_blocks ENGINE = Hybrid(
    remote('localhost:9000', currentDatabase(), 'btc_blocks_local'), date >= '2025-09-01',
    s3('s3://aws-public-blockchain/v1.0/btc/blocks/**.parquet', NOSIGN), date < '2025-09-01'
) AS btc_blocks_local;

-- Writes target the first (remote) layer
INSERT INTO btc_blocks
SELECT *
FROM s3('s3://aws-public-blockchain/v1.0/btc/blocks/**.parquet', NOSIGN)
WHERE date BETWEEN '2025-09-01' AND '2025-09-30';

-- Reads combine both layers according to their predicates
SELECT * FROM btc_blocks WHERE date = '2025-08-01'; -- data from S3
SELECT * FROM btc_blocks WHERE date = '2025-09-05'; -- data from MergeTree (TODO: still analyzes the S3 layer)
SELECT * FROM btc_blocks WHERE date IN ('2025-08-31', '2025-09-01'); -- data from both sources, each row read exactly once

-- Run analytic queries as usual
SELECT
    date,
    count(),
    uniqExact(CAST(hash, 'Nullable(String)')) AS hashes,
    sum(CAST(number, 'Nullable(Int64)')) AS blocks_seen
FROM btc_blocks
WHERE date BETWEEN '2025-08-01' AND '2025-09-30'
GROUP BY date
ORDER BY date;
```
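
To check how a query is planned across the layers, `EXPLAIN` can be used exactly as with a `Distributed` table (output omitted here; per the TODO note above, the S3 layer may still appear in the plan even for dates only the local layer can match):

```sql
EXPLAIN SELECT count()
FROM btc_blocks
WHERE date = '2025-09-05';
```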

Because the predicates are applied inside every layer, constructs such as `ORDER BY`, `GROUP BY`, `LIMIT`, `JOIN`, and `EXPLAIN` behave as if you were reading from a single `Distributed` table. When sources expose different physical types (for example `FixedString(64)` versus `String` in Parquet), add explicit casts during ingestion or in the query, as shown above.
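
For instance, if the local layer stores `hash` as `FixedString(64)` while the Parquet files expose it as `String`, a query-side cast keeps the layer headers compatible (a sketch reusing the tables from the example above):

```sql
-- Normalise the column type at query time so both layers return the same header
SELECT CAST(hash, 'String') AS hash, date
FROM btc_blocks
WHERE date IN ('2025-08-31', '2025-09-01');
```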