`Hybrid` builds on top of the [Distributed](./distributed.md) table engine. It lets you expose several data sources as one logical table and assign every source its own predicate.

The engine rewrites incoming queries so that each layer receives the original query plus its predicate. This keeps all of the Distributed optimisations (remote aggregation, `skip_unused_shards`, global JOIN pushdown, and so on) while you duplicate or migrate data across clusters, storage types, or formats.

It keeps the same execution pipeline as `engine=Distributed` but can read from multiple underlying sources simultaneously (similar to `engine=Merge`) while still pushing logic down to each source.
Typical use cases include:

- Zero-downtime migrations where "old" and "new" replicas temporarily overlap.
- Tiered storage, for example fresh data on a local cluster and historical data in S3.
- Gradual roll-outs where only a subset of rows should be served from a new backend.

By giving mutually exclusive predicates to the layers (for example, `date < watermark` and `date >= watermark`), you ensure that each row is read from exactly one source.
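
In DDL, the layers are passed as alternating `table_function, predicate` arguments. A minimal sketch of such a declaration for a zero-downtime migration, assuming a hypothetical `events_local` table that exists on both clusters (the cluster addresses are placeholders):

```sql
-- Hypothetical two-layer table: rows at or after the watermark are served by
-- the "new" cluster, older rows by the "old" one.
CREATE TABLE events AS events_local
ENGINE = Hybrid(
    remote('new-cluster:9000', currentDatabase(), 'events_local'), date >= '2025-01-01',
    remote('old-cluster:9000', currentDatabase(), 'events_local'), date < '2025-01-01'
);
```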
You must pass at least two arguments: the first table function and its predicate. Additional sources are appended as `table_function, predicate` pairs. The first table function is also used for `INSERT` statements.

### Arguments and behaviour

- `table_function_n` must be a valid table function (for example `remote`, `remoteSecure`, `cluster`, `clusterAllReplicas`, `s3Cluster`) or a fully qualified table name (`database.table`). The first argument must be a table function, such as `remote` or `cluster`, because it instantiates the underlying `Distributed` storage.
- `predicate_n` must be an expression that can be evaluated on the table columns. The engine adds it to the layer's query with an additional `AND`, so expressions like `event_date >= '2025-09-01'` or `id BETWEEN 10 AND 15` are typical; the sketch after this list shows the effective rewrite.
- The query planner picks the same processing stage for every layer as it does for the base `Distributed` plan, so remote aggregation, ORDER BY pushdown, `skip_unused_shards`, and the legacy/analyzer execution modes behave the same way.
- `INSERT` statements are forwarded to the first table function only. If you need multi-destination writes, use explicit `INSERT` statements into the respective sources (see the example after this list).
- Align schemas across the layers. ClickHouse builds a common header; if the physical types differ you may need to add casts on one side or in the query, just as you would when reading from heterogeneous replicas.
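
To make the predicate rewrite concrete, here is roughly what each layer of the hypothetical `events` table above receives. The rewritten SQL is internal to the engine, so this is illustrative only:

```sql
-- Query issued against the Hybrid table
SELECT count() FROM events WHERE user_id = 42;

-- Effective query for layer 1 (new cluster)
SELECT count() FROM events_local WHERE (user_id = 42) AND (date >= '2025-01-01');

-- Effective query for layer 2 (old cluster)
SELECT count() FROM events_local WHERE (user_id = 42) AND (date < '2025-01-01');
```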
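
Because only the first table function receives writes, dual-writing during a migration takes two explicit statements. A sketch reusing the hypothetical layers from above (`staging_batch` is a stand-in for your source of new rows):

```sql
-- Write the same batch to both backends explicitly
INSERT INTO FUNCTION remote('new-cluster:9000', currentDatabase(), 'events_local')
SELECT * FROM staging_batch;

INSERT INTO FUNCTION remote('old-cluster:9000', currentDatabase(), 'events_local')
SELECT * FROM staging_batch;
```
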
## Example: local cluster plus S3 historical tier

The following commands illustrate a two-layer layout. Hot data stays on a local ClickHouse cluster, while historical rows come from public S3 Parquet files.
```sql
-- Local MergeTree table that keeps current data
CREATE OR REPLACE TABLE btc_blocks_local
(
    `hash` FixedString(64),
    `version` Int64,
    `mediantime` DateTime64(9),
    `nonce` Int64,
    `bits` FixedString(8),
    `difficulty` Float64,
    `chainwork` FixedString(64),
    `size` Int64,
    `weight` Int64,
    `coinbase_param` String,
    `number` Int64,
    `transaction_count` Int64,
    `merkle_root` FixedString(64),
    `stripped_size` Int64,
    `timestamp` DateTime64(9),
    `date` Date
)
ENGINE = MergeTree
ORDER BY (timestamp)
PARTITION BY toYYYYMM(date);

-- Hybrid table that unions the local shard with historical data in S3
-- (sketch: the statement was truncated in the source; the remote() address
--  is a placeholder for your local shard)
CREATE OR REPLACE TABLE btc_blocks AS btc_blocks_local
ENGINE = Hybrid(
    remote('127.0.0.1', currentDatabase(), 'btc_blocks_local'), date >= '2025-09-01',
    s3('s3://aws-public-blockchain/v1.0/btc/blocks/**.parquet', NOSIGN), date < '2025-09-01'
);

-- Load the current month into the local tier
INSERT INTO btc_blocks_local
SELECT *
FROM s3('s3://aws-public-blockchain/v1.0/btc/blocks/**.parquet', NOSIGN)
WHERE date BETWEEN '2025-09-01' AND '2025-09-30';

-- Reads seamlessly combine both predicates
SELECT * FROM btc_blocks WHERE date = '2025-08-01';  -- data from S3
SELECT * FROM btc_blocks WHERE date = '2025-09-05';  -- data from MergeTree (TODO: still analyzes S3)
SELECT * FROM btc_blocks WHERE date IN ('2025-08-31', '2025-09-01');  -- data from both sources, each row read exactly once

-- Run analytic queries as usual
SELECT
    date,
    count(),
    uniqExact(CAST(hash, 'Nullable(String)')) AS hashes,
    sum(CAST(number, 'Nullable(Int64)')) AS blocks_seen
FROM btc_blocks
WHERE date BETWEEN '2025-08-01' AND '2025-09-30'
GROUP BY date
ORDER BY date;
```
Because the predicates are applied inside every layer, queries such as `ORDER BY`, `GROUP BY`, `LIMIT`, `JOIN`, and `EXPLAIN` behave as if you were reading from a single `Distributed` table. When sources expose different physical types (for example `FixedString(64)` versus `String` in Parquet), add explicit casts during ingestion or in the query, as shown above.
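
For instance, to confirm how a filtered query is planned across the layers, inspect it with `EXPLAIN` as you would for any other table:

```sql
-- Plan for a query that the date predicates route to the local tier
EXPLAIN SELECT count() FROM btc_blocks WHERE date = '2025-09-05';
```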