Commit 84e3805

Merge branch 'antalya-25.3' into list_objects_object_storage_cache_25.3
2 parents 5cd76fc + 482b406 commit 84e3805

30 files changed (+1418 / -231 lines changed)

docs/en/engines/table-engines/integrations/iceberg.md

Lines changed: 38 additions & 1 deletion
@@ -85,7 +85,7 @@ To read a table where the schema has changed after its creation with dynamic sch
8585

8686
## Partition Pruning {#partition-pruning}
8787

88-
ClickHouse supports partition pruning during SELECT queries for Iceberg tables, which helps optimize query performance by skipping irrelevant data files. Now it works with only identity transforms and time-based transforms (hour, day, month, year). To enable partition pruning, set `use_iceberg_partition_pruning = 1`.
88+
ClickHouse supports partition pruning during SELECT queries for Iceberg tables, which helps optimize query performance by skipping irrelevant data files. To enable partition pruning, set `use_iceberg_partition_pruning = 1`. For more information about Iceberg partition pruning, see https://iceberg.apache.org/spec/#partitioning
8989

9090

9191
## Time Travel {#time-travel}
@@ -247,6 +247,43 @@ The second one is that while doing time travel you can't get state of table befo
247247

248248
In ClickHouse, the behavior is consistent with Spark. You can mentally replace Spark SELECT queries with ClickHouse SELECT queries and they will work the same way.
249249

250+
## Metadata File Resolution {#metadata-file-resolution}
251+
When using the `Iceberg` table engine in ClickHouse, the system needs to locate the correct metadata.json file that describes the Iceberg table structure. Here's how this resolution process works:
252+
253+
### Candidate Search (in Priority Order) {#candidate-search}
254+
255+
1. **Direct Path Specification**:
256+
* If you set `iceberg_metadata_file_path`, the system will use this exact path by combining it with the Iceberg table directory path.
257+
* When this setting is provided, all other resolution settings are ignored.
258+
259+
2. **Table UUID Matching**:
260+
* If `iceberg_metadata_table_uuid` is specified, the system will:
261+
* Look only at `.metadata.json` files in the `metadata` directory
262+
* Filter for files containing a `table-uuid` field matching your specified UUID (case-insensitive)
263+
264+
3. **Default Search**:
265+
* If neither of the above settings is provided, all `.metadata.json` files in the `metadata` directory become candidates
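
The priority order above can be sketched in Python. This is only an illustration of the documented rules, not ClickHouse's actual implementation: the function name `find_candidates`, its parameters, and the in-memory `files` dictionary (standing in for an object storage listing) are all hypothetical, and restricting the search to direct children of `metadata` is an assumption.

```python
import json


def find_candidates(files, table_dir, metadata_file_path=None, table_uuid=None):
    """Return candidate metadata file paths (hypothetical helper).

    files: dict mapping object path -> file content as text; a stand-in
           for listing and reading objects from storage.
    table_dir: path of the Iceberg table directory.
    """
    base = table_dir.rstrip("/")

    # 1. Direct path: combined with the table directory; other settings are ignored.
    if metadata_file_path is not None:
        return [base + "/" + metadata_file_path.lstrip("/")]

    # 3. Default search: every `.metadata.json` file in the `metadata` directory.
    #    Assumption: only direct children of `metadata` are considered.
    prefix = base + "/metadata/"
    candidates = [
        path
        for path in files
        if path.startswith(prefix)
        and "/" not in path[len(prefix):]
        and path.endswith(".metadata.json")
    ]

    # 2. UUID matching: keep files whose `table-uuid` matches, ignoring case.
    if table_uuid is not None:
        wanted = table_uuid.lower()
        candidates = [
            path
            for path in candidates
            if str(json.loads(files[path]).get("table-uuid", "")).lower() == wanted
        ]

    return sorted(candidates)
```

Checking the direct path first mirrors the rule that `iceberg_metadata_file_path` overrides every other resolution setting.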
266+
267+
### Selecting the Most Recent File {#most-recent-file}
268+
269+
After identifying candidate files using the above rules, the system determines which one is the most recent:
270+
271+
* If `iceberg_recent_metadata_file_by_last_updated_ms_field` is enabled:
272+
* The file with the largest `last-updated-ms` value is selected
273+
274+
* Otherwise:
275+
* The file with the highest version number is selected
276+
* (Version appears as `V` in filenames formatted as `V.metadata.json` or `V-uuid.metadata.json`)
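
The selection rule above can be sketched as follows. This is a hedged illustration rather than the engine's real code: `version_of` and `pick_most_recent` are hypothetical names, the candidates are assumed to be already parsed into dictionaries, and accepting an optional leading `v` in the filename (as in `v3.metadata.json`) is an assumption not stated in the text.

```python
import re

# Matches `V.metadata.json` or `V-uuid.metadata.json`, where V is a number.
# Assumption: an optional leading "v" (e.g. `v3.metadata.json`) is accepted.
_VERSION_RE = re.compile(r"^v?(\d+)(?:-[^/]*)?\.metadata\.json$")


def version_of(path):
    """Extract the numeric version from a metadata file name."""
    name = path.rsplit("/", 1)[-1]
    match = _VERSION_RE.match(name)
    if match is None:
        raise ValueError(f"not a versioned metadata file: {path}")
    return int(match.group(1))


def pick_most_recent(candidates, by_last_updated_ms=False):
    """Pick the most recent metadata file (hypothetical helper).

    candidates: dict mapping file path -> parsed metadata (a dict).
    by_last_updated_ms: models `iceberg_recent_metadata_file_by_last_updated_ms_field`.
    """
    if not candidates:
        raise ValueError("no candidate metadata files")
    if by_last_updated_ms:
        # Largest `last-updated-ms` value wins.
        return max(candidates, key=lambda p: candidates[p]["last-updated-ms"])
    # Otherwise the highest version number in the file name wins.
    return max(candidates, key=version_of)
```

The two strategies can disagree, for example when a file with a lower version number was written later, which is why the choice is exposed as a setting.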
277+
278+
**Note**: All mentioned settings are engine-level settings and must be specified during table creation as shown below:
279+
280+
```sql
281+
CREATE TABLE example_table ENGINE = Iceberg(
282+
's3://bucket/path/to/iceberg_table'
283+
) SETTINGS iceberg_metadata_table_uuid = '6f6f6407-c6a5-465f-a808-ea8900e35a38';
284+
```
285+
286+
**Note**: While Iceberg Catalogs typically handle metadata resolution, the `Iceberg` table engine in ClickHouse directly interprets files stored in S3 as Iceberg tables, which is why understanding these resolution rules is important.
250287

251288
## Data cache {#data-cache}
252289

docs/en/sql-reference/table-functions/iceberg.md

Lines changed: 39 additions & 1 deletion
@@ -78,7 +78,7 @@ Currently, it is not possible to change nested structures or the types of elemen
7878

7979
## Partition Pruning {#partition-pruning}
8080

81-
ClickHouse supports partition pruning during SELECT queries for Iceberg tables, which helps optimize query performance by skipping irrelevant data files. Now it works with only identity transforms and time-based transforms (hour, day, month, year). To enable partition pruning, set `use_iceberg_partition_pruning = 1`.
81+
ClickHouse supports partition pruning during SELECT queries for Iceberg tables, which helps optimize query performance by skipping irrelevant data files. To enable partition pruning, set `use_iceberg_partition_pruning = 1`. For more information about Iceberg partition pruning, see https://iceberg.apache.org/spec/#partitioning
8282

8383

8484
## Time Travel {#time-travel}
@@ -239,6 +239,44 @@ The second one is that while doing time travel you can't get state of table befo
239239

240240
In ClickHouse, the behavior is consistent with Spark. You can mentally replace Spark SELECT queries with ClickHouse SELECT queries and they will work the same way.
241241

242+
## Metadata File Resolution {#metadata-file-resolution}
243+
244+
When using the `iceberg` table function in ClickHouse, the system needs to locate the correct metadata.json file that describes the Iceberg table structure. Here's how this resolution process works:
245+
246+
### Candidate Search (in Priority Order) {#candidate-search}
247+
248+
1. **Direct Path Specification**:
249+
* If you set `iceberg_metadata_file_path`, the system will use this exact path by combining it with the Iceberg table directory path.
250+
* When this setting is provided, all other resolution settings are ignored.
251+
252+
2. **Table UUID Matching**:
253+
* If `iceberg_metadata_table_uuid` is specified, the system will:
254+
* Look only at `.metadata.json` files in the `metadata` directory
255+
* Filter for files containing a `table-uuid` field matching your specified UUID (case-insensitive)
256+
257+
3. **Default Search**:
258+
* If neither of the above settings is provided, all `.metadata.json` files in the `metadata` directory become candidates
259+
260+
### Selecting the Most Recent File {#most-recent-file}
261+
262+
After identifying candidate files using the above rules, the system determines which one is the most recent:
263+
264+
* If `iceberg_recent_metadata_file_by_last_updated_ms_field` is enabled:
265+
* The file with the largest `last-updated-ms` value is selected
266+
267+
* Otherwise:
268+
* The file with the highest version number is selected
269+
* (Version appears as `V` in filenames formatted as `V.metadata.json` or `V-uuid.metadata.json`)
270+
271+
**Note**: All mentioned settings are table function settings (not global or query-level settings) and must be specified as shown below:
272+
273+
```sql
274+
SELECT * FROM iceberg('s3://bucket/path/to/iceberg_table',
275+
SETTINGS iceberg_metadata_table_uuid = 'a90eed4c-f74b-4e5b-b630-096fb9d09021');
276+
```
277+
278+
**Note**: While Iceberg Catalogs typically handle metadata resolution, the `iceberg` table function in ClickHouse directly interprets files stored in S3 as Iceberg tables, which is why understanding these resolution rules is important.
279+
242280
## Metadata cache {#metadata-cache}
243281

244282
The `Iceberg` table engine and table function support a metadata cache that stores information from manifest files, manifest lists, and the metadata JSON. The cache is stored in memory. This feature is controlled by the setting `use_iceberg_metadata_files_cache`, which is enabled by default.
