docs/en/engines/table-engines/integrations/iceberg.md
Lines changed: 37 additions & 0 deletions
@@ -247,6 +247,43 @@ The second one is that while doing time travel you can't get state of table befo
In ClickHouse the behavior is consistent with Spark. You can mentally replace Spark `SELECT` queries with ClickHouse `SELECT` queries and they will work the same way.

When using the `Iceberg` table engine in ClickHouse, the system needs to locate the correct `metadata.json` file that describes the Iceberg table structure. Here's how this resolution process works:

### Candidate Search (in Priority Order) {#candidate-search}

1. **Direct Path Specification**:
   * If you set `iceberg_metadata_file_path`, the system uses this exact path, combining it with the Iceberg table directory path.
   * When this setting is provided, all other resolution settings are ignored.

2. **Table UUID Matching**:
   * If `iceberg_metadata_table_uuid` is specified, the system will:
     * Look only at `.metadata.json` files in the `metadata` directory
     * Filter for files containing a `table-uuid` field matching your specified UUID (case-insensitive)

3. **Default Search**:
   * If neither of the above settings is provided, all `.metadata.json` files in the `metadata` directory become candidates

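
The first two rules can be sketched for the table engine as follows. This is an illustrative example rather than the project's own: the table names and the `s3://bucket/path/to/iceberg_table` location are hypothetical placeholders, while the metadata file name and the UUID are borrowed from the test queries in this change. It assumes that engine-level settings are passed in the `SETTINGS` clause of `CREATE TABLE`.

```sql
-- Rule 1: pin the table to one metadata file.
-- The path is relative to the Iceberg table directory (placeholder location).
CREATE TABLE iceberg_pinned
ENGINE = Iceberg('s3://bucket/path/to/iceberg_table')
SETTINGS iceberg_metadata_file_path = 'metadata/00001-aec4e034-3f73-48f7-87ad-51b7b42a8db7.metadata.json';

-- Rule 2: restrict candidates to metadata files whose `table-uuid` matches.
CREATE TABLE iceberg_by_uuid
ENGINE = Iceberg('s3://bucket/path/to/iceberg_table')
SETTINGS iceberg_metadata_table_uuid = 'ea8d1178-7756-4b89-b21f-00e9f31fe03e';
```

With neither setting, every `.metadata.json` file under `metadata` remains a candidate and the selection step decides among them.
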
### Selecting the Most Recent File {#most-recent-file}

After identifying candidate files using the above rules, the system determines which one is the most recent:

* If `iceberg_recent_metadata_file_by_last_updated_ms_field` is enabled:
  * The file with the largest `last-updated-ms` value is selected

* Otherwise:
  * The file with the highest version number is selected
  * (Version appears as `V` in filenames formatted as `V.metadata.json` or `V-uuid.metadata.json`)

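
To switch the selection rule from version numbers to timestamps, the flag can be set when the table is created. A minimal sketch, assuming a hypothetical table name and bucket location:

```sql
-- Placeholder table name and location, for illustration only.
-- Among the candidates, pick the file with the largest `last-updated-ms`
-- instead of the highest version number in the file name.
CREATE TABLE iceberg_latest_by_timestamp
ENGINE = Iceberg('s3://bucket/path/to/iceberg_table')
SETTINGS iceberg_recent_metadata_file_by_last_updated_ms_field = true;
```

Combining this flag with `iceberg_metadata_file_path` would not change which file is read, because a direct path overrides the other resolution settings.
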
**Note**: All mentioned settings are engine-level settings and must be specified in the `SETTINGS` clause when the table is created.

**Note**: While Iceberg Catalogs typically handle metadata resolution, the `Iceberg` table engine in ClickHouse directly interprets files stored in S3 as Iceberg tables, which is why understanding these resolution rules is important.
docs/en/sql-reference/table-functions/iceberg.md
Lines changed: 38 additions & 0 deletions
@@ -239,6 +239,44 @@ The second one is that while doing time travel you can't get state of table befo
In ClickHouse the behavior is consistent with Spark. You can mentally replace Spark `SELECT` queries with ClickHouse `SELECT` queries and they will work the same way.

When using the `iceberg` table function in ClickHouse, the system needs to locate the correct `metadata.json` file that describes the Iceberg table structure. Here's how this resolution process works:

### Candidate Search (in Priority Order) {#candidate-search}

1. **Direct Path Specification**:
   * If you set `iceberg_metadata_file_path`, the system uses this exact path, combining it with the Iceberg table directory path.
   * When this setting is provided, all other resolution settings are ignored.

2. **Table UUID Matching**:
   * If `iceberg_metadata_table_uuid` is specified, the system will:
     * Look only at `.metadata.json` files in the `metadata` directory
     * Filter for files containing a `table-uuid` field matching your specified UUID (case-insensitive)

3. **Default Search**:
   * If neither of the above settings is provided, all `.metadata.json` files in the `metadata` directory become candidates

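
For the table function, the first two rules might look like the sketch below. The `s3://bucket/path/to/iceberg_table` location is a hypothetical placeholder; the metadata file name and the UUID are borrowed from the test queries in this change, which also show the settings being written inside the function call.

```sql
-- Rule 1: read one specific metadata file (path relative to the table directory).
SELECT count()
FROM iceberg('s3://bucket/path/to/iceberg_table',
    SETTINGS iceberg_metadata_file_path = 'metadata/00001-aec4e034-3f73-48f7-87ad-51b7b42a8db7.metadata.json');

-- Rule 2: consider only metadata files whose `table-uuid` matches.
SELECT *
FROM iceberg('s3://bucket/path/to/iceberg_table',
    SETTINGS iceberg_metadata_table_uuid = 'ea8d1178-7756-4b89-b21f-00e9f31fe03e');
```
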
### Selecting the Most Recent File {#most-recent-file}

After identifying candidate files using the above rules, the system determines which one is the most recent:

* If `iceberg_recent_metadata_file_by_last_updated_ms_field` is enabled:
  * The file with the largest `last-updated-ms` value is selected

* Otherwise:
  * The file with the highest version number is selected
  * (Version appears as `V` in filenames formatted as `V.metadata.json` or `V-uuid.metadata.json`)

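
A minimal sketch of timestamp-based selection with the table function, again assuming a placeholder bucket location:

```sql
-- Placeholder location, for illustration only.
-- Picks the candidate with the largest `last-updated-ms` value
-- rather than the highest version number in the file name.
SELECT *
FROM iceberg('s3://bucket/path/to/iceberg_table',
    SETTINGS iceberg_recent_metadata_file_by_last_updated_ms_field = true);
```
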
**Note**: All mentioned settings are table function settings (not global or query-level settings) and must be specified in the `SETTINGS` clause inside the table function call.

**Note**: While Iceberg Catalogs typically handle metadata resolution, the `iceberg` table function in ClickHouse directly interprets files stored in S3 as Iceberg tables, which is why understanding these resolution rules is important.

## Metadata cache {#metadata-cache}

The `Iceberg` table engine and table function support a metadata cache that stores information from manifest files, the manifest list, and the metadata JSON. The cache is stored in memory. This feature is controlled by the setting `use_iceberg_metadata_files_cache`, which is enabled by default.
If enabled, the engine uses the metadata file with the most recent `last_updated_ms` JSON field. It does not make sense to use this together with `iceberg_metadata_file_path`.
SELECT * FROM icebergS3(s3_conn, filename='merged_several_tables_test', SETTINGS iceberg_metadata_table_uuid = 'ea8d1178-7756-4b89-b21f-00e9f31fe03e') ORDER BY id;
SELECT * FROM icebergS3(s3_conn, filename='merged_several_tables_test', SETTINGS iceberg_metadata_table_uuid = 'A90EED4CF74B4E5BB630096FB9D09021') ORDER BY id;
SELECT * FROM icebergS3(s3_conn, filename='merged_several_tables_test', SETTINGS iceberg_metadata_table_uuid = '6f6f6407_c6A5465f_A808ea8900_e35a38') ORDER BY id;

SELECT count() FROM icebergS3(s3_conn, filename='merged_several_tables_test', SETTINGS iceberg_metadata_file_path = 'metadata/00001-aec4e034-3f73-48f7-87ad-51b7b42a8db7.metadata.json');
SELECT count() FROM icebergS3(s3_conn, filename='merged_several_tables_test', SETTINGS iceberg_metadata_file_path = 'metadata/00001-2aad93a8-a893-4943-8504-f6021f83ecab.metadata.json');
SELECT count() FROM icebergS3(s3_conn, filename='merged_several_tables_test', SETTINGS iceberg_metadata_file_path = 'metadata/00001-aec4e034-3f73-48f7-87ad-51b7b42a8db7.metadata.json');


SELECT * FROM icebergS3(s3_conn, filename='merged_several_tables_test', SETTINGS iceberg_recent_metadata_file_by_last_updated_ms_field = true) ORDER BY id;