Commit 8013545

Xiezhibin and “Zhibin authored
docs: clarify Parameters for the add_files API (#2249)
## Summary

Related Issue: #2132

This PR enhances the documentation for the add_files API by:

1. Adding a parameter table to clarify the required and optional inputs and outputs.
2. Providing a complete example that includes all parameters, such as snapshot_properties and check_duplicate_files.
3. Strengthening the warning regarding the default setting of check_duplicate_files=True and the associated risks of disabling it.

---------

Co-authored-by: “Zhibin <[email protected]>
1 parent a7f6c08 commit 8013545

1 file changed, +51 -8 lines changed

mkdocs/docs/api.md

Lines changed: 51 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -1004,6 +1004,33 @@ To show only data files or delete files in the current snapshot, use `table.insp
10041004

10051005
Expert Iceberg users may choose to commit existing parquet files to the Iceberg table as data files, without rewriting them.
10061006

1007+
<!-- prettier-ignore-start -->
1008+
1009+
!!! note "Name Mapping"
1010+
Because `add_files` registers existing files rather than writing new Parquet files that carry the Iceberg schema, the Iceberg table must have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization), which maps the field names in the Parquet files to Iceberg field IDs. Consequently, `add_files` requires that the Parquet files' metadata contain no field IDs, and it creates a new Name Mapping from the table's current schema if the table doesn't already have one.
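To make the idea of a Name Mapping concrete, here is a minimal sketch in plain Python. It uses the JSON shape described in the Iceberg spec (a list of objects with a `field-id` and the accepted `names`), but the schema, the column names, and the `resolve_field_id` helper are hypothetical and are not part of PyIceberg.

```python
import json

# Hypothetical table schema: Iceberg field ID -> column name.
schema_fields = {1: "id", 2: "event_ts", 3: "payload"}

# Name mapping in the JSON shape described by the Iceberg spec:
# a list of objects, each with a "field-id" and the accepted "names".
name_mapping = [
    {"field-id": field_id, "names": [name]}
    for field_id, name in schema_fields.items()
]


def resolve_field_id(mapping, column_name):
    """Return the field ID mapped to a Parquet column name, or None if unmapped."""
    for entry in mapping:
        if column_name in entry["names"]:
            return entry["field-id"]
    return None


# Columns found in a (hypothetical) Parquet file that carries no field IDs.
parquet_columns = ["id", "payload", "extra_col"]
resolved = {col: resolve_field_id(name_mapping, col) for col in parquet_columns}

print(json.dumps(name_mapping))
print(resolved)
```

Columns without an entry in the mapping resolve to `None` in this sketch, which illustrates why readers need the mapping to associate file columns with table fields.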
1011+
1012+
!!! note "Partitions"
1013+
`add_files` only needs to read the metadata footer of each existing Parquet file to infer that file's partition value. The implementation also supports Iceberg tables with partition transforms such as `MonthTransform` and `TruncateTransform`, which preserve the order of values after transformation (any transform whose `preserves_order` property is `True` is supported). Note that if the column statistics for the `PartitionField`'s source column are missing from the Parquet metadata, the partition value is inferred as `None`.
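The reason order-preserving transforms suffice can be sketched in a few lines: if a transform is monotonic, then equal transformed minimum and maximum statistics imply that every value in the file maps to the same partition. The code below is a conceptual illustration only, not PyIceberg's implementation; the function names are hypothetical, and raising an error when a file spans several partitions is an assumption of this sketch.

```python
from datetime import date


def month_transform(d: date) -> int:
    """Months since 1970-01, an order-preserving transform."""
    return (d.year - 1970) * 12 + (d.month - 1)


def truncate_transform(width: int):
    """Build a transform that truncates integers down to a multiple of `width`."""
    def apply(v: int) -> int:
        return v - (v % width)
    return apply


def infer_partition_value(stats, transform):
    """Infer a single partition value from footer-style (min, max) statistics.

    Returns None when statistics are missing. Raises ValueError when the
    transformed bounds differ, meaning the file spans several partitions.
    """
    if stats is None:
        return None
    lower, upper = transform(stats[0]), transform(stats[1])
    if lower != upper:
        raise ValueError(f"file spans multiple partitions: {lower} != {upper}")
    return lower


# All values fall in March 2024, so one partition value can be inferred.
march = infer_partition_value((date(2024, 3, 1), date(2024, 3, 31)), month_transform)

# Values 20..29 all truncate to 20 with width 10.
bucket = infer_partition_value((20, 29), truncate_transform(10))

# Missing column statistics yield None.
missing = infer_partition_value(None, month_transform)
```

Only the minimum and maximum are consulted, which is why reading the footer is enough and the data pages never need to be scanned.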
1014+
1015+
!!! warning "Maintenance Operations"
1016+
Because `add_files` commits the existing Parquet files to the Iceberg table like any other data file, destructive maintenance operations such as expiring snapshots will remove them.
1017+
1018+
!!! warning "Check Duplicate Files"
1019+
The `check_duplicate_files` parameter controls whether the method validates that the specified `file_paths` do not already exist in the Iceberg table. When set to `True` (the default), the method checks the paths against the table's current data files, which prevents the same file from being added more than once. This check matters for data integrity, but it can add performance overhead on tables with a large number of files. Setting `check_duplicate_files=False` can improve performance but increases the risk of duplicate files, which may lead to data inconsistencies or table corruption. It is strongly recommended to keep this parameter enabled unless duplicate file handling is strictly enforced elsewhere.
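For pipelines that disable the check, the kind of validation that would then have to be enforced elsewhere can be sketched as follows. This is a conceptual illustration in plain Python, not PyIceberg's implementation; `find_duplicates` and the example paths are hypothetical.

```python
def find_duplicates(file_paths, existing_paths):
    """Return paths that repeat within the request or already exist in the table."""
    existing = set(existing_paths)
    seen = set()
    duplicates = []
    for path in file_paths:
        if path in existing or path in seen:
            duplicates.append(path)
        seen.add(path)
    return duplicates


# Paths assumed to be registered in the table already.
existing_paths = ["s3a://warehouse/default/existing-1.parquet"]

# A request that re-adds a registered file and lists another file twice.
requested = [
    "s3a://warehouse/default/existing-1.parquet",
    "s3a://warehouse/default/existing-2.parquet",
    "s3a://warehouse/default/existing-2.parquet",
]

duplicates = find_duplicates(requested, existing_paths)
if duplicates:
    print(f"Refusing to add duplicate files: {duplicates}")
```

The cost of such a check grows with the number of files already tracked by the table, which is the performance trade-off described above.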
1020+
1021+
<!-- prettier-ignore-end -->
1022+
1023+
### Usage
1024+
1025+
| Parameter | Required? | Type | Description |
1026+
| ------------------------- | --------- | -------------- | ----------------------------------------------------------------------- |
1027+
| `file_paths` | ✔️ | List[str] | The list of full file paths to be added as data files to the table |
1028+
| `snapshot_properties` | | Dict[str, str] | Properties to set for the new snapshot. Defaults to an empty dictionary |
1029+
| `check_duplicate_files` | | bool | Whether to check for duplicate files. Defaults to `True` |
1030+
1031+
### Example
1032+
1033+
Add files to an Iceberg table:
10071034
```python
10081035
# Given that these parquet files have schema consistent with the Iceberg table
10091036
@@ -1019,18 +1046,34 @@ tbl.add_files(file_paths=file_paths)
10191046
# A new snapshot is committed to the table with manifests pointing to the existing parquet files
10201047
```
10211048

1022-
<!-- prettier-ignore-start -->
1049+
Add files to an Iceberg table with custom snapshot properties:
1050+
```python
1051+
# Assume an existing Iceberg table object `tbl`
10231052

1024-
!!! note "Name Mapping"
1025-
Because `add_files` uses existing files without writing new parquet files that are aware of the Iceberg's schema, it requires the Iceberg's table to have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) (The Name mapping maps the field names within the parquet files to the Iceberg field IDs). Hence, `add_files` requires that there are no field IDs in the parquet file's metadata, and creates a new Name Mapping based on the table's current schema if the table doesn't already have one.
1053+
file_paths = [
1054+
    "s3a://warehouse/default/existing-1.parquet",
1055+
    "s3a://warehouse/default/existing-2.parquet",
1056+
]
10261057

1027-
!!! note "Partitions"
1028-
`add_files` only requires the client to read the existing parquet files' metadata footer to infer the partition value of each file. This implementation also supports adding files to Iceberg tables with partition transforms like `MonthTransform`, and `TruncateTransform` which preserve the order of the values after the transformation (Any Transform that has the `preserves_order` property set to True is supported). Please note that if the column statistics of the `PartitionField`'s source column are not present in the parquet metadata, the partition value is inferred as `None`.
1058+
# Custom snapshot properties
1059+
snapshot_properties = {"abc": "def"}
10291060

1030-
!!! warning "Maintenance Operations"
1031-
Because `add_files` commits the existing parquet files to the Iceberg Table as any other data file, destructive maintenance operations like expiring snapshots will remove them.
1061+
# Enable duplicate file checking
1062+
check_duplicate_files = True
10321063

1033-
<!-- prettier-ignore-end -->
1064+
# Add the Parquet files to the Iceberg table without rewriting
1065+
tbl.add_files(
1066+
    file_paths=file_paths,
1067+
    snapshot_properties=snapshot_properties,
1068+
    check_duplicate_files=check_duplicate_files,
1069+
)
1070+
1071+
# NameMapping must have been set to enable reads
1072+
assert tbl.name_mapping() is not None
1073+
1074+
# Verify that the snapshot property was set correctly
1075+
assert tbl.metadata.snapshots[-1].summary["abc"] == "def"
1076+
```
10341077

10351078
## Schema evolution
10361079
