-
Notifications
You must be signed in to change notification settings - Fork 344
docs: clarify Parameters for the add_files API #2249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -1004,6 +1004,33 @@ To show only data files or delete files in the current snapshot, use `table.insp | |||||||||||||
|
||||||||||||||
Expert Iceberg users may choose to commit existing parquet files to the Iceberg table as data files, without rewriting them. | ||||||||||||||
|
||||||||||||||
<!-- prettier-ignore-start --> | ||||||||||||||
|
||||||||||||||
!!! note "Name Mapping" | ||||||||||||||
Because `add_files` uses existing files without writing new parquet files that are aware of the Iceberg's schema, it requires the Iceberg's table to have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) (The Name mapping maps the field names within the parquet files to the Iceberg field IDs). Hence, `add_files` requires that there are no field IDs in the parquet file's metadata, and creates a new Name Mapping based on the table's current schema if the table doesn't already have one. | ||||||||||||||
|
||||||||||||||
!!! note "Partitions" | ||||||||||||||
`add_files` only requires the client to read the existing parquet files' metadata footer to infer the partition value of each file. This implementation also supports adding files to Iceberg tables with partition transforms like `MonthTransform`, and `TruncateTransform` which preserve the order of the values after the transformation (Any Transform that has the `preserves_order` property set to True is supported). Please note that if the column statistics of the `PartitionField`'s source column are not present in the parquet metadata, the partition value is inferred as `None`. | ||||||||||||||
|
||||||||||||||
!!! warning "Maintenance Operations" | ||||||||||||||
Because `add_files` commits the existing parquet files to the Iceberg Table as any other data file, destructive maintenance operations like expiring snapshots will remove them. | ||||||||||||||
|
||||||||||||||
!!! warning "Check Duplicate Files" | ||||||||||||||
The `check_duplicate_files` parameter is `True` by default and will check the new files against the existing Iceberg table data files to prevent duplicates. This check can be expensive for large tables with many files. It is recommended to use the default configuration. The check can be turned off by setting `check_duplicate_files=False`, but this may result in duplicate files being added to the table, which can lead to data consistency issues and potential table corruption if the same data file is added multiple times. | ||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
I used LLM to generate this based on the function definition. WDYT? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for your guidance; it makes a lot of sense. I appreciate your insights, and I believe it would be beneficial to offer users a clear recommendation. for example, it would be wise to suggest keeping the check_duplicate_files parameter set to True by default to help ensure data consistency and avoid accidental duplication. If users do experience performance bottlenecks related to the validation process, they might then consider setting it to False. |
||||||||||||||
|
||||||||||||||
<!-- prettier-ignore-end --> | ||||||||||||||
|
||||||||||||||
### Usage | ||||||||||||||
|
||||||||||||||
| Parameter | Required? | Type | Description | | ||||||||||||||
| ------------------------- | --------- | -------------- | ----------------------------------------------------------------------- | | ||||||||||||||
| `file_paths` | ✔️ | List[str] | The list of full file paths to be added as data files to the table | | ||||||||||||||
| `snapshot_properties` | | Dict[str, str] | Properties to set for the new snapshot. Defaults to an empty dictionary | | ||||||||||||||
| `check_duplicate_files` | | bool | Whether to check for duplicate files. Defaults to `True` | | ||||||||||||||
|
||||||||||||||
### Example | ||||||||||||||
|
||||||||||||||
Add files to Iceberg table: | ||||||||||||||
```python | ||||||||||||||
# Given that these parquet files have schema consistent with the Iceberg table | ||||||||||||||
|
||||||||||||||
|
@@ -1019,18 +1046,34 @@ tbl.add_files(file_paths=file_paths) | |||||||||||||
# A new snapshot is committed to the table with manifests pointing to the existing parquet files | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
<!-- prettier-ignore-start --> | ||||||||||||||
Add files to Iceberg table with custom snapshot properties: | ||||||||||||||
```python | ||||||||||||||
# Assume an existing Iceberg table object `tbl` | ||||||||||||||
|
||||||||||||||
!!! note "Name Mapping" | ||||||||||||||
Because `add_files` uses existing files without writing new parquet files that are aware of the Iceberg's schema, it requires the Iceberg's table to have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) (The Name mapping maps the field names within the parquet files to the Iceberg field IDs). Hence, `add_files` requires that there are no field IDs in the parquet file's metadata, and creates a new Name Mapping based on the table's current schema if the table doesn't already have one. | ||||||||||||||
file_paths = [ | ||||||||||||||
"s3a://warehouse/default/existing-1.parquet", | ||||||||||||||
"s3a://warehouse/default/existing-2.parquet", | ||||||||||||||
] | ||||||||||||||
|
||||||||||||||
!!! note "Partitions" | ||||||||||||||
`add_files` only requires the client to read the existing parquet files' metadata footer to infer the partition value of each file. This implementation also supports adding files to Iceberg tables with partition transforms like `MonthTransform`, and `TruncateTransform` which preserve the order of the values after the transformation (Any Transform that has the `preserves_order` property set to True is supported). Please note that if the column statistics of the `PartitionField`'s source column are not present in the parquet metadata, the partition value is inferred as `None`. | ||||||||||||||
# Custom snapshot properties | ||||||||||||||
snapshot_properties = {"abc": "def"} | ||||||||||||||
|
||||||||||||||
!!! warning "Maintenance Operations" | ||||||||||||||
Because `add_files` commits the existing parquet files to the Iceberg Table as any other data file, destructive maintenance operations like expiring snapshots will remove them. | ||||||||||||||
# Enable duplicate file checking | ||||||||||||||
check_duplicate_files = True | ||||||||||||||
|
||||||||||||||
<!-- prettier-ignore-end --> | ||||||||||||||
# Add the Parquet files to the Iceberg table without rewriting | ||||||||||||||
tbl.add_files( | ||||||||||||||
file_paths=file_paths, | ||||||||||||||
snapshot_properties=snapshot_properties, | ||||||||||||||
check_duplicate_files=check_duplicate_files | ||||||||||||||
) | ||||||||||||||
|
||||||||||||||
# NameMapping must have been set to enable reads | ||||||||||||||
assert tbl.name_mapping() is not None | ||||||||||||||
|
||||||||||||||
# Verify that the snapshot property was set correctly | ||||||||||||||
assert tbl.metadata.snapshots[-1].summary["abc"] == "def" | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
## Schema evolution | ||||||||||||||
|
||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the paragraphs with
!!!
are missing the 4 spaces, like how they are written initiallythe spaces are there to render these boxes correctly

could you add them back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your reply. I will submit this!