Contributor

@rutb327 rutb327 commented Aug 11, 2025

Closes #2272
Collaborator: @geruh

Rationale for this change

Implements the validation logic described in #2272 so that partition field name conflicts with schema fields are handled the same way as in Java and Rust.
This mirrors the Java method checkAndAddPartitionName():
https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L392-L416

Identity transforms (sourceColumnID != null): allow a schema field name conflict only when the partition field is sourced from the same field.
Non-identity transforms (sourceColumnID == null): disallow any schema field name conflict.

In this PR, isinstance(transform, (IdentityTransform, VoidTransform)) is used to achieve the same logic as Java’s sourceColumnID check.
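The rule above can be sketched without pyiceberg. This is a minimal illustration under stated assumptions, not the PR's implementation: SchemaField, validate_partition_name_sketch, and the is_identity_or_void flag are hypothetical stand-ins, and the wording of the non-identity error message is assumed.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SchemaField:
    # Hypothetical stand-in for a schema field (not pyiceberg's NestedField).
    field_id: int
    name: str


def validate_partition_name_sketch(
    name: str,
    is_identity_or_void: bool,
    source_id: int,
    schema_fields: Dict[str, SchemaField],
) -> None:
    """Raise ValueError if a partition field name conflicts with a schema field."""
    field = schema_fields.get(name)
    if field is None:
        return  # No conflict if the name does not exist in the schema.

    if is_identity_or_void:
        # Identity/void: the shared name is fine only when it is the source field.
        if field.field_id != source_id:
            raise ValueError(
                f"Cannot create identity partition sourced from different field in schema: {name}"
            )
    else:
        # Any other transform may not reuse a schema field name (message wording assumed).
        raise ValueError(f"Cannot create partition from name that exists in schema: {name}")
```

With a schema containing id (field 1) and ts (field 2), an identity partition named id sourced from field 1 passes, an identity partition named id sourced from field 2 fails, and a non-identity partition named id always fails.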

Are these changes tested?

Yes. All existing tests pass, and I added a test covering the validation scenarios.

Are there any user-facing changes?

Yes. Non-identity transforms can no longer use schema field names as partition field names.

Contributor Author

rutb327 commented Aug 12, 2025

In Java, all partition-schema validation goes through https://github.com/apache/iceberg/blob/4dbc7f578eee7ceb9def35ebfa1a4cc236fb598f/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L392-L416 during table creation with partition specs, partition spec updates, and schema evolution.
In Python, the validation in https://github.com/apache/iceberg-python/blob/d1c6005ad05166ab0fb08d3c15ccdfd7568e8013/pyiceberg/table/update/spec.py only covered partition spec updates.
So, I've added the validation to:

Are these the correct locations for the validation logic, or should they be placed elsewhere?

Contributor

@dingo4dev dingo4dev left a comment

Thanks for your work on this!

To improve readability and keep related code together, what are your thoughts on placing all the partition validation logic inside the partitioning.py file? Centralizing it there could make the validation process easier for future contributors to find and understand.

Let me know what you think! @kevinjqliu

Contributor

@kevinjqliu kevinjqliu left a comment

Thank you for the PR! I left a few comments. I like how we check for conflicts on both changes to the PartitionSpec and changes to the Schema.

I've double-checked that only 2 places modify PartitionSpec, assign_fresh_partition_spec_ids and UpdateSpec._apply, and we covered both with tests :)
Similarly, we cover the 1 place that modifies Schema, in UpdateSchema._apply.

I think both Java and Rust lack a test that checks PartitionSpec for conflicts when the Schema is changed.
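The schema-change direction can be illustrated with a small sketch: when the schema evolves, every existing partition field is re-checked against the new schema. The names below (PartitionFieldSketch, find_conflicts, and the name-to-id dict used as a schema) are hypothetical stand-ins chosen for illustration, not pyiceberg APIs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PartitionFieldSketch:
    # Hypothetical stand-in for a partition field.
    name: str
    source_id: int
    is_identity_or_void: bool


def find_conflicts(
    specs: List[List[PartitionFieldSketch]], new_schema: Dict[str, int]
) -> List[str]:
    """Return partition field names that conflict with the evolved schema.

    new_schema maps column name -> field id (an assumed, simplified schema shape).
    """
    conflicts: List[str] = []
    for spec in specs:
        for pf in spec:
            field_id = new_schema.get(pf.name)
            if field_id is None:
                continue  # Name not present in the schema: no conflict.
            if pf.is_identity_or_void and field_id == pf.source_id:
                continue  # Identity/void on the same source field is allowed.
            conflicts.append(pf.name)
    return conflicts
```

For example, if a table already has a bucket partition field named id_bucket, adding a schema column named id_bucket would be reported as a conflict, while an identity partition named id sourced from column id would not.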

Comment on lines 67 to 69
def _create_table_with_schema(
    catalog: Catalog, schema: Schema, format_version: str, partition_spec: Optional[PartitionSpec] = None
) -> Table:
Contributor

following other create table helpers in tests, for example

def _create_table(
    session_catalog: Catalog,
    identifier: str,
    format_version: int,
    location: str,
    partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC,
    schema: Schema = TABLE_SCHEMA,
) -> Table:
    try:
        session_catalog.drop_table(identifier=identifier)
    except NoSuchTableError:
        pass
    return session_catalog.create_table(
        identifier=identifier,
        schema=schema,
        location=location,
        properties={"format-version": str(format_version)},
        partition_spec=partition_spec,
    )

Suggested change
- def _create_table_with_schema(
-     catalog: Catalog, schema: Schema, format_version: str, partition_spec: Optional[PartitionSpec] = None
- ) -> Table:
+ def _create_table_with_schema(
+     catalog: Catalog, schema: Schema, format_version: str, partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC
+ ) -> Table:

Comment on lines 76 to 80
if partition_spec:
    return catalog.create_table(
        identifier=tbl_name, schema=schema, partition_spec=partition_spec, properties={"format-version": format_version}
    )
return catalog.create_table(identifier=tbl_name, schema=schema, properties={"format-version": format_version})
Contributor

and then we can just do this

Suggested change
- if partition_spec:
-     return catalog.create_table(
-         identifier=tbl_name, schema=schema, partition_spec=partition_spec, properties={"format-version": format_version}
-     )
- return catalog.create_table(identifier=tbl_name, schema=schema, properties={"format-version": format_version})
+ return catalog.create_table(
+     identifier=tbl_name, schema=schema, partition_spec=partition_spec, properties={"format-version": format_version}
+ )
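The design choice behind the two suggestions above can be shown in isolation: defaulting the parameter to an "unpartitioned" sentinel, instead of None, removes the branch at the call site. SpecSketch, UNPARTITIONED, and create_table_kwargs are hypothetical names used only for this sketch; they are not the test helper or pyiceberg's classes.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class SpecSketch:
    # Hypothetical stand-in for a partition spec; immutable, so safe as a default argument.
    fields: Tuple[str, ...] = ()


UNPARTITIONED = SpecSketch()  # Stand-in for UNPARTITIONED_PARTITION_SPEC.


def create_table_kwargs(
    name: str, format_version: str, partition_spec: SpecSketch = UNPARTITIONED
) -> Dict[str, Any]:
    # Single code path: the default spec is passed through like any other spec.
    return {
        "identifier": name,
        "partition_spec": partition_spec,
        "properties": {"format-version": format_version},
    }
```

Callers that omit partition_spec get the unpartitioned sentinel, and callers that pass a spec get exactly that spec, with no if/else in the helper.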

    return  # No conflict if field doesn't exist in schema

if isinstance(partition_transform, (IdentityTransform, VoidTransform)):
    # For identity transforms, allow conflict only if sourced from the same schema field
Contributor

Suggested change
- # For identity transforms, allow conflict only if sourced from the same schema field
+ # For identity and void transforms, allow conflict only if sourced from the same schema field

Comment on lines 267 to 268
        raise ValueError(f"Cannot create identity partition from a different source field in the schema: {field_name}")
else:
Contributor

match the Java error message

Suggested change
-         raise ValueError(f"Cannot create identity partition from a different source field in the schema: {field_name}")
- else:
+         raise ValueError(f"Cannot create identity partition sourced from different field in schema: {field_name}")
+ else:

) -> None:
    from pyiceberg.partitioning import validate_partition_name

    validate_partition_name(name, transform, source_id, schema)
    if not name:
Contributor

Contributor Author

We can do that

Comment on lines +243 to +249
_check_and_add_partition_name(
    self._transaction.table_metadata.schema(),
    added_field.name,
    added_field.source_id,
    added_field.transform,
    partition_names,
)
Contributor

good catch. just to confirm this covers the newly added partition fields?

Contributor Author

yes, that's correct

Comment on lines 661 to 668
if self._transaction is not None:
    from pyiceberg.partitioning import validate_partition_name

    for spec in self._transaction.table_metadata.partition_specs:
        for partition_field in spec.fields:
            validate_partition_name(
                partition_field.name, partition_field.transform, partition_field.source_id, new_schema
            )
Contributor

i think there should always be a self._transaction

Suggested change
- if self._transaction is not None:
-     from pyiceberg.partitioning import validate_partition_name
-     for spec in self._transaction.table_metadata.partition_specs:
-         for partition_field in spec.fields:
-             validate_partition_name(
-                 partition_field.name, partition_field.transform, partition_field.source_id, new_schema
-             )
+ from pyiceberg.partitioning import validate_partition_name
+ for spec in self._transaction.table_metadata.partition_specs:
+     for partition_field in spec.fields:
+         validate_partition_name(
+             partition_field.name, partition_field.transform, partition_field.source_id, new_schema
+         )

Contributor Author

okay, I'll do the suggested changes

Contributor Author

Some tests show that transaction can be None in some cases: after removing the check, tests in test_schema.py fail. They use: UpdateSchema(transaction=None, schema=Schema())
https://github.com/rutb327/iceberg-python/blob/24b12ddd8fdab4a62650786a2c3cdd56a53f8719/tests/test_schema.py#L933

Contributor

looks like everywhere else in the codebase we include transaction in UpdateSchema.

Maybe we can update the tests like this

def test_add_top_level_primitives(primitive_fields: List[NestedField], table_v2: Table) -> None:
    for primitive_field in primitive_fields:
        new_schema = Schema(primitive_field)
        applied = UpdateSchema(transaction=Transaction(table_v2), schema=Schema()).union_by_name(new_schema)._apply()  # type: ignore
        assert applied == new_schema

@kevinjqliu
Contributor

I opened apache/iceberg#13833 and apache/iceberg-rust#1609 for checking for name conflict during schema update

@Fokko Fokko merged commit 5a781df into apache:main Aug 20, 2025
10 checks passed
Contributor

Fokko commented Aug 20, 2025

Let's move this forward, thanks @rutb327 for working on this, and thanks @kevinjqliu and @dingo4dev for the review 🙌

Development

Successfully merging this pull request may close these issues.

[feature request] disallow creating partition field with name that conflicts with schema field when its not identity transform
4 participants