Skip to content

[Managed Iceberg] custom equals method for SerializedDataFile#33554

Merged
ahmedabu98 merged 6 commits intoapache:masterfrom
ahmedabu98:ice_df_equals
Feb 14, 2025
Merged

[Managed Iceberg] custom equals method for SerializedDataFile#33554
ahmedabu98 merged 6 commits intoapache:masterfrom
ahmedabu98:ice_df_equals

Conversation

@ahmedabu98
Copy link
Copy Markdown
Contributor

@ahmedabu98 ahmedabu98 commented Jan 9, 2025

Adding a more accurate equals method for SerializedDataFile because the default equals method does not account for objects of type Map<Integer, byte[]>.

This helps clean up logs like these, which are actually extremely noisy:
WARNING: Coder of type class org.apache.beam.sdk.schemas.SchemaCoder has a #structuralValue method which does not return true when the encoding of the elements is equal. Element FileWriteResult{tableIdentifierString={"namespace":["managed_iceberg_bqms_tests_no_delete"],"name":"testWriteToPartitionedTable_667244372263427"}, serializableDataFile=SerializableDataFile{path=gs://managed-iceberg-integration-tests/BigQueryMetastoreCatalogIT/408aefd5-2598-4fc3-b2e1-451500fff0c8/managed_iceberg_bqms_tests_no_delete.db/testWriteToPartitionedTable_667244372263427/data/bool=true/datetime_hour=1970-01-01-00/str_trunc=value_1/23583901-d389-4eca-a748-f7194ce06fbc_dae547f4-509e-4fd6-a805-b23cb94e2b55_1.parquet, fileFormat=PARQUET, recordCount=16, fileSizeInBytes=7052, partitionPath=bool=true/datetime_hour=0/str_trunc=value_1, partitionSpecId=0, keyMetadata=null, splitOffsets=[4], columnSizes={1=99, 2=106, 3=102, 4=45, 5=80, 9=91, 10=107, 11=83, 12=80, 13=83, 14=115, 16=80, 17=110, 18=121, 19=83, 20=151, 21=49, 23=49, 24=49, 25=49, 26=49}, valueCounts={1=16, 2=16, 3=16, 4=16, 5=16, 9=16, 10=16, 11=16, 12=16, 13=16, 14=16, 16=16, 17=16, 18=16, 19=16, 20=67, 21=16, 23=16, 24=16, 25=16, 26=16}, nullValueCounts={1=0, 2=0, 3=0, 4=0, 5=0, 9=0, 10=0, 11=0, 12=0, 13=0, 14=0, 16=0, 17=0, 18=0, 19=0, 20=3, 21=16, 23=16, 24=16, 25=16, 26=16}, nanValueCounts={17=0, 24=0}, lowerBounds={1=[B@1b0b6e85, 2=[B@76cb6b52, 3=[B@14752039, 4=[B@5b22e2ef, 5=[B@1b1dd048, 9=[B@36a1b982, 10=[B@be82387, 11=[B@47918d8f, 12=[B@32fb8aa5, 13=[B@9ad58d1, 14=[B@4313a678, 16=[B@825b660, 17=[B@66e5bb63, 18=[B@56d0cbe, 19=[B@5b135a0d}, upperBounds={1=[B@7ea061b9, 2=[B@35d5e3ad, 3=[B@58300155, 4=[B@19b2fb6c, 5=[B@64e2fc4a, 9=[B@7e30a3f0, 10=[B@26d22fca, 11=[B@2835d9e7, 12=[B@5bc66007, 13=[B@209267ac, 14=[B@1789b00b, 16=[B@1175b9fc, 17=[B@79db358d, 18=[B@4ac7821d, 19=[B@10146aa6}}}

Fixes #33776

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jan 9, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@github-actions
Copy link
Copy Markdown
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @damccorm for label build.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@github-actions
Copy link
Copy Markdown
Contributor

Reminder, please take a look at this pr: @kennknowles @damccorm @Abacn

@github-actions
Copy link
Copy Markdown
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @damccorm for label build.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@damccorm
Copy link
Copy Markdown
Contributor

@kennknowles could you take another look at this one?

Copy link
Copy Markdown
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stepping back, can you provide context on this? We hit kind of the same issue with ByteArrayCoder and implement it with StructuralByteArray (and this is what the structuralValue method is for).

@kennknowles
Copy link
Copy Markdown
Member

I'm OK to go ahead with this approach but it should be temporary. We shouldn't be implementing equals and hashcode manually for things like this. It is simply too error prone and is guaranteed to bite us.

Lets be sure this is done in a compatible way so we can migrate to types that "just work" with our various automations, and upgrade the automation to work with ByteBuffer. Basically Java arrays are to be avoided except in low-level cases.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 1, 2025

Reminder, please take a look at this pr: @robertwb @damccorm @chamikaramj

@github-actions
Copy link
Copy Markdown
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @Abacn for label build.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

.build();

@Test
public void testFieldsInEqualsMethodInSyncWithGetterFields() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kennknowles added a check here with a helpful error message

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked offline, and I think using EqualityTester could also be a good way to do this. Though if all the examples we pass to it have the default value for new fields, it will still miss errors...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, EqualsTester just simplifies testing the equals() method for different objects, but it does not help us determine if all fields are accounted for.

Copy link
Copy Markdown
Member

@kennknowles kennknowles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point we've spent more time on this than it deserves. Might as well merge it. I think at the code level the right approach is switching to the data type and fixing our codegen. This adds tech debt that is almost certain to bite us eventually, but for now we need to make some temporary progress.

@ahmedabu98 ahmedabu98 merged commit 64b0e76 into apache:master Feb 14, 2025
19 checks passed
VardhanThigle pushed a commit to VardhanThigle/beam that referenced this pull request Mar 21, 2025
…#33554)

* add better equals method for SerializedDataFile

* add hashcode impl

* spotless

* add test to check newly added fields; simplify hashcode
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Iceberg integration test warning log spam

3 participants