[Managed Iceberg] custom equals method for SerializedDataFile by ahmedabu98 · Pull Request #33554 · apache/beam

ahmedabu98 · 2025-01-09T22:27:54Z

Adding a more accurate equals method for SerializedDataFile because the default equals method does not account for objects of type Map<Integer, byte[]>.

This helps clean up logs like these, which are actually extremely noisy:
WARNING: Coder of type class org.apache.beam.sdk.schemas.SchemaCoder has a #structuralValue method which does not return true when the encoding of the elements is equal. Element FileWriteResult{tableIdentifierString={"namespace":["managed_iceberg_bqms_tests_no_delete"],"name":"testWriteToPartitionedTable_667244372263427"}, serializableDataFile=SerializableDataFile{path=gs://managed-iceberg-integration-tests/BigQueryMetastoreCatalogIT/408aefd5-2598-4fc3-b2e1-451500fff0c8/managed_iceberg_bqms_tests_no_delete.db/testWriteToPartitionedTable_667244372263427/data/bool=true/datetime_hour=1970-01-01-00/str_trunc=value_1/23583901-d389-4eca-a748-f7194ce06fbc_dae547f4-509e-4fd6-a805-b23cb94e2b55_1.parquet, fileFormat=PARQUET, recordCount=16, fileSizeInBytes=7052, partitionPath=bool=true/datetime_hour=0/str_trunc=value_1, partitionSpecId=0, keyMetadata=null, splitOffsets=[4], columnSizes={1=99, 2=106, 3=102, 4=45, 5=80, 9=91, 10=107, 11=83, 12=80, 13=83, 14=115, 16=80, 17=110, 18=121, 19=83, 20=151, 21=49, 23=49, 24=49, 25=49, 26=49}, valueCounts={1=16, 2=16, 3=16, 4=16, 5=16, 9=16, 10=16, 11=16, 12=16, 13=16, 14=16, 16=16, 17=16, 18=16, 19=16, 20=67, 21=16, 23=16, 24=16, 25=16, 26=16}, nullValueCounts={1=0, 2=0, 3=0, 4=0, 5=0, 9=0, 10=0, 11=0, 12=0, 13=0, 14=0, 16=0, 17=0, 18=0, 19=0, 20=3, 21=16, 23=16, 24=16, 25=16, 26=16}, nanValueCounts={17=0, 24=0}, lowerBounds={1=[B@1b0b6e85, 2=[B@76cb6b52, 3=[B@14752039, 4=[B@5b22e2ef, 5=[B@1b1dd048, 9=[B@36a1b982, 10=[B@be82387, 11=[B@47918d8f, 12=[B@32fb8aa5, 13=[B@9ad58d1, 14=[B@4313a678, 16=[B@825b660, 17=[B@66e5bb63, 18=[B@56d0cbe, 19=[B@5b135a0d}, upperBounds={1=[B@7ea061b9, 2=[B@35d5e3ad, 3=[B@58300155, 4=[B@19b2fb6c, 5=[B@64e2fc4a, 9=[B@7e30a3f0, 10=[B@26d22fca, 11=[B@2835d9e7, 12=[B@5bc66007, 13=[B@209267ac, 14=[B@1789b00b, 16=[B@1175b9fc, 17=[B@79db358d, 18=[B@4ac7821d, 19=[B@10146aa6}}}

Fixes #33776

github-actions · 2025-01-09T23:34:39Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

github-actions · 2025-01-10T01:34:02Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @damccorm for label build.
R: @Abacn for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/SerializableDataFile.java

github-actions · 2025-01-21T12:14:17Z

Reminder, please take a look at this pr: @kennknowles @damccorm @Abacn

github-actions · 2025-01-23T12:14:29Z

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @damccorm for label build.
R: @chamikaramj for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

damccorm · 2025-01-24T16:23:25Z

@kennknowles could you take another look at this one?

kennknowles

Stepping back, can you provide context on this? We hit kind of the same issue with ByteArrayCoder and implement it with StructuralByteArray (and this is what the structuralValue method is for).

sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/SerializableDataFile.java

kennknowles · 2025-01-24T17:41:05Z

I'm OK to go ahead with this approach but it should be temporary. We shouldn't be implementing equals and hashcode manually for things like this. It is simply too error prone and is guaranteed to bite us.

Lets be sure this is done in a compatible way so we can migrate to types that "just work" with our various automations, and upgrade the automation to work with ByteBuffer. Basically Java arrays are to be avoided except in low-level cases.

github-actions · 2025-02-01T12:14:08Z

Reminder, please take a look at this pr: @robertwb @damccorm @chamikaramj

github-actions · 2025-02-10T12:14:21Z

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @Abacn for label build.
R: @Abacn for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

…to ice_df_equals

ahmedabu98 · 2025-02-10T21:47:04Z

sdks/java/io/iceberg/src/test/java/org/apache/beam/sdk/io/iceberg/SerializableDataFileTest.java

+          .build();
+
+  @Test
+  public void testFieldsInEqualsMethodInSyncWithGetterFields() {


@kennknowles added a check here with a helpful error message

We talked offline, and I think using EqualityTester could also be a good way to do this. Though if all the examples we pass to it have the default value for new fields, it will still miss errors...

Right, EqualsTester just simplifies testing the equals() method for different objects, but it does not help us determine if all fields are accounted for.

kennknowles

At this point we've spent more time on this than it deserves. Might as well merge it. I think at the code level the right approach is switching to the data type and fixing our codegen. This adds tech debt that is almost certain to bite us eventually, but for now we need to make some temporary progress.

…#33554) * add better equals method for SerializedDataFile * add hashcode impl * spotless * add test to check newly added fields; simplify hashcode

add better equals method for SerializedDataFile

18eca58

github-actions bot added java build io labels Jan 9, 2025

github-actions bot added the Next Action: Reviewers label Jan 10, 2025

kennknowles requested changes Jan 10, 2025

View reviewed changes

ahmedabu98 added 2 commits January 13, 2025 10:50

add hashcode impl

123c214

spotless

47eab60

github-actions bot added the slow-review label Jan 21, 2025

github-actions bot removed the slow-review label Jan 23, 2025

kennknowles reviewed Jan 24, 2025

View reviewed changes

sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/SerializableDataFile.java Outdated Show resolved Hide resolved

Merge branch 'master' into ice_df_equals

95013b5

github-actions bot added the slow-review label Feb 1, 2025

ahmedabu98 mentioned this pull request Feb 4, 2025

[Bug]: Iceberg integration test warning log spam #33776

Closed

17 tasks

github-actions bot removed the slow-review label Feb 10, 2025

ahmedabu98 added 2 commits February 10, 2025 16:35

add test to check newly added fields; simplify hashcode

970bdc9

Merge branch 'ice_df_equals' of https://github.com/ahmedabu98/beam in…

70503a8

…to ice_df_equals

ahmedabu98 commented Feb 10, 2025

View reviewed changes

kennknowles reviewed Feb 12, 2025

View reviewed changes

kennknowles approved these changes Feb 12, 2025

View reviewed changes

ahmedabu98 merged commit 64b0e76 into apache:master Feb 14, 2025
19 checks passed

Conversation

ahmedabu98 commented Jan 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jan 9, 2025

Uh oh!

github-actions bot commented Jan 10, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Jan 21, 2025

Uh oh!

github-actions bot commented Jan 23, 2025

Uh oh!

damccorm commented Jan 24, 2025

Uh oh!

kennknowles left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kennknowles commented Jan 24, 2025

Uh oh!

github-actions bot commented Feb 1, 2025

Uh oh!

github-actions bot commented Feb 10, 2025

Uh oh!

ahmedabu98 Feb 10, 2025

Choose a reason for hiding this comment

Uh oh!

kennknowles Feb 11, 2025

Choose a reason for hiding this comment

Uh oh!

ahmedabu98 Feb 11, 2025

Choose a reason for hiding this comment

Uh oh!

kennknowles left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ahmedabu98 commented Jan 9, 2025 •

edited

Loading