[Managed Iceberg] custom equals method for SerializedDataFile#33554
[Managed Iceberg] custom equals method for SerializedDataFile#33554ahmedabu98 merged 6 commits intoapache:masterfrom
Conversation
|
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
|
Assigning reviewers. If you would like to opt out of this review, comment R: @kennknowles for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/SerializableDataFile.java
Outdated
Show resolved
Hide resolved
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/SerializableDataFile.java
Show resolved
Hide resolved
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/SerializableDataFile.java
Show resolved
Hide resolved
|
Reminder, please take a look at this pr: @kennknowles @damccorm @Abacn |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @robertwb for label java. Available commands:
|
|
@kennknowles could you take another look at this one? |
kennknowles
left a comment
There was a problem hiding this comment.
Stepping back, can you provide context on this? We hit kind of the same issue with ByteArrayCoder and implement it with StructuralByteArray (and this is what the structuralValue method is for).
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/SerializableDataFile.java
Outdated
Show resolved
Hide resolved
|
I'm OK to go ahead with this approach but it should be temporary. We shouldn't be implementing equals and hashcode manually for things like this. It is simply too error prone and is guaranteed to bite us. Lets be sure this is done in a compatible way so we can migrate to types that "just work" with our various automations, and upgrade the automation to work with ByteBuffer. Basically Java arrays are to be avoided except in low-level cases. |
|
Reminder, please take a look at this pr: @robertwb @damccorm @chamikaramj |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @Abacn for label java. Available commands:
|
| .build(); | ||
|
|
||
| @Test | ||
| public void testFieldsInEqualsMethodInSyncWithGetterFields() { |
There was a problem hiding this comment.
@kennknowles added a check here with a helpful error message
There was a problem hiding this comment.
We talked offline, and I think using EqualityTester could also be a good way to do this. Though if all the examples we pass to it have the default value for new fields, it will still miss errors...
There was a problem hiding this comment.
Right, EqualsTester just simplifies testing the equals() method for different objects, but it does not help us determine if all fields are accounted for.
kennknowles
left a comment
There was a problem hiding this comment.
At this point we've spent more time on this than it deserves. Might as well merge it. I think at the code level the right approach is switching to the data type and fixing our codegen. This adds tech debt that is almost certain to bite us eventually, but for now we need to make some temporary progress.
…#33554) * add better equals method for SerializedDataFile * add hashcode impl * spotless * add test to check newly added fields; simplify hashcode
Adding a more accurate equals method for SerializedDataFile because the default equals method does not account for objects of type
Map<Integer, byte[]>.This helps clean up logs like these, which are actually extremely noisy:
WARNING: Coder of type class org.apache.beam.sdk.schemas.SchemaCoder has a #structuralValue method which does not return true when the encoding of the elements is equal. Element FileWriteResult{tableIdentifierString={"namespace":["managed_iceberg_bqms_tests_no_delete"],"name":"testWriteToPartitionedTable_667244372263427"}, serializableDataFile=SerializableDataFile{path=gs://managed-iceberg-integration-tests/BigQueryMetastoreCatalogIT/408aefd5-2598-4fc3-b2e1-451500fff0c8/managed_iceberg_bqms_tests_no_delete.db/testWriteToPartitionedTable_667244372263427/data/bool=true/datetime_hour=1970-01-01-00/str_trunc=value_1/23583901-d389-4eca-a748-f7194ce06fbc_dae547f4-509e-4fd6-a805-b23cb94e2b55_1.parquet, fileFormat=PARQUET, recordCount=16, fileSizeInBytes=7052, partitionPath=bool=true/datetime_hour=0/str_trunc=value_1, partitionSpecId=0, keyMetadata=null, splitOffsets=[4], columnSizes={1=99, 2=106, 3=102, 4=45, 5=80, 9=91, 10=107, 11=83, 12=80, 13=83, 14=115, 16=80, 17=110, 18=121, 19=83, 20=151, 21=49, 23=49, 24=49, 25=49, 26=49}, valueCounts={1=16, 2=16, 3=16, 4=16, 5=16, 9=16, 10=16, 11=16, 12=16, 13=16, 14=16, 16=16, 17=16, 18=16, 19=16, 20=67, 21=16, 23=16, 24=16, 25=16, 26=16}, nullValueCounts={1=0, 2=0, 3=0, 4=0, 5=0, 9=0, 10=0, 11=0, 12=0, 13=0, 14=0, 16=0, 17=0, 18=0, 19=0, 20=3, 21=16, 23=16, 24=16, 25=16, 26=16}, nanValueCounts={17=0, 24=0}, lowerBounds={1=[B@1b0b6e85, 2=[B@76cb6b52, 3=[B@14752039, 4=[B@5b22e2ef, 5=[B@1b1dd048, 9=[B@36a1b982, 10=[B@be82387, 11=[B@47918d8f, 12=[B@32fb8aa5, 13=[B@9ad58d1, 14=[B@4313a678, 16=[B@825b660, 17=[B@66e5bb63, 18=[B@56d0cbe, 19=[B@5b135a0d}, upperBounds={1=[B@7ea061b9, 2=[B@35d5e3ad, 3=[B@58300155, 4=[B@19b2fb6c, 5=[B@64e2fc4a, 9=[B@7e30a3f0, 10=[B@26d22fca, 11=[B@2835d9e7, 12=[B@5bc66007, 13=[B@209267ac, 14=[B@1789b00b, 16=[B@1175b9fc, 17=[B@79db358d, 18=[B@4ac7821d, 19=[B@10146aa6}}}Fixes #33776