Using DataFrame API's anti-join operation to efficiently compare large datasets, achieving up to 2x perf improvement #212
GGraziadei started this conversation in Ideas
Replies: 0 comments
For unordered comparisons where a primary key is defined, I achieved good results comparing two DataFrames using the DataFrame API with an anti-join.
I suppose the improvement comes from the data volume reduction performed on each partition, which greatly reduces the shuffling cost.
I propose this enhancement for the DatasetComparer.
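To make the idea concrete, here is a minimal plain-Python sketch of the comparison logic, not the proposed Spark implementation. It models a left anti-join (in Spark, a join with join type `left_anti`) run in both directions: the two datasets match when neither direction leaves any rows. The helper names `left_anti_join` and `unordered_equal` and the sample rows are hypothetical and used only for illustration.

```python
def left_anti_join(left, right, columns):
    """Return the rows of `left` that have no match in `right` on `columns`."""
    right_keys = {tuple(row[c] for c in columns) for row in right}
    return [row for row in left if tuple(row[c] for c in columns) not in right_keys]


def unordered_equal(actual, expected, columns):
    """Compare two row collections regardless of row order.

    Assumes rows are unique (for example, a primary key is among `columns`),
    because an anti-join does not detect differing duplicate counts.
    """
    missing = left_anti_join(expected, actual, columns)     # expected but absent
    unexpected = left_anti_join(actual, expected, columns)  # present but not expected
    return (not missing and not unexpected), missing, unexpected


# Hypothetical sample data: same rows in a different order, then one changed value.
actual = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
reordered = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
changed = [{"id": 1, "v": "a"}, {"id": 2, "v": "c"}]

same, _, _ = unordered_equal(actual, reordered, ["id", "v"])
differs, missing, unexpected = unordered_equal(actual, changed, ["id", "v"])
```

Only the non-matching rows survive each anti-join, so when the datasets are mostly identical the result that has to be collected is small. One caveat if this is ported to Spark: a plain equality join condition does not match null values, so nullable columns would need null-safe equality (`<=>`), whereas this Python model treats `None` as equal to `None`.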
The following performance results were generated by running the benchmark script on a MacBook M4 Pro with 24 GB of RAM, comparing the current implementation with the one I am proposing.
Each reported data point is the average of 20 independent runs for the corresponding test case, which should smooth out run-to-run variance.
env1: local[8]
env2: local[4]
env3: local[10]
I asked an LLM to generate a graph of the results.
This is the benchmark: