[Parquet] Optimize appending max level comparison in DefinitionLevelDecoder #9217
```rust
None => {
    for def_level in def_levels {
        bitmap_builder.append(*def_level >= self.struct_def_level)
        // Safety: slice iterator has a trusted length
```
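The snippet above builds the struct's null mask from repetition/definition levels: a definition level at or above the struct's own level marks a present (non-null) value. A minimal standalone sketch of that comparison, with a hypothetical helper name (the real code appends directly into arrow's `BooleanBufferBuilder`):

```rust
// Hypothetical sketch: compute the presence mask that the loop above
// appends bit by bit. A level >= the struct's definition level means
// the struct value is present at that position.
fn build_present_mask(def_levels: &[i16], struct_def_level: i16) -> Vec<bool> {
    def_levels
        .iter()
        .map(|&def_level| def_level >= struct_def_level)
        .collect()
}

fn main() {
    // Levels below 2 represent a null struct, 2 and above a present one.
    let mask = build_present_mask(&[0, 1, 2, 3, 2], 2);
    assert_eq!(mask, vec![false, false, true, true, true]);
}
```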
My benchmark did not actually cover this code, since it reads the leaf column as a primitive array. That would mean the performance might be even worse than measured, but also the improvement would be a bit bigger.
run benchmark arrow_reader arrow_reader_row_filter arrow_reader_clickbench

🤖: Benchmark completed
alamb left a comment:
Love it -- thank you @jhorstmann and @Dandandan
```rust
for i in &levels[start..] {
    nulls.append(i == max_level);
    // Safety: slice iterator has a trusted length
    unsafe {
```
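The optimization replaces the per-element `append` with a bulk extend from an iterator whose length is known up front. A hedged stand-in sketch of that shape (the struct and method names here are hypothetical; the real code uses arrow's `BooleanBufferBuilder::extend_trusted_len` with an `unsafe` trusted-length contract):

```rust
// Minimal stand-in for a boolean buffer builder, illustrating why a
// known-length extend beats appending one bit at a time: storage is
// reserved once instead of being grown per element.
struct BitBuilder {
    words: Vec<u64>,
    len: usize,
}

impl BitBuilder {
    fn new() -> Self {
        Self { words: Vec::new(), len: 0 }
    }

    // Safe analogue of extend_trusted_len: ExactSizeIterator gives us the
    // length without any unsafe contract, at the cost of a stricter bound.
    fn extend_exact_len<I: ExactSizeIterator<Item = bool>>(&mut self, iter: I) {
        let new_len = self.len + iter.len();
        self.words.resize((new_len + 63) / 64, 0);
        for (i, bit) in iter.enumerate() {
            let pos = self.len + i;
            if bit {
                self.words[pos / 64] |= 1 << (pos % 64);
            }
        }
        self.len = new_len;
    }

    fn get(&self, pos: usize) -> bool {
        self.words[pos / 64] >> (pos % 64) & 1 == 1
    }
}

fn main() {
    let levels: [i16; 5] = [3, 1, 3, 3, 0];
    let max_level = 3;
    let mut nulls = BitBuilder::new();
    // One reservation for the whole slice, then bits are set in place.
    nulls.extend_exact_len(levels.iter().map(|&l| l == max_level));
    assert!(nulls.get(0) && !nulls.get(1) && nulls.get(2));
    assert!(nulls.get(3) && !nulls.get(4));
}
```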
Another thing that we could potentially try is to evaluate to a bitmap first and then append that bitmap directly via set_bits or something -- though now that I type this I am not sure that would be any better 🤔
That would be interesting to try out. It would probably only be faster when using explicit SIMD intrinsics to compute that bitmap.
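The "evaluate to a bitmap first" idea discussed above can be sketched in plain scalar Rust: compare up to 64 levels at a time into a single `u64` word, which could then be appended to the buffer word-wise (e.g. via something like the `set_bits` mentioned). Function name and shape are hypothetical; as the comment notes, this likely only wins with explicit SIMD intrinsics for the comparison:

```rust
// Hypothetical sketch: pack level comparisons into 64-bit bitmap words
// before appending, instead of appending one boolean at a time.
fn levels_to_bitmap(levels: &[i16], max_level: i16) -> Vec<u64> {
    levels
        .chunks(64)
        .map(|chunk| {
            let mut word = 0u64;
            for (i, &level) in chunk.iter().enumerate() {
                // Set bit i when the value at this position is present.
                word |= ((level == max_level) as u64) << i;
            }
            word
        })
        .collect()
}

fn main() {
    // Alternate present/null levels across 70 positions (crosses a word boundary).
    let levels: Vec<i16> = (0..70).map(|i| if i % 2 == 0 { 3 } else { 1 }).collect();
    let words = levels_to_bitmap(&levels, 3);
    assert_eq!(words.len(), 2);
    // Even bit positions set in the full first word.
    assert_eq!(words[0], 0x5555_5555_5555_5555);
    // Remaining 6 levels: bits 0, 2, 4 set.
    assert_eq!(words[1], 0b010101);
}
```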
Which issue does this PR close?
Rationale for this change
Profiling showed this loop to be a clear performance hotspot, and thanks to the new `BooleanBufferBuilder::extend_trusted_len` method introduced in #9137 there is a very simple improvement.

Benchmark results for optional structs after the fix:
This is a big improvement, but still significantly slower than reading non-nested data. The main hotspot is still `extend_trusted_len`, and manual SIMD code for comparing chunks and appending 64 bits at a time could potentially speed it up even more. But that would require either architecture-specific intrinsics or unstable features.

What changes are included in this PR?
Are these changes tested?
Tested by existing tests.
Are there any user-facing changes?
No.