chore: Add spark compatible MapSort function along with limited support for grouping on Map type #2221
base: main
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2221 +/- ##
============================================
+ Coverage 56.12% 58.54% +2.41%
- Complexity 976 1281 +305
============================================
Files 119 143 +24
Lines 11743 13265 +1522
Branches 2251 2367 +116
============================================
+ Hits 6591 7766 +1175
- Misses 4012 4265 +253
- Partials 1140 1234 +94
PR Build (macOS) / macos-14/Spark 3.4, JDK 11, Scala 2.12 [exec] (pull_request) Looks like this one failed for,
Thanks @rishvin. I feel this PR would be hard to review. Would it be possible to break it down into smaller parts, one function at a time?
Thanks @comphead for the feedback. Yes, I will open individual PRs for each function. I will also keep this PR open as a reference for how those functions are integrated at the moment. Later, I can use this same PR to make the final integration changes. I will change the status of this PR to draft for now.
Which issue does this PR close?
Closes #1941
Rationale for this change
This PR introduces CometMapSort, a Spark-compatible MapSort function. Spark's MapSort was introduced in Spark 4.0 to allow grouping on the Map type. It does this by sorting the map by its keys before the group-by is performed.
Today, DataFusion/Arrow does not support grouping on the map type, so executing CometMapSort as a grouping expression will fail. To make it work, this PR introduces additional changes that allow grouping on the map type; however, the support is limited at the moment.
What changes are included in this PR?
- map_sort: a scalar function that sorts the Map type. This is compatible with Spark's MapSort.
- map_to_list: a scalar function that converts the Map type to List<Struct<K, V>>. Grouping expressions on a Map type are wrapped with this function before being passed to the hash aggregate. The conversion from Map to List should be cheap because the physical layout is maintained.
- map_from_list: a scalar function that converts List<Struct<K, V>> back to the Map type. It is applied to the output of the hash aggregate so that the grouping keys returned by the hash aggregate are still of Map type. This is important for schema consistency. This conversion should also be cheap.
- HashAggregateMapConverter: provides helpers to wrap grouping expressions with map_to_list and to wrap the output of the hash aggregate with map_from_list to get the map type back.
Limitations (Future work)
- Partial aggregation is executed natively; however, Final aggregation falls back to Spark. This is due to the unsupported map type in the result expression, which is sent for the final aggregation only. This work can be extended to support Final aggregation in the future.
How are these changes tested?
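To make the idea behind map_sort concrete, here is a minimal standalone sketch in plain Rust. It is not the code from this PR and does not use Arrow or DataFusion. It only models a map as an ordered list of (key, value) entries, loosely mirroring the List<Struct<K, V>> layout described above, and shows why sorting by key lets two maps with the same contents fall into the same group. The names MapEntries and group_count are hypothetical, and the String/i64 entry types are an assumption chosen for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a map value: an ordered list of (key, value)
/// entries, loosely mirroring a List<Struct<K, V>> layout.
type MapEntries = Vec<(String, i64)>;

/// Sort the entries by key so that maps with equal contents but a different
/// entry order end up with an identical representation.
fn map_sort(mut entries: MapEntries) -> MapEntries {
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    entries
}

/// Count rows per distinct map, using the key-sorted entry list as the
/// grouping key (assumption: a simple COUNT(*) style aggregate).
fn group_count(rows: Vec<MapEntries>) -> HashMap<MapEntries, usize> {
    let mut groups = HashMap::new();
    for row in rows {
        *groups.entry(map_sort(row)).or_insert(0) += 1;
    }
    groups
}

fn main() {
    let rows = vec![
        vec![("b".to_string(), 2), ("a".to_string(), 1)],
        vec![("a".to_string(), 1), ("b".to_string(), 2)],
        vec![("c".to_string(), 3)],
    ];
    // The first two rows hold the same map in a different entry order,
    // so after sorting they share one grouping key.
    for (key, count) in &group_count(rows) {
        println!("{:?} -> {}", key, count);
    }
}
```

Without the sort step, the first two rows would hash to different keys even though they represent the same map, which is the problem the sorting step addresses.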