Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Dec 2, 2025

Which issue does this PR close?

Closes #1927

Rationale for this change

explode is widely used in Spark jobs that work with arrays.

What changes are included in this PR?

Add support for explode for arrays, but not maps yet. I filed #2837 for adding support for maps.

High level changes:

  • New Explode operator in protobuf
  • CometExplodeExec contains serde code for JVM
  • Native code delegates to UnnestExec
  • Tests are in new CometGenerateExecSuite

How are these changes tested?

metrics-explode
OpenJDK 64-Bit Server VM 17.0.17+10-Ubuntu-122.04 on Linux 6.8.0-87-generic
AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
explode                                              65             68           2          3.2         316.9       1.0X
explode: Comet                                       48             52           2          4.2         236.6       1.3X

@andygrove andygrove changed the title feat: Add support for explode and explode_outer [WIP] feat: Add support for explode and explode_outer Dec 2, 2025
@andygrove andygrove changed the title feat: Add support for explode and explode_outer feat: Add support for explode and explode_outer for array inputs Dec 2, 2025
@mbutrovich
Copy link
Contributor

mbutrovich commented Dec 2, 2025

Awesome stuff @andygrove! DataFusion has unnest for arrays and maps, but I don't know if the semantics perfectly match Spark explode behavior https://datafusion.apache.org/user-guide/sql/special_functions.html

@andygrove andygrove changed the title feat: Add support for explode and explode_outer for array inputs feat: Add support for explode and explode_outer for array inputs [WIP] Dec 2, 2025
@andygrove
Copy link
Member Author

Awesome stuff @andygrove! DataFusion has unnest for arrays and maps, but I don't know if the semantics perfectly match Spark explode behavior https://datafusion.apache.org/user-guide/sql/special_functions.html

Thanks! I am trying this out now.

@codecov-commenter
Copy link

codecov-commenter commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 68.05556% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.19%. Comparing base (f09f8af) to head (a65a432).
⚠️ Report is 736 commits behind head on main.

Files with missing lines Patch % Lines
...n/scala/org/apache/spark/sql/comet/operators.scala 66.66% 10 Missing and 13 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2836      +/-   ##
============================================
+ Coverage     56.12%   59.19%   +3.06%     
- Complexity      976     1473     +497     
============================================
  Files           119      167      +48     
  Lines         11743    15307    +3564     
  Branches       2251     2530     +279     
============================================
+ Hits           6591     9061    +2470     
- Misses         4012     4952     +940     
- Partials       1140     1294     +154     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove
Copy link
Member Author

Awesome stuff @andygrove! DataFusion has unnest for arrays and maps, but I don't know if the semantics perfectly match Spark explode behavior https://datafusion.apache.org/user-guide/sql/special_functions.html

Thanks! I am trying this out now.

There does appear to be a different in the handling of empty arrays in in the explode_outer case, but we can fall back to Spark for now for that case until we can get a fix into DataFusion core.

@andygrove
Copy link
Member Author

Awesome stuff @andygrove! DataFusion has unnest for arrays and maps, but I don't know if the semantics perfectly match Spark explode behavior https://datafusion.apache.org/user-guide/sql/special_functions.html

Thanks! I am trying this out now.

There does appear to be a different in the handling of empty arrays in in the explode_outer case, but we can fall back to Spark for now for that case until we can get a fix into DataFusion core.

I filed an issue in DataFusion: apache/datafusion#19053

@andygrove andygrove changed the title feat: Add support for explode and explode_outer for array inputs [WIP] feat: Add support for explode and explode_outer for array inputs Dec 2, 2025
@andygrove andygrove marked this pull request as ready for review December 2, 2025 21:32
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove it mostly looks good

@@ -0,0 +1,4 @@
SELECT i_item_sk, explode(array(i_brand_id, i_class_id, i_category_id, i_manufact_id, i_manager_id))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also have explode_outer?

));
};

// Create projection expressions for other columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other columns? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, as in SELECT a, b, c, explode(d) FROM ...

.collect();

// Add the array column as the last column
let array_col_name = format!("col_{}", projections.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this name cause a conflict or issue if original dataset has col_* cols?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this now and preserve original names

output_fields.push(Field::new(
array_field.name(),
element_type,
true, // Element is nullable after unnesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@andygrove
Copy link
Member Author

One Spark SQL test needs rewriting to work with Comet. I am working on it.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove just one nit on: if we need to have a microbenchmark on explode_outer but it can be done in follow up PR

@andygrove
Copy link
Member Author

Thanks @andygrove just one nit on: if we need to have a microbenchmark on explode_outer but it can be done in follow up PR

Thanks. I added this to the scope of #2838.

@andygrove andygrove merged commit 0bda9d2 into apache:main Dec 5, 2025
115 checks passed
@andygrove andygrove deleted the explode branch December 5, 2025 18:13
@bjornjorgensen
Copy link
Contributor

The supported list is not updated

|explode |{FAILED, [{SELECT explode(array(10, 20));, Unsupported}]} |

@comphead
Copy link
Contributor

comphead commented Dec 5, 2025

The supported list is not updated

|explode |{FAILED, [{SELECT explode(array(10, 20));, Unsupported}]} |

Thanks @bjornjorgensen this is autogenerated doc thats being triggered manually, I'm actually thinking we can deprecate it

@bjornjorgensen
Copy link
Contributor

ohh.. ok. I just have a look at this project and read "You may have a specific expression in mind that you’d like to add, but if not, you can review the expression coverage document to see which expressions are not yet supported." https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html

@comphead
Copy link
Contributor

comphead commented Dec 5, 2025

ohh.. ok. I just have a look at this project and read "You may have a specific expression in mind that you’d like to add, but if not, you can review the expression coverage document to see which expressions are not yet supported." https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html

Thanks for reporting this, the correct doc would be https://github.com/apache/datafusion-comet/blob/main/docs/spark_expressions_support.md

I updated the manual #2854

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for Spark SQL explode expression

5 participants