
[SPARK-55558][SQL] Add support for Tuple/Theta set operations #54338

Open
cboumalh wants to merge 8 commits into apache:master from cboumalh:cboumalh-tuple-theta-sketch

@cboumalh
Contributor

@cboumalh cboumalh commented Feb 16, 2026

What changes were proposed in this pull request?

This PR adds support for set operations (union, intersection, difference) between Apache DataSketches TupleSketch and ThetaSketch objects. Specifically, it adds the following new SQL functions:

New Functions:

tuple_union_theta_double - Unions a TupleSketch (with double summaries) with a ThetaSketch
tuple_union_theta_integer - Unions a TupleSketch (with integer summaries) with a ThetaSketch
tuple_intersection_theta_double - Intersects a TupleSketch (with double summaries) with a ThetaSketch
tuple_intersection_theta_integer - Intersects a TupleSketch (with integer summaries) with a ThetaSketch
tuple_difference_theta_double - Subtracts a ThetaSketch from a TupleSketch (with double summaries)
tuple_difference_theta_integer - Subtracts a ThetaSketch from a TupleSketch (with integer summaries)

https://issues.apache.org/jira/browse/SPARK-55558

Why are the changes needed?

This PR extends the existing TupleSketch functionality to support interoperability with ThetaSketch objects. Previously, users could perform set operations only between two TupleSketches or between two ThetaSketches, not across the two sketch types. This enhancement allows users to:

  1. Combine cardinality estimates from TupleSketch (which tracks summaries) with ThetaSketch (cardinality-only)
  2. Perform set operations across different sketch types in the same query
  3. Migrate from ThetaSketch to TupleSketch incrementally by supporting mixed operations

Does this PR introduce any user-facing change?

Yes. This PR adds six new SQL functions for performing set operations between TupleSketch and ThetaSketch objects:

-- Union a TupleSketch with a ThetaSketch
SELECT tuple_sketch_estimate_double(
  tuple_union_theta_double(
    tuple_sketch_agg_double(col1, val1),
    theta_sketch_agg(col2)))
FROM table;

-- Intersect a TupleSketch with a ThetaSketch
SELECT tuple_sketch_estimate_double(
  tuple_intersection_theta_double(
    tuple_sketch_agg_double(col1, val1),
    theta_sketch_agg(col2)))
FROM table;

-- Subtract a ThetaSketch from a TupleSketch
SELECT tuple_sketch_estimate_double(
  tuple_difference_theta_double(
    tuple_sketch_agg_double(col1, val1),
    theta_sketch_agg(col2)))
FROM table;
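
As a rough illustration of what the three queries above compute, the sketch below models the set semantics with exact Python containers instead of probabilistic sketches. It is a toy model, not the PR's implementation: a TupleSketch is represented as a dict of key to double summary, a ThetaSketch as a set of keys, and `DEFAULT_SUMMARY` is a hypothetical value for keys that appear only on the ThetaSketch side (the PR description does not say how such summaries are assigned).

```python
# Toy model only: exact Python containers stand in for probabilistic sketches.
# A "tuple sketch" is a dict of key -> double summary.
# A "theta sketch" is a set of keys with no summaries.

DEFAULT_SUMMARY = 0.0  # hypothetical summary for keys seen only in the theta sketch


def union(tuple_sketch, theta_sketch, default=DEFAULT_SUMMARY):
    """Keys from either side; theta-only keys get the assumed default summary."""
    result = dict(tuple_sketch)
    for key in theta_sketch:
        result.setdefault(key, default)
    return result


def intersection(tuple_sketch, theta_sketch):
    """Keys present on both sides, keeping the tuple-side summaries."""
    return {k: v for k, v in tuple_sketch.items() if k in theta_sketch}


def difference(tuple_sketch, theta_sketch):
    """Tuple-side keys that are absent from the theta side."""
    return {k: v for k, v in tuple_sketch.items() if k not in theta_sketch}


tuple_side = {"a": 1.5, "b": 2.0, "c": 4.0}
theta_side = {"b", "c", "d"}

# The number of retained keys plays the role of the distinct-count estimate.
print(len(union(tuple_side, theta_side)))         # 4
print(len(intersection(tuple_side, theta_side)))  # 2
print(len(difference(tuple_side, theta_side)))    # 1
```

In this model the result is always tuple-shaped (keys with summaries), which matches the SQL examples passing the result to tuple_sketch_estimate_double. Real sketches return approximate estimates rather than exact counts.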

How was this patch tested?

Comprehensive SQL tests were added in sql-tests/inputs/tuplesketch.sql.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.5

@cboumalh cboumalh changed the title from "[SPARK-55558][SQL] Add Support for Tuple/Theta Set Operations" to "[SPARK-55558][SQL] Add support for Tuple/Theta set operations" on Feb 16, 2026
@cboumalh
Contributor Author

cc @dtenedor @cloud-fan

@dtenedor
Contributor

I asked a question offline. We can think about the tradeoff between the new functionality these functions add and the extra complexity they bring to the API.

@dtenedor
Contributor

We talked offline, and it looks like there is a valid use case for this. We can proceed with the review.

@dtenedor
Contributor

@cboumalh do you want to sync the latest master branch contents and push a commit to resolve the merge conflicts?

Contributor

@dtenedor dtenedor left a comment

Thanks for working on this! Generally looks OK. I tried to think of testing ideas.

@cboumalh cboumalh requested a review from dtenedor February 19, 2026 14:48
@cboumalh
Contributor Author

@dtenedor Thanks for taking a look! I addressed your comments. Please let me know if there is anything I missed 👍

2 participants