Commit 91cfb69
feat(proto): Add protobuf serialization for HashExpr (#19379)
## Summary
This PR adds protobuf serialization/deserialization support for
`HashExpr`, enabling distributed query execution to serialize hash
expressions used in hash joins and repartitioning.
This is a followup to #18393 which introduced `HashExpr` but did not add
serialization support.
Without it, serializing a query that pushes down dynamic filters from a
`HashJoinExec` fails with an error.
As of #18393 `HashJoinExec` produces filters of the form:
```sql
CASE (hash_repartition % 2)
WHEN 0 THEN
a >= ab AND a <= ab AND
b >= bb AND b <= bb AND
hash_lookup(a,b)
WHEN 1 THEN
a >= aa AND a <= aa AND
b >= ba AND b <= ba AND
hash_lookup(a,b)
ELSE
FALSE
END
```
Where `hash_lookup` is an expression that holds a reference to a given
partition's hash join hash table and checks for membership.
Since we created these new expressions but didn't make any of them
serializable, any attempt to run a distributed query or similar would run
into errors.
In #19300 we fixed `hash_lookup` by replacing it with `true`, since it
can't be serialized across the wire (we'd have to send the entire hash
table). The logic was that this preserves the bounds checks, which are
still valuable.
This PR handles `hash_repartition`, which determines which partition (and
hence which branch of the `CASE` expression) a row belongs to. Unlike
`hash_lookup`, this expression *can* be serialized, so that's what I'm
doing in this PR.
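To make the shape of that filter concrete, here is a minimal Rust sketch of how such a `CASE` expression routes a row to a branch. It is illustrative only and not the DataFusion implementation: `Bounds`, `partition_for`, and `passes_filter` are hypothetical names, std's `DefaultHasher` stands in for the real seeded hash function, and `hash_lookup` is assumed to be already replaced by `true` as described for #19300.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Inclusive (min, max) bounds of the build-side keys for one partition.
/// Hypothetical type, used only for this sketch.
#[derive(Clone, Copy)]
pub struct Bounds {
    pub a: (i64, i64),
    pub b: (i64, i64),
}

/// Stand-in for `hash_repartition % n`: hash the join keys, pick a partition.
/// Assumption: std's DefaultHasher replaces the real seeded hasher.
pub fn partition_for(a: i64, b: i64, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    (a, b).hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Evaluate the CASE expression for one row: select the branch for the row's
/// partition, then apply that partition's bounds checks. `hash_lookup` is
/// treated as `true`, so only the bounds checks remain.
pub fn passes_filter(a: i64, b: i64, per_partition: &[Bounds]) -> bool {
    if per_partition.is_empty() {
        return false; // no branches at all: behave like ELSE FALSE
    }
    let partition = partition_for(a, b, per_partition.len());
    match per_partition.get(partition) {
        Some(bounds) => {
            a >= bounds.a.0 && a <= bounds.a.1 && b >= bounds.b.0 && b <= bounds.b.1
        }
        None => false, // ELSE FALSE
    }
}
```

The sketch shows why the partition hash has to be reproducible on the receiving side: if the remote node hashed with different seeds, a row would be checked against the wrong partition's bounds and could be filtered out incorrectly.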
### Key Changes
- **SeededRandomState wrapper**: Added a `SeededRandomState` struct that
wraps `ahash::RandomState` while preserving the seeds used to create it.
This is necessary because `RandomState` doesn't expose seeds after
creation, but we need them for serialization.
- **Updated seed constants**: Changed `HASH_JOIN_SEED` and
`REPARTITION_RANDOM_STATE` constants to use `SeededRandomState` instead
of raw `RandomState`.
- **HashExpr enhancements**:
  - Changed `HashExpr` to use `SeededRandomState`
  - Added getter methods: `on_columns()`, `seeds()`, `description()`
  - Exported `HashExpr` and `SeededRandomState` from the joins module
- **Protobuf support**:
  - Added `PhysicalHashExprNode` message to `datafusion.proto` with fields
    for `on_columns`, seeds (4 `u64` values), and `description`
  - Implemented serialization in `to_proto.rs`
  - Implemented deserialization in `from_proto.rs`
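The seed-preserving wrapper idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from this PR: `OpaqueState` is a hypothetical stand-in for `ahash::RandomState` (built on std's `DefaultHasher` so the example has no dependencies), and `SeededState` is a hypothetical name for the wrapper.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for a hasher state that hides its seeds once built.
/// Assumption: plays the role of `ahash::RandomState`; it is not that type.
struct OpaqueState {
    keys: [u64; 4],
}

impl OpaqueState {
    fn with_seeds(k0: u64, k1: u64, k2: u64, k3: u64) -> Self {
        Self { keys: [k0, k1, k2, k3] }
    }

    /// Hash a value; the result depends on the seeds and on the value.
    fn hash_one<T: Hash>(&self, value: T) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.keys.hash(&mut hasher);
        value.hash(&mut hasher);
        hasher.finish()
    }
}

/// Wrapper that keeps a copy of the seeds next to the opaque state, so they
/// can be read back later (for example, to write them into a message).
pub struct SeededState {
    inner: OpaqueState,
    seeds: (u64, u64, u64, u64),
}

impl SeededState {
    pub fn with_seeds(k0: u64, k1: u64, k2: u64, k3: u64) -> Self {
        Self {
            inner: OpaqueState::with_seeds(k0, k1, k2, k3),
            seeds: (k0, k1, k2, k3),
        }
    }

    /// The seeds this state was created with.
    pub fn seeds(&self) -> (u64, u64, u64, u64) {
        self.seeds
    }

    /// Delegate hashing to the wrapped state.
    pub fn hash_one<T: Hash>(&self, value: T) -> u64 {
        self.inner.hash_one(value)
    }
}
```

Because the seeds are stored alongside the state, a receiver that is handed only the four `u64` values can rebuild an equivalent state with `SeededState::with_seeds` and compute the same hashes as the sender.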
## Test plan
- [x] Added roundtrip test in `roundtrip_physical_plan.rs` that creates
a `HashExpr`, serializes it, deserializes it, and verifies the result
- [x] All existing hash join tests pass (583 tests)
- [x] All proto roundtrip tests pass
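For readers unfamiliar with roundtrip tests, the idea is roughly the following. This is a dependency-free sketch, not the test added in `roundtrip_physical_plan.rs`: `HashExprNode` is a hypothetical plain-Rust mirror of the `PhysicalHashExprNode` message (the field names `seed0` to `seed3` are assumptions), and `SimpleHashExpr` is a simplified stand-in for `HashExpr` that identifies columns by name instead of holding physical expressions.

```rust
/// Hypothetical mirror of the message: hashed columns, four u64 seeds,
/// and a description. Field names are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct HashExprNode {
    pub on_columns: Vec<String>,
    pub seed0: u64,
    pub seed1: u64,
    pub seed2: u64,
    pub seed3: u64,
    pub description: String,
}

/// Simplified stand-in for `HashExpr`.
#[derive(Debug, Clone, PartialEq)]
pub struct SimpleHashExpr {
    on_columns: Vec<String>,
    seeds: (u64, u64, u64, u64),
    description: String,
}

impl SimpleHashExpr {
    pub fn new(on_columns: Vec<String>, seeds: (u64, u64, u64, u64), description: &str) -> Self {
        Self { on_columns, seeds, description: description.to_string() }
    }

    /// "Serialize": copy every field needed to rebuild the expression.
    pub fn to_node(&self) -> HashExprNode {
        HashExprNode {
            on_columns: self.on_columns.clone(),
            seed0: self.seeds.0,
            seed1: self.seeds.1,
            seed2: self.seeds.2,
            seed3: self.seeds.3,
            description: self.description.clone(),
        }
    }

    /// "Deserialize": rebuild the expression from the node's fields.
    pub fn from_node(node: &HashExprNode) -> Self {
        Self {
            on_columns: node.on_columns.clone(),
            seeds: (node.seed0, node.seed1, node.seed2, node.seed3),
            description: node.description.clone(),
        }
    }
}
```

A roundtrip check then builds an expression, converts it with `to_node`, converts back with `from_node`, and asserts that the result equals the original.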
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <[email protected]>

1 parent 887aa9f commit 91cfb69
File tree
13 files changed, +670 -29 lines changed
- datafusion
  - physical-plan/src
    - joins
      - hash_join
    - repartition
  - proto
    - proto
    - src
      - generated
      - physical_plan
    - tests/cases
0 commit comments