Open
Labels
enhancement (New feature or request)
Description
Overview
Native write paths in Spark 3.2/3.3 and the ClickHouse MergeTree writer
recompute the __bucket_value__ expression even when a precomputed
attribute already exists. This adds unnecessary overhead and complicates
downstream projections.
Steps to Reproduce
- Write bucketed data using the native writer.
- Inspect the execution plan; the bucket ID is projected multiple times
instead of reusing a single attribute.
Expected Behavior
The bucket ID should be computed once (e.g., in an initial ProjectExec)
and reused by subsequent stages.
Actual Behavior
Every stage re-evaluates the bucket expression, leading to redundant
projections and performance overhead.
Impact
- Increased CPU time for bucketed writes.
- Harder-to-read execution plans with repetitive projections.
Proposed Fix
- Guard projections in the Spark 3.2/3.3 shims and the ClickHouse MergeTree writer
so they append __bucket_value__ only when it is missing.
- Store the computed bucket ID in an attribute and reuse it downstream.
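The guard can be illustrated with a small, self-contained model. This is only a sketch of the idea, not the actual shim or writer code: `Plan` and `append_bucket_if_missing` are hypothetical names invented for illustration, and a plan is reduced to a tuple of output attribute names plus a counter of how many bucket projections were added. Only the column name `__bucket_value__` comes from this issue.

```python
from dataclasses import dataclass
from typing import Tuple

# Column name taken from this issue; everything else below is hypothetical.
BUCKET_COL = "__bucket_value__"


@dataclass(frozen=True)
class Plan:
    """Toy stand-in for a physical plan node (not a Spark class)."""
    output: Tuple[str, ...]      # names of the output attributes
    bucket_projections: int = 0  # how many times the bucket expression was projected


def append_bucket_if_missing(plan: Plan) -> Plan:
    """Add the bucket column only when the plan does not already expose it."""
    if BUCKET_COL in plan.output:
        # A precomputed attribute exists: reuse it, add no new projection.
        return plan
    # First stage to need the bucket ID: project it exactly once.
    return Plan(plan.output + (BUCKET_COL,), plan.bucket_projections + 1)


# Three write stages in a row all ask for the bucket column.
plan = Plan(output=("id", "name"))
for _ in range(3):
    plan = append_bucket_if_missing(plan)
```

After the three calls, `plan.output` contains `__bucket_value__` once and `plan.bucket_projections` is 1. Without the membership check, each stage would add its own projection, which is the behavior described under Actual Behavior.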