Avoid duplicate bucket ID projection in native write paths #10359

@hotcodemacha

Overview

Native write paths in Spark 3.2/3.3 and the ClickHouse MergeTree writer
recompute the __bucket_value__ expression even when a precomputed
attribute already exists. This adds unnecessary overhead and complicates
downstream projections.

Steps to Reproduce

  • Write bucketed data using the native writer.
  • Inspect the execution plan; the bucket ID is projected multiple times
    instead of reusing a single attribute.

Expected Behavior

The bucket ID should be computed once (e.g., in an initial ProjectExec)
and reused by subsequent stages.

Actual Behavior

Every stage re-evaluates the bucket expression, leading to redundant
projections and performance overhead.

Impact

  • Increased CPU time for bucketed writes.
  • Harder-to-read execution plans with repetitive projections.

Proposed Fix

  • Guard the projections in the Spark 3.2/3.3 shims and the ClickHouse
    MergeTree writer so that they append __bucket_value__ only when the
    output does not already contain it.
  • Store the computed bucket ID in an attribute and reuse it downstream.
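
The guard described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `Column` and `with_bucket_column` are hypothetical names made up for the example, not the actual shim or writer code, and the bucket expression string is purely illustrative. Only the `__bucket_value__` column name comes from this issue.

```python
from dataclasses import dataclass
from typing import List

BUCKET_COLUMN = "__bucket_value__"  # column name taken from the issue text


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a plan output attribute or named expression."""
    name: str
    expr: str  # textual description of how the column is produced


def with_bucket_column(output: List[Column], bucket_expr: str) -> List[Column]:
    """Return the projection list, appending the bucket column only if absent."""
    if any(col.name == BUCKET_COLUMN for col in output):
        # A precomputed attribute already exists: reuse it and add nothing.
        return list(output)
    return list(output) + [Column(BUCKET_COLUMN, bucket_expr)]


# Illustrative input; the expression text is an assumption, not real plan output.
base = [Column("id", "id"), Column("name", "name")]
once = with_bucket_column(base, "pmod(hash(id), 4)")
twice = with_bucket_column(once, "pmod(hash(id), 4)")  # second stage reuses it

print([c.name for c in twice])  # ['id', 'name', '__bucket_value__']
```

Because the check is by column name, applying the guard in several stages is idempotent: the first stage adds the attribute and later stages pass the list through unchanged.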

Labels: enhancement (New feature or request)
