@EpsilonPrime
Contributor

Summary

This PR refactors how output schema conformance (nullability and type casting) is
enforced for the ClickHouse backend. Previously, an output_schema field was added to the
RelRoot proto message, and the C++ parser implicitly appended a final projection step
whenever the types didn't match. That approach was non-standard and hid the type
conversion from the plan.

Now, when the expected output schema differs from the child plan's output (e.g., in
union operations), an explicit ProjectRel with cast expressions is added to the
Substrait plan on the Spark side. This makes the type enforcement visible in the plan
and follows standard Substrait conventions.
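
The decision described above can be sketched in a few lines. This is a language-neutral model in Python, not the code in WholeStageTransformer: the `Field`, `Ref`, and `Cast` classes and the `output_cast_projection` function are hypothetical names introduced only for illustration, and the type strings are placeholders. It assumes a mismatch means a difference in either data type or nullability at the same column position.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one column of a schema."""
    name: str
    dtype: str       # placeholder type name, e.g. "i32", "i64", "string"
    nullable: bool


@dataclass(frozen=True)
class Ref:
    """Pass the child column at `index` through unchanged."""
    index: int


@dataclass(frozen=True)
class Cast:
    """Cast the child column at `index` to the expected type/nullability."""
    index: int
    target: Field


Expr = Union[Ref, Cast]


def output_cast_projection(child: List[Field],
                           expected: List[Field]) -> Optional[List[Expr]]:
    """Return projection expressions, or None if no projection is needed."""
    if len(child) != len(expected):
        raise ValueError("child and expected schemas must have the same arity")

    exprs: List[Expr] = []
    needs_projection = False
    for i, (c, e) in enumerate(zip(child, expected)):
        # Column names are ignored; only type and nullability are compared.
        if (c.dtype, c.nullable) == (e.dtype, e.nullable):
            exprs.append(Ref(i))
        else:
            exprs.append(Cast(i, e))
            needs_projection = True

    # Only emit an explicit projection when at least one column differs.
    return exprs if needs_projection else None
```

When the schemas already agree, the function returns `None` and the plan is left untouched, so the extra ProjectRel appears only in plans that actually need a conversion.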

Changes

  • Add createOutputCastProjectRel() in WholeStageTransformer to generate a ProjectRel with casts when needed
  • Remove output_schema field from RelRoot proto message
  • Remove outputSchema parameter from PlanBuilder and PlanNode
  • Remove implicit type conversion logic from ClickHouse's SerializedPlanParser::adjustOutput()
  • Remove needOutputSchemaForPlan() from BackendSettingsApi and CHBackend

Benefits

  • Explicit over implicit: Type conversions are visible as a ProjectRel in the plan
  • Standard Substrait: No longer using a Gluten-specific extension to RelRoot
  • Simpler native code: ClickHouse parses the plan without special post-processing
  • Better debugging: The plan clearly shows where casts occur

Test Plan

  • Verify ClickHouse union tests pass (issue-1874 regression tests)
  • Verify nullable column handling in union operations
  • Run ClickHouse TPCH test suite
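
As an illustration of what the nullable-column check is looking for, the sketch below models a union whose inputs disagree on nullability. It assumes Spark-style union semantics, where an output column is nullable if it is nullable in any input. The function names are hypothetical and the code is not part of the Gluten test suite.

```python
from typing import List


def union_output_nullability(children: List[List[bool]]) -> List[bool]:
    """Per-column nullability of a union: nullable if any input is nullable."""
    if not children:
        raise ValueError("a union needs at least one input")
    width = len(children[0])
    if any(len(child) != width for child in children):
        raise ValueError("all union inputs must have the same number of columns")
    return [any(child[i] for child in children) for i in range(width)]


def inputs_needing_conformance(children: List[List[bool]]) -> List[int]:
    """Indices of inputs whose nullability differs from the union output."""
    expected = union_output_nullability(children)
    return [idx for idx, child in enumerate(children) if child != expected]


# Two inputs with two columns each; only the second input has a nullable
# first column, so the first input does not match the union's output schema.
left = [False, True]
right = [True, True]
print(union_output_nullability([left, right]))
print(inputs_needing_conformance([left, right]))
```

In this model the first input is the one that would need an explicit conformance step, which is the kind of case the union regression tests are meant to exercise.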

@github-actions bot added the CORE and CLICKHOUSE labels on Dec 10, 2025
@github-actions
Run Gluten Clickhouse CI on x86

@EpsilonPrime marked this pull request as ready for review on December 11, 2025 17:06
EpsilonPrime pushed a commit to EpsilonPrime/gluten that referenced this pull request Dec 12, 2025
Updates documentation to reflect completed PRs:
- PR apache#11277: Remove unused enable_row_group_maxmin_index
- PR apache#11278: Replace output_schema with ProjectRel

Changes:
1. SubstraitDiffAnalysis.md:
   - Mark completed migrations with ✅
   - Update diff count: 262 → ~200 lines
   - Reorganize priority matrix into completed/pending
   - Add updated recommendations post-PR completion
   - Add progress metrics tracking

2. SubstraitUnfork-NextSteps.md (NEW):
   - Actionable next steps ranked by effort/impact
   - Recommended path: Upgrade to v0.77.0 first
   - Incremental alternatives with time estimates
   - 10 specific tasks with step-by-step guidance
   - Decision framework for upgrade vs incremental
   - Progress tracker table
   - Success criteria checklist

Next recommended actions:
1. Verify JOIN_TYPE changes (30 min quick win)
2. Upgrade to v0.77.0 for free wins (6-8 hours)
3. Migrate column_types to AdvancedExtension (2-3 hours)

Estimated remaining effort: 40-60 hours for complete unfork
Target: <100 line diff or all modifications in AdvancedExtension
Labels

CLICKHOUSE, CORE, DOCS
