@sgrebnov sgrebnov commented Oct 1, 2025

Which issue does this PR close?

This PR fixes schema mismatch errors when using IcebergCommitExec. Below is an example from DataFusion's DataSinkExec implementation demonstrating that the plan properties must be created from the target schema (here, the count schema), not the input schema.

https://github.com/apache/datafusion/blob/4eacb6046773b759dae0b3d801fe8cb1c6b65c0f/datafusion/datasource/src/sink.rs#L101C1-L117C6

impl DataSinkExec {
    /// Create a plan to write to `sink`
    pub fn new(
        input: Arc<dyn ExecutionPlan>,
        sink: Arc<dyn DataSink>,
        sort_order: Option<LexRequirement>,
    ) -> Self {
        let count_schema = make_count_schema();
        let cache = Self::create_schema(&input, count_schema);
        Self {
            input,
            sink,
            count_schema: make_count_schema(),
            sort_order,
            cache,
        }
    }

Note: I was not able to quickly identify the difference between the initial use case, where IcebergCommitExec is the top plan node and does not fail, and the case where an additional node is added to invalidate the cache after a write, which produces errors. I suspect this behavior is due to DataFusion optimizations or verifications that ensure inputs are correct and compatible. This PR fixes the following error:

An internal error occurred. Internal error: PhysicalOptimizer rule 'OutputRequirements' failed. Schema mismatch. Expected original schema: Schema { fields: [Field { name: "count", data_type: UInt64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, got new schema: Schema { fields: [Field { name: "r_regionkey", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "1"} }, Field { name: "r_name", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "2"} }, Field { name: "r_comment", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"PARQUET:field_id": "3"} }], metadata: {} }.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
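
To illustrate the kind of mismatch reported above, here is a minimal, self-contained sketch. It does not use the DataFusion or Iceberg APIs: `Schema`, `PlanProperties`, `CommitNode`, and `verify_schema` are hypothetical stand-ins, and the check is only an assumed simplification of whatever verification DataFusion performs. It shows a node whose reported schema is the `count` schema while its cached properties were built from the input schema, next to a variant whose properties are built from the count schema.

```rust
use std::sync::Arc;

/// Simplified stand-in for an Arrow schema: only field names are modeled.
#[derive(Debug, PartialEq)]
struct Schema {
    fields: Vec<String>,
}

/// Hypothetical stand-in for a plan node's cached properties.
struct PlanProperties {
    schema: Arc<Schema>,
}

/// Hypothetical model of a commit node that outputs a single `count` column.
struct CommitNode {
    count_schema: Arc<Schema>,
    cache: PlanProperties,
}

fn make_count_schema() -> Arc<Schema> {
    Arc::new(Schema {
        fields: vec!["count".to_string()],
    })
}

impl CommitNode {
    /// Problematic variant: properties are built from the input schema.
    fn new_buggy(input_schema: Arc<Schema>) -> Self {
        Self {
            count_schema: make_count_schema(),
            cache: PlanProperties {
                schema: input_schema,
            },
        }
    }

    /// Corrected variant: properties are built from the output (count) schema.
    fn new_fixed(_input_schema: Arc<Schema>) -> Self {
        let count_schema = make_count_schema();
        Self {
            cache: PlanProperties {
                schema: Arc::clone(&count_schema),
            },
            count_schema,
        }
    }

    /// The schema this node reports as its output.
    fn schema(&self) -> Arc<Schema> {
        Arc::clone(&self.count_schema)
    }

    fn properties(&self) -> &PlanProperties {
        &self.cache
    }
}

/// Assumed simplification of a consistency check: the schema a node reports
/// must equal the schema recorded in its properties.
fn verify_schema(node: &CommitNode) -> Result<(), String> {
    let expected = node.schema();
    let got = Arc::clone(&node.properties().schema);
    if expected == got {
        Ok(())
    } else {
        Err(format!(
            "Schema mismatch. Expected original schema: {:?}, got new schema: {:?}",
            expected, got
        ))
    }
}
```

In this model, `verify_schema(&CommitNode::new_buggy(input))` returns an `Err` for any input schema other than the count schema, while `verify_schema(&CommitNode::new_fixed(input))` returns `Ok(())`, because the properties and the reported schema agree.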

What changes are included in this PR?

Are these changes tested?

@sgrebnov sgrebnov self-assigned this Oct 1, 2025
@sgrebnov sgrebnov changed the title from "Improve IcebergCommitExec to correctly specify properties schema" to "Improve IcebergCommitExec to correctly populate properties schema" Oct 1, 2025
@sgrebnov sgrebnov merged commit cdb2321 into spiceai-0.7.0-rc1 Oct 1, 2025
3 checks passed

sgrebnov commented Oct 2, 2025

Created upstream PR: apache#1721
