Skip to content

column types must match schema types occurs after unnest_columns on another column #14218

@ion-elgreco

Description

@ion-elgreco

Describe the bug

I am rewriting our CDF operation in delta-rs, the code looks roughly like this:

    let mut projected = if should_cdc {
        operation_count
            .clone()
            .with_column(
                CDC_COLUMN_NAME,
                when(col(TARGET_DELETE_COLUMN).is_null(), lit("delete")) // nulls are equal to True
                    .when(col(DELETE_COLUMN).is_null(), lit("source_delete"))
                    .when(col(TARGET_COPY_COLUMN).is_null(), lit("copy"))
                    .when(col(TARGET_INSERT_COLUMN).is_null(), lit("insert"))
                    .when(col(TARGET_UPDATE_COLUMN).is_null(), lit("update"))
                    .end()?,
            )?
            // .drop_columns(&["__delta_rs_path"])? // WEIRD bug caused by interaction with unnest_columns, has to be dropped otherwise throws schema error
            .with_column(
                "__delta_rs_update_expanded",
                when(
                    col(CDC_COLUMN_NAME).eq(lit("update")),
                    lit(ScalarValue::List(ScalarValue::new_list(
                        &[
                            ScalarValue::Utf8(Some("update_preimage".into())),
                            ScalarValue::Utf8(Some("update_postimage".into())),
                        ],
                        &DataType::List(Field::new("element", DataType::Utf8, false).into()),
                        true,
                    ))),
                )
                .end()?,
            )?
            .unnest_columns(&["__delta_rs_update_expanded"])?
            .with_column(
                CDC_COLUMN_NAME,
                when(
                    col(CDC_COLUMN_NAME).eq(lit("update")),
                    col("__delta_rs_update_expanded"),
                )
                .otherwise(col(CDC_COLUMN_NAME))?,
            )?
            .drop_columns(&["__delta_rs_update_expanded"])?
            .select(write_projection_with_cdf)?

I noticed that when I do unnest_columns on another column, it complains afterwards about a schema error:

Result::unwrap()` on an `Err` value: Arrow { source: InvalidArgumentError("column types must match schema types, expected Utf8 but found Dictionary(UInt16, Utf8) at column index 7") }

Since I don't need the column, I can safely drop it beforehand, but I don't understand why doesn't Dictionary(UInt16, Utf8) just coerce to utf8?

To Reproduce

Bit difficult but, you can run grab my branch: https://github.com/ion-elgreco/delta-rs/tree/refactor--combine_execution_plans

And then you run the test test_merge_cdc_enabled_simple, with this line commented out: .drop_columns(&["__delta_rs_path"])?

Expected behavior

I guess coerce gracefully?

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions