Skip to content

Conversation

@tonyalaribe
Copy link
Contributor

Hi @sunng87, thanks for this really useful project you created.

At the moment, Dictionary types are not supported in datafusion-postgres.
But they're important since projects like delta-rs represent partition columns as Dictionary types. Specifically Dictionary<UInt16, Utf8> for Utf8 columns.

So running a query via datafusion-postgres would result in an error such as:

UserError(
    ErrorInfo {
        severity: "ERROR",
        code: "XX000",
        message: "Unsupported Datatype Dictionary(UInt16, Utf8) and array DictionaryArray {
    keys: PrimitiveArray<UInt16>
    [
        0,
    ]
    values: StringArray
    [
        \"test_project\",
    ]
}",
    }
)

In the datafusion codebase: https://github.com/apache/datafusion/blob/main/datafusion/common/src/dfschema.rs#L665

Dictionary types are treated as logically equal to their values:

(DataType::Dictionary(_, v1), DataType::Dictionary(_, v2)) => {
                v1.as_ref() == v2.as_ref()
            }
            (DataType::Dictionary(_, v1), othertype) => v1.as_ref() == othertype,
            (othertype, DataType::Dictionary(_, v1)) => v1.as_ref() == othertype,

I'm not very sure what the edge cases would be, i.e. what other Postgres types could be represented as a Dictionary type. I assume postgres hstore would be encoded as a Map and not a Dictionary.

In anycase, this PR resolves the error from parsing delta-rs partition columns through datafusion-postgres.

@tonyalaribe
Copy link
Contributor Author

tonyalaribe commented Apr 9, 2025

Alternatively, we can treat only a Dictionary<UInt16, _> as the special case and leave other dictionary types to be encoded differently. Let me know any direction you prefer for the implementation

@sunng87
Copy link
Member

sunng87 commented Apr 10, 2025

@tonyalaribe Thank you for the patch! Could you give me some background about Dictionary<Uint16, Utf8>, why we need treat it specially, and the use-case? When working with GreptimeDB, I had little experience with this type.

@tonyalaribe
Copy link
Contributor Author

@tonyalaribe Thank you for the patch! Could you give me some background about Dictionary<Uint16, Utf8>, why we need to treat it specially, and the use-case? When working with GreptimeDB, I had little experience with this type.

Hi @sunng87, This happens because delta-rs transforms partitioned columns into Dictionaries.

If a column is Utf8 in the original schema but is marked to be a partition column, delta-rs would internally change the column type to Dictionary<Uint16, Utf8>. This is done for all partitioned columns, except Booleans columns.

Here's the code where this conversion is done in delta-rs:

https://github.com/delta-io/delta-rs/blob/64f6369107a2fbc6f8c566ac2fc9786b0f1498bd/crates/core/src/delta_datafusion/mod.rs#L223

@tonyalaribe
Copy link
Contributor Author

Here's another relevant bit of conversation:

"Arrow should be able to cast any type into its dictionary form."
delta-io/delta-rs#1445 (comment)

@tonyalaribe
Copy link
Contributor Author

I tried to condense the PR into the smallest code needed for the dictionary handling to work. what do you think?

With this implementation, we ignore the keys of the dictionary and focus on the values.

@tonyalaribe tonyalaribe requested a review from sunng87 April 14, 2025 08:39
@sunng87
Copy link
Member

sunng87 commented Apr 15, 2025

Thank you for the clarification. Sounds reasonable to me!

@sunng87 sunng87 merged commit 1c6b0d4 into datafusion-contrib:master Apr 15, 2025
6 checks passed
@sunng87
Copy link
Member

sunng87 commented Apr 15, 2025

By the way, because this is specific to delta-rs, I may add some modification for this in future. For example, make this opt-in via an option.

@tonyalaribe
Copy link
Contributor Author

By the way, because this is specific to delta-rs, I may add some modification for this in future. For example, make this opt-in via an option.

Yeah, that's a good idea, especially if it interferes with other setups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants