Support Dictionary types as they're logically equivalent to their value types. Needed for delta-rs partition columns. #65
Alternatively, we can treat only a Dictionary<UInt16, _> as the special case and leave other dictionary types to be encoded differently. Let me know which direction you prefer for the implementation.
@tonyalaribe Thank you for the patch! Could you give me some background about
Hi @sunng87, this happens because delta-rs transforms partition columns into Dictionaries. If a column is Utf8 in the original schema but is marked as a partition column, delta-rs internally changes the column type to Dictionary<UInt16, Utf8>. This is done for all partition columns except Boolean columns. Here's the code where this conversion is done in delta-rs:
Here's another relevant bit of conversation: "Arrow should be able to cast any type into its dictionary form."
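For readers less familiar with dictionary encoding, here is a minimal sketch of the idea, using only the Rust standard library rather than the Arrow API. The functions `dictionary_encode` and `dictionary_decode` are hypothetical names made up for illustration, and the `u16` key width is chosen only to mirror Dictionary<UInt16, Utf8>. This is not how delta-rs or Arrow implement it.

```rust
use std::collections::HashMap;

/// Hypothetical helper: split a string column into small integer keys plus
/// a list of distinct values, in the spirit of Dictionary<UInt16, Utf8>.
/// Assumption: the column has fewer than 65,536 distinct values.
fn dictionary_encode(column: &[&str]) -> (Vec<u16>, Vec<String>) {
    let mut values: Vec<String> = Vec::new();
    let mut index: HashMap<String, u16> = HashMap::new();
    let mut keys: Vec<u16> = Vec::with_capacity(column.len());

    for item in column {
        let key = match index.get(*item).copied() {
            // Value seen before: reuse its key.
            Some(k) => k,
            // New value: append it and hand out the next key.
            None => {
                let k = values.len() as u16;
                values.push(item.to_string());
                index.insert(item.to_string(), k);
                k
            }
        };
        keys.push(key);
    }
    (keys, values)
}

/// Hypothetical helper: rebuild the plain column by looking up each key.
fn dictionary_decode(keys: &[u16], values: &[String]) -> Vec<String> {
    keys.iter().map(|k| values[*k as usize].clone()).collect()
}
```

Decoding returns exactly the original strings, which is why a dictionary column can be served to a client as if it were a plain Utf8 column.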
I tried to condense the PR into the smallest code needed for the dictionary handling to work. What do you think? With this implementation, we ignore the keys of the dictionary and focus on the values.
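To make "ignore the keys and focus on the values" concrete, here is a rough sketch of the type-mapping idea. It is not the code from this PR: `DataType` below is a simplified stand-in for Arrow's `DataType`, and `pg_type_name` is a hypothetical helper invented for illustration.

```rust
/// Simplified stand-in for Arrow's `DataType` (assumption: only the
/// variants needed for this illustration are modelled).
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum DataType {
    Boolean,
    UInt16,
    Int32,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Hypothetical helper: choose a Postgres type name for an Arrow-like type.
fn pg_type_name(dt: &DataType) -> Option<&'static str> {
    match dt {
        DataType::Boolean => Some("bool"),
        DataType::Int32 => Some("int4"),
        DataType::Utf8 => Some("text"),
        // The key type is ignored; the value type decides the result.
        DataType::Dictionary(_key, value) => pg_type_name(value),
        // Left unmapped in this sketch.
        DataType::UInt16 => None,
    }
}
```

With this shape, a Dictionary<UInt16, Utf8> partition column resolves to the same Postgres type as a plain Utf8 column, and nested dictionaries are handled by the recursion.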
Thank you for the clarification. Sounds reasonable to me!
By the way, because this is specific to delta-rs, I may modify this in the future, for example by making it opt-in via an option.
Yeah, that's a good idea, especially if it interferes with other setups. |
Hi @sunng87, thanks for this really useful project you created.
At the moment, Dictionary types are not supported in datafusion-postgres.
But they're important since projects like delta-rs represent partition columns as Dictionary types, specifically Dictionary<UInt16, Utf8> for Utf8 columns.
So running a query via datafusion-postgres would result in an error such as:
In the datafusion codebase: https://github.com/apache/datafusion/blob/main/datafusion/common/src/dfschema.rs#L665
Dictionary types are treated as logically equal to their values:
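As a rough illustration of that rule (not a copy of the linked DataFusion code), the comparison could be sketched as follows. `DataType` is a simplified stand-in for Arrow's type enum and `logically_equal` is a hypothetical name.

```rust
/// Simplified stand-in for Arrow's `DataType` (assumption: only a few
/// variants are modelled for this illustration).
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum DataType {
    UInt16,
    Int32,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Hypothetical helper: compare two types while looking through
/// dictionary encoding on either side.
fn logically_equal(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        // Two dictionaries: compare their value types, ignore the keys.
        (DataType::Dictionary(_, v1), DataType::Dictionary(_, v2)) => {
            logically_equal(v1, v2)
        }
        // Dictionary on one side only: compare its value type to the other.
        (DataType::Dictionary(_, v), other) => logically_equal(v, other),
        (other, DataType::Dictionary(_, v)) => logically_equal(other, v),
        // No dictionary involved: fall back to plain equality.
        _ => a == b,
    }
}
```

Under this rule, Dictionary<UInt16, Utf8> compares equal to Utf8 but not to Int32, which matches the behavior this PR relies on.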
I'm not sure what the edge cases would be, i.e. what other Postgres types could be represented as a Dictionary type. I assume Postgres hstore would be encoded as a Map and not a Dictionary.
In any case, this PR resolves the error from parsing delta-rs partition columns through datafusion-postgres.