fix: Ensure ListingTable partitions are pruned when filters are not used #17958
Open: peasee wants to merge 4 commits into apache:main from peasee:peasee-patch-1

Do we have to worry about filters such as `<partition column> IS NULL`?

This is a good question, and it has made me open a little can of worms.

For the filtering case, it looks like we do not infer the filter for `<partition column> IS NULL`, and the column is defined as non-nullable when the listing table is built. We should probably fix this to support null columns, but we'll need to introduce some configuration for users to specify a null fallback value other than the default `__HIVE_DEFAULT_PARTITION__`.

For the non-filtering case, this PR will indeed match null partition column values in the current implementation, but because we don't have any special treatment of them, the value would be returned as the literal text `"__HIVE_DEFAULT_PARTITION__"`, for example. If your partition column is then set as an `Int32`, for example, the query will fail.

I think implementing proper support for nulls will need more work outside of this PR. Because we already define the column as non-nullable, what do you think about manually excluding `__HIVE_DEFAULT_PARTITION__` values from `parse_partitions_for_path` to prevent query errors like the one I describe above, until proper support for nulls is added? I can raise an issue and start working on it as well.

This won't help people with custom null fallback values, but it would help for all default cases.

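To make the proposed exclusion concrete, here is a minimal sketch of the idea. It is not DataFusion's actual `parse_partitions_for_path`; the function name `parse_partition_values`, its signature, and the path layout are hypothetical and only illustrate skipping files whose partition value is the default placeholder.

```rust
/// Placeholder directory value Hive writes for a partition key without a value.
const HIVE_DEFAULT_PARTITION: &str = "__HIVE_DEFAULT_PARTITION__";

/// Hypothetical helper (NOT DataFusion's `parse_partitions_for_path`).
///
/// Extracts the values for `partition_cols` from a Hive-style relative path
/// such as "year=2024/month=01/data.parquet". Returns `None` when the path
/// does not match the expected columns, or when any value is the default
/// partition placeholder, so such files are excluded from the scan.
fn parse_partition_values<'a>(
    relative_path: &'a str,
    partition_cols: &[&str],
) -> Option<Vec<&'a str>> {
    let mut segments = relative_path.split('/');
    let mut values = Vec::with_capacity(partition_cols.len());

    for col in partition_cols {
        // Each partition column must appear, in order, as `name=value`.
        let segment = segments.next()?;
        let (name, value) = segment.split_once('=')?;

        // Assumption: only the default placeholder is excluded; custom
        // null fallback values are not handled by this sketch.
        if name != *col || value == HIVE_DEFAULT_PARTITION {
            return None;
        }
        values.push(value);
    }

    Some(values)
}
```

With a check like this, a file under `year=__HIVE_DEFAULT_PARTITION__/` would simply not be listed, so the placeholder text would never reach the later conversion to a typed column such as `Int32`.
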
It looks like this is an existing issue that predates this PR as well: a non-filter scan (e.g. `SELECT <partition column> FROM blah`) will also match all files, including the null fallback partition.

It seems like reasonable behavior to me that `SELECT <partition column> FROM blah` would also return data without a value for the partition column (as a value of `NULL`).

But that being said, I am not sure what even defines the "expected" behavior in this case (e.g. what does Hive do in this case 🤔).

Yes, I should've been clearer that the `SELECT <partition column> FROM blah` works, but it returns the value as `__HIVE_DEFAULT_PARTITION__`, for example. We don't convert them to `NULL` yet, which I think we probably ought to.

Hive itself seems not to convert the value at all, as suggested by this open issue. I struggled to find direct references to `HIVE_DEFAULT_PARTITION` in the Hive docs; I only came across this ingestion guide, which explains how the Hive writer rewrites `NULL` into the `HIVE_DEFAULT_PARTITION` string.

I guess it is up to the implementer to decide what to do with that, because a text partition on a partition column you expect to be an integer could be annoying 😕 It looks like Impala decided to treat them as `NULL` values: IMPALA-252.

I found the Impala ticket via this Spark issue and the associated GitHub PR apache/spark#17277, which also had an interesting discussion about this.

The side effect both of these tickets mention is that Hive does not differentiate between an empty string and a null value, so there is no way to tell which one a `HIVE_DEFAULT_PARTITION` is.

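The ambiguity can be shown with a small sketch. Both functions below are hypothetical illustrations, not Hive, Impala, or DataFusion code. The writer side assumes, as the tickets describe, that NULL and the empty string collapse into the same placeholder; the reader side follows the Impala-style choice of decoding the placeholder as `NULL` for an integer partition column.

```rust
use std::num::ParseIntError;

/// Placeholder directory value used for a partition key without a value.
const HIVE_DEFAULT_PARTITION: &str = "__HIVE_DEFAULT_PARTITION__";

/// Hypothetical writer-side encoding: NULL (`None`) and the empty string
/// both become the placeholder, so the distinction is lost on disk.
fn encode_partition_value(value: Option<&str>) -> &str {
    match value {
        None | Some("") => HIVE_DEFAULT_PARTITION,
        Some(v) => v,
    }
}

/// Hypothetical reader-side decoding for an integer partition column:
/// the placeholder is read back as NULL (`None`) instead of being parsed,
/// which avoids failing on non-numeric text.
fn decode_int32_partition(raw: &str) -> Result<Option<i32>, ParseIntError> {
    if raw == HIVE_DEFAULT_PARTITION {
        return Ok(None);
    }
    raw.parse::<i32>().map(Some)
}
```

Because `encode_partition_value(None)` and `encode_partition_value(Some(""))` yield the same string, a reader can choose only one interpretation; this sketch picks `NULL`.
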