Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Jan 23, 2026

Rationale for this change

Null-type dictionaries (e.g., dictionary(int8(), null())) are valid Arrow constructs supported from day one, but the sorting code had an uncertain XXX Should this support Type::NA? comment. We should explicitly support and test this because other functions already support this:

import pyarrow as pa
import pyarrow.compute as pc

pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
    indices=pa.array([None, None, None, None], type=pa.int8()),
    dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]

I believe it does not make sense to specifically disallow this in dictionaries at this point.

What changes are included in this PR?

Added a unittest for null sorting behaviour.

Are these changes tested?

Yes, the unittest was added.

Are there any user-facing changes?

No.

@github-actions
Copy link

⚠️ GitHub issue #48954 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we add a test to validate it doesn't blow up or produce garbage even though there's no usefulness to it, right?
I was just trying to understand what does that even mean (the null-type dictionary sorting case).

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Jan 23, 2026
@HyukjinKwon
Copy link
Member Author

Ah yeah. If I am correct, this behaviour is not being specifically tested. Yes. so I am removing the question, and adding a regression test.

The nullability comparison/behaviour from:

pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int8()))

is already consistent with:

pc.array_sort_indices(pa.DictionaryArray.from_arrays(
    indices=pa.array([None, None, None, None], type=pa.int8()),
    dictionary=pa.array([], type=pa.null())
))

so I believe we don't necessarily have to change.

I am happy to remove the test or do something else if you have other preferences!

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Jan 26, 2026
for (const auto& index_type : all_dictionary_index_types()) {
ARROW_SCOPED_TRACE("index_type = ", index_type->ToString());
auto dict_type = dictionary(index_type, null());
auto dict_arr = DictArrayFromJSON(dict_type, "[null, null, null, null]", "[]");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, another way of encoding this is to have null values in the dictionary, perhaps test that too?
For example:

Suggested change
auto dict_arr = DictArrayFromJSON(dict_type, "[null, null, null, null]", "[]");
auto dict_arr = DictArrayFromJSON(dict_type, "[null, 0, 0, null]", "[null]");

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants