Sort tied categorical histogram frequencies alphabetically #684

Merged
kyraman merged 1 commit into awslabs:master from kyraman:master
Mar 25, 2026

kyraman commented Mar 25, 2026

Description of changes:

Add lexicographic (alphabetical) secondary sort as a tiebreaker when categorical values share the same frequency count in the Histogram analyzer.

Before this change, if two categories had the same count, their relative order was non-deterministic (it depended on Spark's internal partition layout). For example, given Iris species data where "Iris-versicolor" and "Iris-virginica" both appear 50 times, the two could come back in either order from one run to the next.

After this change, tied values are sorted alphabetically, so the output is always:

Iris-setosa       (51)
Iris-versicolor   (50)   alphabetically before Iris-virginica
Iris-virginica    (50)

This is in line with the predictable category ordering of tools like pandas value_counts() and R's table(); R's table(), for example, lists character categories in sorted order by default.
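
To illustrate the ordering rule, here is a minimal plain-Python sketch. It is not the analyzer's actual Spark implementation; the function name `sorted_histogram` and the input list are hypothetical and exist only for this example. It sorts by count descending and then by category name ascending, so the result does not depend on the order in which rows arrive.

```python
from collections import Counter


def sorted_histogram(values):
    """Return (category, count) pairs: highest count first, ties alphabetical."""
    counts = Counter(values)
    # Primary key: negated count (descending). Secondary key: category (ascending).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical input mirroring the counts in the example above,
# deliberately listed out of alphabetical order.
species = (
    ["Iris-virginica"] * 50
    + ["Iris-setosa"] * 51
    + ["Iris-versicolor"] * 50
)

for category, count in sorted_histogram(species):
    print(f"{category:<18}({count})")
```

Because the category name is part of the sort key, shuffling or reversing `species` produces the same ordering, which is the property the tiebreaker is meant to guarantee.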

Note: This change only affects the categorical Histogram analyzer. The numeric HistogramBinned analyzer, which uses equal-width bin edges, is unaffected.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

kyraman merged commit 51c5b70 into awslabs:master Mar 25, 2026
1 check passed