Sort tied categorical histogram frequencies alphabetically by kyraman · Pull Request #684 · awslabs/deequ

kyraman · 2026-03-25T15:01:27Z

Description of changes:

Add a lexicographic (alphabetical) secondary sort as a tiebreaker when categorical values share the same frequency count in the Histogram analyzer.

Before this change, if two categories had the same count, their relative order was non-deterministic (it depended on Spark's internal partition layout). For example, given Iris species data where "Iris-versicolor" and "Iris-virginica" both appear 50 times, the result could list them in either order from one run to the next.

After this change, tied values are sorted alphabetically, so the output is always:

Iris-setosa       (51)
Iris-versicolor   (50)   <- alphabetically before Iris-virginica
Iris-virginica    (50)
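
The ordering rule can be sketched in plain Python. This is only an illustration of the tiebreak logic, not the deequ implementation (which is Scala running on Spark); the function name `sort_histogram` is hypothetical, and descending order by count is assumed from the example output.

```python
from collections import Counter


def sort_histogram(values):
    """Return (category, count) pairs: highest count first, ties alphabetical."""
    counts = Counter(values)
    # Sort key: the negated count gives descending frequency; the category
    # string itself then breaks ties in lexicographic order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative data with the same counts as the example; the input order is
# deliberately shuffled to show that it does not affect the result.
species = (
    ["Iris-virginica"] * 50
    + ["Iris-setosa"] * 51
    + ["Iris-versicolor"] * 50
)

for category, count in sort_histogram(species):
    print(f"{category:<18}({count})")
```

Because the sort key is a (count, name) pair rather than the count alone, the result no longer depends on the order in which rows are encountered. Note that Python compares strings by code point, so in this sketch uppercase letters sort before lowercase ones.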

This is similar in spirit to R's table(), which lists categories in sorted order, and it makes the output reproducible regardless of how the data is partitioned.

Note: This change only affects the categorical Histogram analyzer. The numeric HistogramBinned analyzer, which uses equal-width bin edges, is unaffected.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Sort tied categorical histogram frequencies alphabetically

45b733a

SamPom100 approved these changes Mar 25, 2026

kyraman merged commit 51c5b70 into awslabs:master Mar 25, 2026
1 check passed

kyraman merged 1 commit into awslabs:master from kyraman:master
