Add hsm_mode: Half-Sample Mode for continuous data #984

anurag-mds · 2025-12-27T09:12:39Z

Summary

This PR adds hsm_mode(), an implementation of the half-sample mode (HSM) which is a robust estimator of the mode for continuous distributions.

It is introduced as a separate function ( not an overload of mode() ) to preserve existing behaviour while providing a statistically meaningful alternative for continuous data

This addresses and closes issue #957.

Motivation

StatsBase.mode() is frequency-based and works well for discrete data. For continuous distributions, however, samples are usually unique, which makes frequency counts unstable and highly variable in practice

Issue #957 documents this behaviour, particularly for heavy-tailed distributions, where mode() can show extreme variance.

This PR provides an estimator designed specifically for continuous data

Approach

hsm_mode() implements the standard half-sample method described in the literature:

Non-finite values (NaN, Inf) are filtered
The data are sorted
The algorithm repeatedly selects the contiguous half-sample with the smallest width
Once ≤ 2 points remain, the midpoint of the final interval is returned
The midpoint may not be a sample value, but provides a stable estimate of the location of highest density.
Time complexity is dominated by sorting (O(n log n)); space complexity is O(n).
After sorting, the contraction loop operates on SubArray views to avoid allocations.

API Design

The estimator is exposed as a new function:

hsm_mode(x::AbstractVector{T}) where T<:Real

It is NOT added as an overload of mode() in order to:

avoid changing existing semantics
clearly distinguish frequency-based and density-based estimation
let users choose the appropriate method explicitly

The return type is an AbstractFloat, promoted from the input element type (e.g. integers → Float64, Float32 → Float32).

Testing and Documentation

Tests cover basic correctness, edge cases, robustness to outliers, handling of non-finite values, and type behavior. All tests pass.

The docstring explains intended use cases, compares with mode(), documents complexity, provides examples, and cites the relevant literature.

References

Robertson & Cryer (1974), JASA

Bickel & Fruehwirth (2006), CSDA

Notes

This PR is intentionally small and focused. Extensions such as weighted HSM or support for missing values are left for future work.

I have attached images showing that the tests are passing, and I would like to know whether I should address the existing warnings in the codebase.

Images:

Feedback on naming or API placement is welcome.

…JuliaStats#957) Adds hsm_mode() to provide a density-based alternative to mode() for continuous distributions. Handles NaN/Inf filtering, single-element and empty vectors, and edge cases with outliers. Type-stable and zero-allocation where possible. Includes 17 tests covering standard, extreme, and boundary conditions. Based on Bickel & Fruehwirth (2006)

… made by me.

The hsm_mode function was added in the previous commit but wasn't included in the documentation, causing the CI build to fail with a missing docstring error. This commit adds hsm_mode to the Mode and Modes section of the scalar statistics documentation.

devmotion · 2025-12-27T11:01:55Z

Was this PR AI-generated?

Changed window size calculation from floor(n/2) to ceil(n/2) to match the Robertson-Cryer algorithm specification. The half-sample mode should examine windows containing at least half the observations. Updated test expectations to match the corrected algorithm: - [1,2,3] now returns 1.5 (was 1.0) - [-5,-3,-1,0,1,3,5] now returns -0.5 (was -1.0) - [1.0097, 1.0054, ...] now returns ~1.00431 (was ~1.00003) All tests pass. Type preservation (Float32 -> Float32, Int -> Float64) works correctly.

anurag-mds · 2025-12-27T12:10:38Z

The implementation, test and commits are mine. I reviewed existing StatsBase prs to match project conventions, and I used AI assistance only for wording clarity in the pr description and for general performance review and bugs made by me. it was not used for algorithm or code . since in ci pipeline documentation issues which I totally forgot was missing I am actively fixing that like using ceil(len/2) as per the definition. I am eager to explain any design or choices in detail

ForceBru · 2025-12-28T00:06:10Z

introduced as a separate function ( not an overload of mode() ) to preserve existing behaviour

I feel like the existing behavior (for floating-point numbers) is a bug. If your data are not integers, we must assume that they come from a continuous distribution and use the HSM algorithm or another estimator for continuous data.

Or perhaps introduce a keyword argument:

mode(x::AbstractVector{<:AbstractFloat}; discrete::Bool=false) =
  discrete ? mode_discrete(x) : mode_hsm(x)

mode(x::AbstractVector) = mode_discrete(x)

Here, mode_discrete is what's currently called mode. The keyword argument lets one use the high-variance estimator for AbstractFloat.

I'm not a contributor to Distributions.jl, though (although I'd like to become a contributor; my complete PR with hyperbolic distributions seems to have joined its peers, unfortunately), that's just my opinion.

anurag-mds · 2025-12-28T06:08:38Z

I think using frequency-based mode for floats can be misleading.

But making HSM the default for AbstractFloat might break cases where floats are actually categories or rounded values.

Maybe a keyword like discrete=false or a dispatch mode(x, HSM()) works better. I can change the API like that if the maintainers think it’s right.

anurag-mds · 2025-12-28T07:14:36Z

@ForceBru
You have made an excellent point about API design consistency. You're right
that distinguishing frequency-based vs density-based estimation matters.

Here's the evidence for why hsm_mode() should remain separate:

As you can see
Insight: Frequency-based mode() counts collisions on continuous
data where samples are usually unique. The "most frequent" element becomes
arbitrary and order-dependent. HSM finds the actual density peak instead.

Design:

mode(): Frequency-based, correct for discrete data [keep as-is]
hsm_mode(): Density-based, essential for continuous data [separate]

Users can explicitly choose the right tool. No silent behavior changes.

So are you okay with this separate function approach or shall we explore the keyword argument variant further ?

anurag-mds · 2025-12-29T20:58:20Z

All tests pass, edge cases handled, documentation added. @devmotion are there any remaining technical concerns with hsm_mode() or its API before approval?

= added 3 commits December 27, 2025 11:35

Fixed some discrepancies by fixing typos and removing temporary files…

3de6546

… made by me.

Add hsm_mode to documentation

727c530

The hsm_mode function was added in the previous commit but wasn't included in the documentation, causing the CI build to fail with a missing docstring error. This commit adds hsm_mode to the Mode and Modes section of the scalar statistics documentation.

anurag-mds closed this Dec 27, 2025

anurag-mds reopened this Dec 27, 2025

Merge branch 'master' into hsm-mode

e086cc3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add hsm_mode: Half-Sample Mode for continuous data #984

Add hsm_mode: Half-Sample Mode for continuous data #984

Uh oh!

anurag-mds commented Dec 27, 2025

Uh oh!

devmotion commented Dec 27, 2025

Uh oh!

anurag-mds commented Dec 27, 2025

Uh oh!

ForceBru commented Dec 28, 2025

Uh oh!

anurag-mds commented Dec 28, 2025

Uh oh!

anurag-mds commented Dec 28, 2025 •

edited

Loading

Uh oh!

anurag-mds commented Dec 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add hsm_mode: Half-Sample Mode for continuous data #984

Are you sure you want to change the base?

Add hsm_mode: Half-Sample Mode for continuous data #984

Uh oh!

Conversation

anurag-mds commented Dec 27, 2025

Summary

Motivation

Approach

API Design

Testing and Documentation

References

Notes

I have attached images showing that the tests are passing, and I would like to know whether I should address the existing warnings in the codebase.

Uh oh!

devmotion commented Dec 27, 2025

Uh oh!

anurag-mds commented Dec 27, 2025

Uh oh!

ForceBru commented Dec 28, 2025

Uh oh!

anurag-mds commented Dec 28, 2025

Uh oh!

anurag-mds commented Dec 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anurag-mds commented Dec 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

anurag-mds commented Dec 28, 2025 •

edited

Loading