-
Notifications
You must be signed in to change notification settings - Fork 197
Add hsm_mode: Half-Sample Mode for continuous data #984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…JuliaStats#957) Adds hsm_mode() to provide a density-based alternative to mode() for continuous distributions. Handles NaN/Inf filtering, single-element and empty vectors, and edge cases with outliers. Type-stable and zero-allocation where possible. Includes 17 tests covering standard, extreme, and boundary conditions. Based on Bickel & Fruehwirth (2006)
The hsm_mode function was added in the previous commit but wasn't included in the documentation, causing the CI build to fail with a missing docstring error. This commit adds hsm_mode to the Mode and Modes section of the scalar statistics documentation.
|
Was this PR AI-generated? |
Changed window size calculation from floor(n/2) to ceil(n/2) to match the Robertson-Cryer algorithm specification. The half-sample mode should examine windows containing at least half the observations. Updated test expectations to match the corrected algorithm: - [1,2,3] now returns 1.5 (was 1.0) - [-5,-3,-1,0,1,3,5] now returns -0.5 (was -1.0) - [1.0097, 1.0054, ...] now returns ~1.00431 (was ~1.00003) All tests pass. Type preservation (Float32 -> Float32, Int -> Float64) works correctly.
|
The implementation, test and commits are mine. I reviewed existing StatsBase prs to match project conventions, and I used AI assistance only for wording clarity in the pr description and for general performance review and bugs made by me. it was not used for algorithm or code . since in ci pipeline documentation issues which I totally forgot was missing I am actively fixing that like using ceil(len/2) as per the definition. I am eager to explain any design or choices in detail |
I feel like the existing behavior (for floating-point numbers) is a bug. If your data are not integers, we must assume that they come from a continuous distribution and use the HSM algorithm or another estimator for continuous data. Or perhaps introduce a keyword argument: mode(x::AbstractVector{<:AbstractFloat}; discrete::Bool=false) =
discrete ? mode_discrete(x) : mode_hsm(x)
mode(x::AbstractVector) = mode_discrete(x)Here, I'm not a contributor to Distributions.jl, though (although I'd like to become a contributor; my complete PR with hyperbolic distributions seems to have joined its peers, unfortunately), that's just my opinion. |
|
I think using frequency-based mode for floats can be misleading. But making HSM the default for AbstractFloat might break cases where floats are actually categories or rounded values. Maybe a keyword like discrete=false or a dispatch mode(x, HSM()) works better. I can change the API like that if the maintainers think it’s right. |
|
@ForceBru Here's the evidence for why
As you can see Design:
Users can explicitly choose the right tool. No silent behavior changes. So are you okay with this separate function approach or shall we explore the keyword argument variant further ? |
|
All tests pass, edge cases handled, documentation added. @devmotion are there any remaining technical concerns with hsm_mode() or its API before approval? |

Summary
This PR adds hsm_mode(), an implementation of the half-sample mode (HSM) which is a robust estimator of the mode for continuous distributions.It is introduced as a separate function ( not an overload of mode() ) to preserve existing behaviour while providing a statistically meaningful alternative for continuous data
This addresses and closes issue #957.
Motivation
StatsBase.mode() is frequency-based and works well for discrete data. For continuous distributions, however, samples are usually unique, which makes frequency counts unstable and highly variable in practiceIssue #957 documents this behaviour, particularly for heavy-tailed distributions, where mode() can show extreme variance.
This PR provides an estimator designed specifically for continuous data
Approach
hsm_mode() implements the standard half-sample method described in the literature:Non-finite values (NaN, Inf) are filtered
The data are sorted
The algorithm repeatedly selects the contiguous half-sample with the smallest width
Once ≤ 2 points remain, the midpoint of the final interval is returned
The midpoint may not be a sample value, but provides a stable estimate of the location of highest density.
Time complexity is dominated by sorting (O(n log n)); space complexity is O(n).
After sorting, the contraction loop operates on SubArray views to avoid allocations.
API Design
The estimator is exposed as a new function:
hsm_mode(x::AbstractVector{T}) where T<:Real
It is NOT added as an overload of mode() in order to:
avoid changing existing semantics
clearly distinguish frequency-based and density-based estimation
let users choose the appropriate method explicitly
The return type is an AbstractFloat, promoted from the input element type (e.g. integers → Float64, Float32 → Float32).
Testing and Documentation
Tests cover basic correctness, edge cases, robustness to outliers, handling of non-finite values, and type behavior. All tests pass.
The docstring explains intended use cases, compares with mode(), documents complexity, provides examples, and cites the relevant literature.
References
Robertson & Cryer (1974), JASA
Bickel & Fruehwirth (2006), CSDA
Notes
This PR is intentionally small and focused. Extensions such as weighted HSM or support for missing values are left for future work.
I have attached images showing that the tests are passing, and I would like to know whether I should address the existing warnings in the codebase.
Images:






Feedback on naming or API placement is welcome.