Skip to content

Generalize JointOrderStatistics to discrete distributions#2038

Open
sethaxen wants to merge 9 commits intoJuliaStats:masterfrom
sethaxen:joint_order_stats_discrete
Open

Generalize JointOrderStatistics to discrete distributions#2038
sethaxen wants to merge 9 commits intoJuliaStats:masterfrom
sethaxen:joint_order_stats_discrete

Conversation

@sethaxen
Copy link
Contributor

Currently JointOrderStatistics is restricted to wrapping only continuous distributions. This is because working with discrete order statistics is much more complicated due to the need to handle possible ties between unobserved order stats and observed ones. This PR adds rand and logpdf for joint distributions of arbitrary discrete order statistics. Fixes #1839

Notes

  • rand more or less uses the same implementation as for continuous. This comes from the observations:
    1. we can sample from a discrete distribution through passing a uniform draw through the quantile function
    2. the quantile function is non-decreasing, so if we have a sample of uniform order statistics, passing each of these through a discrete quantile function produces a sample of discrete order statistics with the potential ties.
  • logpdf is based on the observation that if we observe all ranks, then the distribution is multinomial. It then explicitly marginalizes out all unobserved ranks. This can be formulated as a sequence of matrix-vector products, but based on the observation that each matrix is Hankel, these products are all discrete cross-correlations. These are performed using log-sum-exp for numerical stability.
  • The computational complexity of logpdf is quadratic in the lengths of the "gaps" (a contiguous sequence of unobserved ranks). This can be prohibitive for large n:
julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^2, (1, 10^2)), [1, 9])
  0.000052 seconds (17 allocations: 2.953 KiB)
-7.129051018238783

julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^3, (1, 10^3)), [1, 9])
  0.003452 seconds (20 allocations: 24.117 KiB)
-30.68365832341351

julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^4, (1, 10^4)), [1, 9])
  0.308833 seconds (20 allocations: 235.055 KiB)
-286.87973020997015

julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^5, (1, 10^5)), [1, 9])
 28.671804 seconds (20 allocations: 2.289 MiB)
-2866.023878657026

Since the expensive operation is a discrete cross-correlation, we could instead use an FFT to make the method O(g_i log(g_i)) in the gap sizes g_i. However, care needs to be taken to avoid numerical underflow. I've tested the aFFT-C method in https://doi.org/10.1016/j.csda.2016.03.010 for this purpose, which only requires access to an fft method. I'm happy to add that to this PR or a later PR. I'm not certain if there's a way to use direct cross-correlation unless an AbstractFFTs backend is in scope; just adding an AbstractFFTs extension with the functionality could work, but if no backend is in scope, then AbstractFFTs just stackoverflows, which is not very user-friendly.

```
"""
struct JointOrderStatistics{
D<:ContinuousUnivariateDistribution,R<:Union{AbstractVector{Int},Tuple{Int,Vararg{Int}}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this change to the signature would be breaking. Should that block this PR?

Copy link
Contributor Author

@sethaxen sethaxen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For logpdf computation, I worked out how to compute the product of the accumulated vector with the product of matrices coming from adjacent tie terms and gap terms. Previously logpdf eval was O(g_0 g_1 + g_1^2 + g_1 g_2 + ... + g_{m-1} g_m) (where g_i is the size of the gap before block of observed ranks i) Now it's O(g_0 g_1 + g_1 g_2 + ... + g_{m-1} g_m), i.e. we do about half of the work. A particular case that this greatly accelerates is the one where m=2 and g_0,g_2 << g_1, which includes the poorly-scaling example in the OP, which now scales linearly in the large gap size instead of exponentially:

julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^2, (1, 10^2)), [1, 9])
  0.000019 seconds (15 allocations: 2.125 KiB)
-7.129051018238897

julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^3, (1, 10^3)), [1, 9])
  0.000069 seconds (17 allocations: 16.234 KiB)
-30.683658323405325

julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^4, (1, 10^4)), [1, 9])
  0.000474 seconds (17 allocations: 156.859 KiB)
-286.87973021515063

julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^5, (1, 10^5)), [1, 9])
  0.004534 seconds (17 allocations: 1.526 MiB)
-2866.023877162719

Note: the product of the two Hankel matrices is no longer Hankel, so we can't use cross-correlation to perform the matrix-vector products. Also, if use FFT acceleration, we'll want to keep the matrix products separate.

return logh
end

"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is currently never used, but we need it if we use FFT acceleration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with the _log_xcorr_exp! function below.

@codecov-commenter
Copy link

codecov-commenter commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 88.65248% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.48%. Comparing base (bd05c70) to head (6d0c78f).

Files with missing lines Patch % Lines
src/multivariate/jointorderstatistics.jl 88.65% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2038      +/-   ##
==========================================
+ Coverage   86.46%   86.48%   +0.02%     
==========================================
  Files         147      147              
  Lines        8837     8971     +134     
==========================================
+ Hits         7641     7759     +118     
- Misses       1196     1212      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sethaxen
Copy link
Contributor Author

sethaxen commented Mar 3, 2026

@devmotion would you like to review this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ERROR: MethodError: no method matching JointOrderStatistics(::DiscreteNonParametric{Float64, Float64, Vector{Float64}, Vector{Float64}}, ::Int64)

2 participants