Generalize JointOrderStatistics to discrete distributions#2038
Generalize JointOrderStatistics to discrete distributions#2038sethaxen wants to merge 9 commits intoJuliaStats:masterfrom
Conversation
| ``` | ||
| """ | ||
| struct JointOrderStatistics{ | ||
| D<:ContinuousUnivariateDistribution,R<:Union{AbstractVector{Int},Tuple{Int,Vararg{Int}}} |
There was a problem hiding this comment.
Technically this change to the signature would be breaking. Should that block this PR?
There was a problem hiding this comment.
For logpdf computation, I worked out how to compute the product of the accumulated vector with the product of matrices coming from adjacent tie terms and gap terms. Previously logpdf eval was O(g_0 g_1 + g_1^2 + g_1 g_2 + ... + g_{m-1} g_m) (where g_i is the size of the gap before block of observed ranks i) Now it's O(g_0 g_1 + g_1 g_2 + ... + g_{m-1} g_m), i.e. we do about half of the work. A particular case that this greatly accelerates is the one where m=2 and g_0,g_2 << g_1, which includes the poorly-scaling example in the OP, which now scales linearly in the large gap size instead of exponentially:
julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^2, (1, 10^2)), [1, 9])
0.000019 seconds (15 allocations: 2.125 KiB)
-7.129051018238897
julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^3, (1, 10^3)), [1, 9])
0.000069 seconds (17 allocations: 16.234 KiB)
-30.683658323405325
julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^4, (1, 10^4)), [1, 9])
0.000474 seconds (17 allocations: 156.859 KiB)
-286.87973021515063
julia> @time logpdf(JointOrderStatistics(Binomial(10, 0.3), 10^5, (1, 10^5)), [1, 9])
0.004534 seconds (17 allocations: 1.526 MiB)
-2866.023877162719Note: the product of the two Hankel matrices is no longer Hankel, so we can't use cross-correlation to perform the matrix-vector products. Also, if use FFT acceleration, we'll want to keep the matrix products separate.
| return logh | ||
| end | ||
|
|
||
| """ |
There was a problem hiding this comment.
This function is currently never used, but we need it if we use FFT acceleration.
There was a problem hiding this comment.
Same with the _log_xcorr_exp! function below.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2038 +/- ##
==========================================
+ Coverage 86.46% 86.48% +0.02%
==========================================
Files 147 147
Lines 8837 8971 +134
==========================================
+ Hits 7641 7759 +118
- Misses 1196 1212 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@devmotion would you like to review this? |
Currently
JointOrderStatisticsis restricted to wrapping only continuous distributions. This is because working with discrete order statistics is much more complicated due to the need to handle possible ties between unobserved order stats and observed ones. This PR addsrandandlogpdffor joint distributions of arbitrary discrete order statistics. Fixes #1839Notes
randmore or less uses the same implementation as for continuous. This comes from the observations:logpdfis based on the observation that if we observe all ranks, then the distribution is multinomial. It then explicitly marginalizes out all unobserved ranks. This can be formulated as a sequence of matrix-vector products, but based on the observation that each matrix is Hankel, these products are all discrete cross-correlations. These are performed using log-sum-exp for numerical stability.logpdfis quadratic in the lengths of the "gaps" (a contiguous sequence of unobserved ranks). This can be prohibitive for large n:Since the expensive operation is a discrete cross-correlation, we could instead use an FFT to make the method
O(g_i log(g_i))in the gap sizesg_i. However, care needs to be taken to avoid numerical underflow. I've tested the aFFT-C method in https://doi.org/10.1016/j.csda.2016.03.010 for this purpose, which only requires access to anfftmethod. I'm happy to add that to this PR or a later PR. I'm not certain if there's a way to use direct cross-correlation unless an AbstractFFTs backend is in scope; just adding anAbstractFFTsextension with the functionality could work, but if no backend is in scope, thenAbstractFFTsjust stackoverflows, which is not very user-friendly.