Dispatch for drawing multiples (#1985)
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master    #1985      +/-   ##
==========================================
- Coverage   86.44%   86.42%   -0.03%
==========================================
  Files         147      147
  Lines        8838     8898      +60
==========================================
+ Hits         7640     7690      +50
- Misses       1198     1208      +10
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
…y - fallback to rand(d)
@devmotion thanks a lot for the review.
No docs are added.
The last open item is the 10M cap for switching between algorithms. I'm thinking of proceeding with a similar if/else at 0.25% of the distribution. Similar (not ideal) patterns will be easy to identify and update in the future.
@devmotion thanks for the last comments.
Re-evaluated the benchmarks in the header.
Comments are resolved; this PR is ready for review.
As a general comment before any detailed review:
Can we test different numbers of samples? n = 10000 is a bit extreme. Can we also check n = 1, n = 5, n = 10, n = 50, n = 100, n = 500, n = 1000, etc.?
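The requested sweep could be scripted along these lines (a toy stand-in: plain `rand(rng, n)` instead of the PR's `rand(rng, d, n)`; BenchmarkTools would give more reliable timings than a single `@elapsed` call):

```julia
using Random

rng = Random.default_rng()
for n in (1, 5, 10, 50, 100, 500, 1000, 10_000)
    t = @elapsed rand(rng, n)   # stand-in for rand(rng, d, n)
    println("n = ", n, ": ", t, " s")
end
```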
It would be great to proceed with merging. @devmotion could you please check if there is anything essential to test/fix? Thanks!
rand(rng::AbstractRNG, d::MixtureModel{Univariate}) =
    rand(rng, component(d, rand(rng, d.prior)))

function rand(rng::AbstractRNG, d::MixtureModel{Univariate}, n::Int)
This is the wrong dispatch, isn't it? If one wants to draw multiple samples from a distribution d, Distributions automatically dispatches to drawing samples with sampler(d). In the case of mixture models this is MixtureSampler, so this should actually be defined for MixtureSampler{Univariate}, not MixtureModel{Univariate}, AFAICT?
However, generally for univariate distributions one also shouldn't define rand(rng, dist, n) but only the in-place method _rand!(rng, sampler(dist), out) (#1905 will fix possibly incorrectly allocated output arrays), if multiple samples can be generated more efficiently:
Distributions.jl/src/genericrand.jl, line 35 in f1ff9e8
Distributions.jl/src/univariates.jl, lines 141 to 150 in f1ff9e8
So AFAICT we should only define
- function rand(rng::AbstractRNG, d::MixtureModel{Univariate}, n::Int)
+ function _rand!(rng::AbstractRNG, d::MixtureSampler{Univariate}, x::AbstractArray{<:Real})
Alternatively, if we never want to use MixtureSampler for sampling MixtureModel{Univariate}, we should define sampler(d::MixtureModel{Univariate}) = d (or limit the definition below to sampler(d::MixtureModel{Multivariate}) = MixtureSampler(d)) and here define
- function rand(rng::AbstractRNG, d::MixtureModel{Univariate}, n::Int)
+ function _rand!(rng::AbstractRNG, d::MixtureModel{Univariate}, x::AbstractArray{<:Real})
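For readers unfamiliar with the dispatch chain under discussion, here is a minimal self-contained mock (the `Mini*`/`mini*` names are hypothetical stand-ins, not Distributions' actual API) of how `rand(rng, d, n)` allocates the output and forwards to `_rand!` on `sampler(d)`, which is why only the in-place method needs a specialized definition:

```julia
using Random

abstract type MiniSampleable end

struct MiniMixture <: MiniSampleable end        # stands in for MixtureModel{Univariate}
struct MiniMixtureSampler <: MiniSampleable end # stands in for MixtureSampler{Univariate}

minisampler(::MiniMixture) = MiniMixtureSampler()

# Generic entry point: allocate once, then fill in place via the sampler.
function minirand(rng::AbstractRNG, d::MiniSampleable, n::Int)
    out = Vector{Float64}(undef, n)
    return _minirand!(rng, minisampler(d), out)
end

# The only method one needs to specialize for efficient multi-sample draws.
function _minirand!(rng::AbstractRNG, ::MiniMixtureSampler, x::AbstractArray{<:Real})
    rand!(rng, x)  # placeholder for the real mixture sampling logic
    return x
end

x = minirand(Random.default_rng(), MiniMixture(), 5)
```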
# Find the component with the maximum count to minimize resizing
max_count_idx = argmax(counts)
max_count = counts[max_count_idx]

# Sample from the component with maximum count first and use it directly
x = rand(rng, component(d, max_count_idx), max_count)

# Resize to the full size and continue with other components
resize!(x, n)
offset = max_count

for i in eachindex(counts)
    if i != max_count_idx
        ni = counts[i]
        if ni > 0
            c = component(d, i)
            last_offset = offset + ni - 1
            rand!(rng, c, @view(x[(begin+offset):(begin+last_offset)]))
            offset = last_offset + 1
        end
    end
end
For the in-place method, it seems this could be simplified to
- # Find the component with the maximum count to minimize resizing
- max_count_idx = argmax(counts)
- max_count = counts[max_count_idx]
- # Sample from the component with maximum count first and use it directly
- x = rand(rng, component(d, max_count_idx), max_count)
- # Resize to the full size and continue with other components
- resize!(x, n)
- offset = max_count
- for i in eachindex(counts)
-     if i != max_count_idx
-         ni = counts[i]
-         if ni > 0
-             c = component(d, i)
-             last_offset = offset + ni - 1
-             rand!(rng, c, @view(x[(begin+offset):(begin+last_offset)]))
-             offset = last_offset + 1
-         end
-     end
- end
+ offset = 0
+ for (c, ni) in zip(components(d), counts)
+     last_offset = offset + ni - 1
+     rand!(rng, c, @view(x[(begin+offset):(begin+last_offset)]))
+     offset = last_offset + 1
+ end
end
end
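The overall count-then-fill strategy can also be sketched in a self-contained way (all names here are hypothetical stand-ins, not Distributions' API): draw per-component counts once via categorical draws, then fill the output array in contiguous blocks, one component at a time:

```julia
using Random

# `samplers` are stand-in per-component collections to draw from; `weights`
# are the mixture weights (must sum to 1).
function mixture_rand!(rng::AbstractRNG, samplers, weights::Vector{Float64},
                       x::AbstractVector)
    counts = zeros(Int, length(weights))
    cw = cumsum(weights)
    for _ in eachindex(x)                 # one categorical draw per sample
        counts[searchsortedfirst(cw, rand(rng))] += 1
    end
    offset = 0
    for (s, ni) in zip(samplers, counts)  # fill a contiguous block per component
        rand!(rng, @view(x[(begin + offset):(begin + offset + ni - 1)]), s)
        offset += ni
    end
    return x
end

# Toy usage: three "components" drawing integers from disjoint ranges.
x = mixture_rand!(Random.default_rng(), (1:10, 11:20, 21:30),
                  [0.5, 0.25, 0.25], Vector{Int}(undef, 100))
```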
function rand(rng::AbstractRNG, d::Truncated, n::Int)
- function rand(rng::AbstractRNG, d::Truncated, n::Int)
+ function _rand!(rng::AbstractRNG, d::Truncated, x::AbstractArray{<:Real})
And if there are any precomputations that could be factored out (doesn't seem to be the case?), then we should think about defining a dedicated sampler.
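The dedicated-sampler pattern mentioned here can be illustrated with a toy sketch (all names hypothetical; the base distribution is Uniform(0, 1), whose quantile function is the identity): precompute `lcdf` and `tp` once at construction, then reuse them on every draw.

```julia
using Random

struct ToyTruncatedSampler
    lcdf::Float64  # cdf(d0, lower), precomputed once at construction
    tp::Float64    # cdf(d0, upper) - cdf(d0, lower), precomputed once
end

toy_quantile(p) = p  # quantile of Uniform(0, 1) is the identity

# Per-draw work only uses the precomputed fields.
draw(rng::AbstractRNG, s::ToyTruncatedSampler) = toy_quantile(s.lcdf + s.tp * rand(rng))

s = ToyTruncatedSampler(0.25, 0.5)  # truncate Uniform(0, 1) to [0.25, 0.75)
xs = [draw(Random.default_rng(), s) for _ in 1:100]
```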
if tp > 0.25
    # Regime 1: Rejection sampling with batch optimization
    # Get the correct type and memory by sampling from the untruncated distribution
    samples = rand(rng, d0, n)
- samples = rand(rng, d0, n)
+ rand!(rng, d0, x)
samples = rand(rng, d0, n)
n_collected = 0
max_batch = 0
batch_buffer = Vector{eltype(samples)}()
A separate batch buffer seems unnecessary; in particular, the resizing might be inefficient. Instead of copying from a separate vector, we could just use the output vector, move samples around there, and keep track of the last accepted index.
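The suggested buffer-free approach might look like this sketch (hypothetical names; `accept` stands in for the truncation bounds check): rejection-sample directly into the output vector, compact accepted draws to the front, and redraw only the rejected tail.

```julia
using Random

function rejection_fill!(rng::AbstractRNG, x::AbstractVector{Float64}, accept)
    n_accepted = 0
    while n_accepted < length(x)
        tail = @view(x[(begin + n_accepted):end])
        rand!(rng, tail)                      # redraw only the unfilled tail
        for i in eachindex(tail)
            if accept(tail[i])
                # Compact accepted values towards the front, in place; the
                # write position never passes the current read position.
                n_accepted += 1
                x[begin + n_accepted - 1] = tail[i]
            end
        end
    end
    return x
end

# Toy usage: accept uniform draws above 0.5 (acceptance probability 1/2).
x = rejection_fill!(Random.default_rng(), Vector{Float64}(undef, 1000), v -> v > 0.5)
```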
# Sample one value first to determine the correct type
sample_type = typeof(quantile(d0, d.lcdf + rand(rng) * d.tp))
samples = Vector{sample_type}(undef, n)
- # Sample one value first to determine the correct type
- sample_type = typeof(quantile(d0, d.lcdf + rand(rng) * d.tp))
- samples = Vector{sample_type}(undef, n)
sample_type = typeof(quantile(d0, d.lcdf + rand(rng) * d.tp))
samples = Vector{sample_type}(undef, n)
for i in 1:n
    samples[i] = quantile(d0, d.lcdf + rand(rng) * d.tp)
We should probably at least use a Random.Sampler for rand(rng); maybe it's even faster to call rand(rng, n) (despite the allocation).
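A sketch of hoisting a `Random.Sampler` out of the loop, plus the vectorized alternative (a toy example drawing uniform Float64s rather than the truncated quantiles above):

```julia
using Random

rng = Random.default_rng()

# Construct the sampler once; Random.Sampler precomputes whatever `rand`
# would otherwise set up on every call, so the loop only does the draw.
sp = Random.Sampler(rng, Float64)
vals = [rand(rng, sp) for _ in 1:1000]

# Alternative mentioned above: one vectorized call (allocates, but may be faster).
vals2 = rand(rng, 1000)
```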
# Sample one value first to determine the correct type
sample_type = typeof(invlogcdf(d0, logaddexp(d.loglcdf, d.logtp - randexp(rng))))
samples = Vector{sample_type}(undef, n)
- # Sample one value first to determine the correct type
- sample_type = typeof(invlogcdf(d0, logaddexp(d.loglcdf, d.logtp - randexp(rng))))
- samples = Vector{sample_type}(undef, n)
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
Closes #1984

Implementation
- MixtureModel
- Truncated

Checkpoints
- cdf(::Skellam, ::Real) (#1986)
- Sanity checks
- Speed
  - For mixture model
  - For truncated
- Visual