Dispatch for drawing multiples#1985

Open
mmikhasenko wants to merge 16 commits into JuliaStats:master from mmikhasenko:rand-n-dispatch

Conversation

@mmikhasenko mmikhasenko commented Jun 17, 2025

Closes #1984

Implementation

  • Dispatch for MixtureModel
  • Dispatch for Truncated

Checkpoints

  • Sanity checks
  • Speed

For the mixture model:

julia> using Distributions
julia> using BenchmarkTools

julia> d = MixtureModel([Normal(1,0.1), Normal(2,0.1), Normal(3,0.1)], [0.3,0.3,0.4])
julia> @btime rand($d, 10_000);
  73.708 μs (7 allocations: 125.67 KiB)

julia> @btime [rand($d) for _ in 1:10_000];
  124.125 μs (3 allocations: 96.06 KiB)

For the truncated distribution:

t = truncated(Normal(), 0, 3)
@btime rand($t, 10_000);  # 114.625 μs (6 allocations: 256.12 KiB)
@btime [rand($t) for _ in 1:10_000];  # 169.875 μs (3 allocations: 96.06 KiB)

Visual

using Plots  # stephist comes from Plots.jl
d = MixtureModel([Normal(1,0.1), Normal(2,0.1), Normal(3,0.1)], [0.3,0.3,0.4])
s1 = [rand(d) for _ in 1:10000]
s2 = rand(d, 10000)
let
    plot()
    stephist!(s1, bins=range(0, 4, 100), lab="rand(d) n times")
    stephist!(s2, bins=range(0, 4, 100), lab="rand(d,n)")
end

[figure: overlaid step histograms of s1 and s2, which agree]

t = truncated(Normal(), 0, 3)
r1 = [rand(t) for _ in 1:10000]
r2 = rand(t, 10000)
let
    plot()
    stephist!(r1, bins=range(0, 4, 100), lab="rand(t) n times")
    stephist!(r2, bins=range(0, 4, 100), lab="rand(t,n)")
end

[figure: overlaid step histograms of r1 and r2, which agree]

@mmikhasenko mmikhasenko marked this pull request as draft June 17, 2025 20:52

codecov-commenter commented Jun 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.42%. Comparing base (f1ff9e8) to head (ad53e2d).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1985      +/-   ##
==========================================
- Coverage   86.44%   86.42%   -0.03%     
==========================================
  Files         147      147              
  Lines        8838     8898      +60     
==========================================
+ Hits         7640     7690      +50     
- Misses       1198     1208      +10     


Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
mmikhasenko and others added 5 commits June 18, 2025 16:00

mmikhasenko commented Jun 18, 2025

@devmotion thanks a lot for the review.

  • The questions should be addressed.
  • Truncated distributions with small mass: I've fixed the broken pipeline for heavily truncated distributions with little remaining probability mass. It now falls back to the scalar rand path when more than 10M draws would be needed; this cutoff is somewhat arbitrary, so suggestions are welcome.
  • The docstrings are removed.
  • Added sanity checks comparing samples from rand(d) and rand(d, n).

No docs are added
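For reference, the sanity check in the last bullet could be sketched like this (illustrative code, not the PR's test suite; the tolerances are arbitrary):

```julia
using Distributions, Random, Statistics

# Illustrative check: the scalar path and the vectorized path should
# produce statistically indistinguishable samples.
d = MixtureModel([Normal(1, 0.1), Normal(2, 0.1), Normal(3, 0.1)], [0.3, 0.3, 0.4])
Random.seed!(1234)
s1 = [rand(d) for _ in 1:100_000]  # rand(d) called n times
s2 = rand(d, 100_000)              # rand(d, n)

# Means and standard deviations should agree within Monte Carlo error.
@assert isapprox(mean(s1), mean(s2); atol=0.02)
@assert isapprox(std(s1), std(s2); atol=0.02)
```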

@mmikhasenko

The last item I'm thinking about is the 10M cap for switching between algorithms.
Switching between algorithms is generally bad for predictability, so the design of Truncated with such a switch is not ideal.
But we will not address that in this PR.

I'm planning to proceed with a similar if/else on the 0.25 remaining-mass threshold of the distribution. Similar (not-ideal) patterns will be easy to identify and update in the future.
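For concreteness, a hypothetical sketch of that kind of threshold switch (names and structure are illustrative, not the actual PR code):

```julia
using Distributions, Random

# Hypothetical regime switch for sampling a truncated distribution.
# d0 is the untruncated base distribution; [lower, upper] the truncation bounds.
function rand_truncated_sketch(rng, d0, lower, upper, n)
    tp = cdf(d0, upper) - cdf(d0, lower)  # probability mass inside the bounds
    if tp > 0.25
        # Regime 1: rejection sampling, cheap when most of the mass survives.
        out = Float64[]  # element type hardcoded for this sketch
        while length(out) < n
            batch = rand(rng, d0, max(n, 100))
            append!(out, filter(x -> lower <= x <= upper, batch))
        end
        return resize!(out, n)
    else
        # Regime 2: inverse-CDF sampling, robust when little mass remains.
        lcdf = cdf(d0, lower)
        return [quantile(d0, lcdf + rand(rng) * tp) for _ in 1:n]
    end
end
```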

@mmikhasenko

@devmotion thanks for the last comments.

  • The truncated case was easy to fix.
  • I've switched the mixture model code to a rand + resize strategy.

@mmikhasenko mmikhasenko marked this pull request as ready for review June 23, 2025 19:31
@mmikhasenko

Re-evaluated the benchmarks in the header;
the rand(d, n) version is faster for both models.

@mmikhasenko

Comments are resolved; this PR is ready for review.

@devmotion

As a general comment before any detailed review:

the rand-n version is faster for both models

Can we test different numbers of samples? n = 10000 is a bit extreme. Can we check n = 1, n = 5, n = 10, n = 50, n = 100, n = 500, n = 1000, etc. as well?
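Such a sweep could be scripted along these lines (assuming BenchmarkTools.jl; timings are machine-dependent):

```julia
using Distributions, BenchmarkTools, Printf

d = MixtureModel([Normal(1, 0.1), Normal(2, 0.1), Normal(3, 0.1)], [0.3, 0.3, 0.4])
for n in (1, 5, 10, 50, 100, 500, 1000, 10_000)
    t_vec = @belapsed rand($d, $n)               # vectorized path under test
    t_loop = @belapsed [rand($d) for _ in 1:$n]  # scalar baseline
    @printf("n = %6d   rand(d,n): %.3e s   loop: %.3e s\n", n, t_vec, t_loop)
end
```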

@mmikhasenko

It would be great to proceed with merging.
The issue is blocking usage of the package.

@devmotion could you please check, if there is anything essential to test/fix?

thanks

rand(rng::AbstractRNG, d::MixtureModel{Univariate}) =
    rand(rng, component(d, rand(rng, d.prior)))

function rand(rng::AbstractRNG, d::MixtureModel{Univariate}, n::Int)

This is the wrong dispatch, isn't it? If one wants to draw multiple samples from a distribution d, automatically Distributions dispatches to drawing samples with sampler(d). In the case of mixture models this is MixtureSampler. So this should actually be defined for MixtureSampler{Univariate}, not MixtureModel{Univariate} AFAICT?

However, generally for univariate distributions one also shouldn't define rand(rng, dist, n) but only the in-place method _rand!(rng, sampler(dist), out) (#1905 will fix possibly incorrectly allocated output arrays) if multiple samples can be generated more efficiently:

return rand!(rng, sampler(s), out)
function rand!(rng::AbstractRNG, s::Sampleable{Univariate}, A::AbstractArray{<:Real})
    return _rand!(rng, sampler(s), A)
end
function _rand!(rng::AbstractRNG, sampler::Sampleable{Univariate}, A::AbstractArray{<:Real})
    for i in eachindex(A)
        A[i] = rand(rng, sampler)
    end
    return A
end

So AFAICT we should only define

Suggested change
function rand(rng::AbstractRNG, d::MixtureModel{Univariate}, n::Int)
function _rand!(rng::AbstractRNG, d::MixtureSampler{Univariate}, x::AbstractArray{<:Real})

Alternatively, if we never want to use MixtureSampler for sampling MixtureModel{Univariate}, we should define sampler(d::MixtureModel{Univariate}) = d (or limit the definition below to sampler(d::MixtureModel{Multivariate}) = MixtureSampler(d)) and here define

Suggested change
function rand(rng::AbstractRNG, d::MixtureModel{Univariate}, n::Int)
function _rand!(rng::AbstractRNG, d::MixtureModel{Univariate}, x::AbstractArray{<:Real})
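As a toy illustration of the machinery described above (not part of the PR): a minimal Sampleable only needs a scalar rand method for rand(s, n) to route through sampler and the generic _rand! loop:

```julia
using Distributions, Random

# Toy univariate sampler. Defining scalar rand is enough: rand(s, n) falls
# back to sampler(s) and the generic _rand! loop in Distributions.
struct ToyUniform <: Sampleable{Univariate, Continuous} end

Base.rand(rng::Random.AbstractRNG, ::ToyUniform) = rand(rng)

s = ToyUniform()
x = rand(s, 5)  # works without defining rand(rng, s, n) explicitly
```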

Comment on lines +483 to +504
# Find the component with the maximum count to minimize resizing
max_count_idx = argmax(counts)
max_count = counts[max_count_idx]

# Sample from the component with maximum count first and use it directly
x = rand(rng, component(d, max_count_idx), max_count)

# Resize to the full size and continue with other components
resize!(x, n)
offset = max_count

for i in eachindex(counts)
    if i != max_count_idx
        ni = counts[i]
        if ni > 0
            c = component(d, i)
            last_offset = offset + ni - 1
            rand!(rng, c, @view(x[(begin+offset):(begin+last_offset)]))
            offset = last_offset + 1
        end
    end
end

For the in-place method, it seems this could be simplified to

Suggested change
# Find the component with the maximum count to minimize resizing
max_count_idx = argmax(counts)
max_count = counts[max_count_idx]
# Sample from the component with maximum count first and use it directly
x = rand(rng, component(d, max_count_idx), max_count)
# Resize to the full size and continue with other components
resize!(x, n)
offset = max_count
for i in eachindex(counts)
    if i != max_count_idx
        ni = counts[i]
        if ni > 0
            c = component(d, i)
            last_offset = offset + ni - 1
            rand!(rng, c, @view(x[(begin+offset):(begin+last_offset)]))
            offset = last_offset + 1
        end
    end
end
offset = 0
for (c, ni) in zip(components(d), counts)
    last_offset = offset + ni - 1
    rand!(rng, c, @view(x[(begin+offset):(begin+last_offset)]))
    offset = last_offset + 1
end

end
end

function rand(rng::AbstractRNG, d::Truncated, n::Int)

Suggested change
function rand(rng::AbstractRNG, d::Truncated, n::Int)
function _rand!(rng::AbstractRNG, d::Truncated, x::AbstractArray{<:Real})

And if there are any precomputations that could be factored out (doesn't seem to be the case?), then we should think about defining a dedicated sampler.

if tp > 0.25
    # Regime 1: Rejection sampling with batch optimization
    # Get the correct type and memory by sampling from the untruncated distribution
    samples = rand(rng, d0, n)

Suggested change
samples = rand(rng, d0, n)
rand!(rng, d0, x)

samples = rand(rng, d0, n)
n_collected = 0
max_batch = 0
batch_buffer = Vector{eltype(samples)}()

A separate batch buffer seems unnecessary; in particular, the resizing might be inefficient. Instead of copying from a separate vector, we could just use the output vector, move samples around there, and keep track of the last accepted index.
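That in-place idea could look roughly like this hypothetical helper (not the PR code): keep accepted values at the front of the output array, tracking the last accepted index:

```julia
# Compact values inside [lower, upper] to the front of x, in place.
# Returns the index of the last accepted sample; no side buffer needed.
function compact_accepted!(x::AbstractVector, lower, upper)
    j = firstindex(x) - 1
    for i in eachindex(x)
        v = x[i]
        if lower <= v <= upper
            j += 1
            x[j] = v
        end
    end
    return j
end
```

The caller could then refill the tail beyond the returned index with fresh draws from the untruncated distribution and repeat until the whole array is accepted.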

Comment on lines +275 to +277
# Sample one value first to determine the correct type
sample_type = typeof(quantile(d0, d.lcdf + rand(rng) * d.tp))
samples = Vector{sample_type}(undef, n)

Suggested change
# Sample one value first to determine the correct type
sample_type = typeof(quantile(d0, d.lcdf + rand(rng) * d.tp))
samples = Vector{sample_type}(undef, n)

sample_type = typeof(quantile(d0, d.lcdf + rand(rng) * d.tp))
samples = Vector{sample_type}(undef, n)
for i in 1:n
    samples[i] = quantile(d0, d.lcdf + rand(rng) * d.tp)

We should probably at least use a Random.Sampler for rand(rng); maybe it's even faster to call rand(rng, n) (despite the allocation).
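For illustration, hoisting a Random.Sampler out of the loop might look like this sketch (unbenchmarked):

```julia
using Random

rng = Random.default_rng()
# Pre-build a sampler for uniform Float64 draws once, outside the hot loop.
sp = Random.Sampler(rng, Float64)
vals = [rand(rng, sp) for _ in 1:1000]
```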

Comment on lines +284 to +286
# Sample one value first to determine the correct type
sample_type = typeof(invlogcdf(d0, logaddexp(d.loglcdf, d.logtp - randexp(rng))))
samples = Vector{sample_type}(undef, n)

Suggested change
# Sample one value first to determine the correct type
sample_type = typeof(invlogcdf(d0, logaddexp(d.loglcdf, d.logtp - randexp(rng))))
samples = Vector{sample_type}(undef, n)

Co-authored-by: David Widmann <devmotion@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

rand(d, 1000) is not propagated to components

3 participants