Memory-efficient strategy for sampling subsets from ranges without replacement #586
Motivation: Right now (afaict) `proptest` does not provide a way to efficiently (both CPU- and RAM-wise) sample a small number of values from a much, much larger sampling space without replacement.

A concrete example where such a strategy would be very useful: generating random graphs with unique(!) nodes and unique(!) edges, where the sampling space of possible `source->target` edge endpoints (in a graph without parallel edges) grows super-linearly with `O(n * n)`, where `n` is the number of nodes.

There is of course `prop::sample::subsequence()`, but it requires an actual concrete, slice-like sampling source to be passed in, which becomes unusable rather quickly for graphs with more than a couple of thousand nodes. Alternatively, there are `prop::collection::hash_set()`/`::btree_set()`, but those tend to degrade in performance in high-saturation scenarios (i.e. where the sample count is close or equal to the size of the sampling space).

Oftentimes, however, one finds oneself in the lucky situation where the sample values can be derived from their indices into a non-materialized collection.
So what if we could exploit this, sample from a `Range<usize>` of such indices, and only materialize the actual sample values after the sampling, thus avoiding the need for a pre-materialized vec of sample values altogether (in exchange for a relatively small computational overhead per sample)?

This gap is what this PR aims to fill: it implements a memory-efficient strategy for sampling subsets of values from a `Range<T>` by making use of a lazy Fisher-Yates shuffling step. As a result, the strategy's space/time complexity grows linearly with the effective sample count, rather than with the size of the sampling space.

The use of `range.nth(n)` looks like `O(n)`, but ends up being `O(1)` in practice, thanks to specialization: https://godbolt.org/z/azWYqKEoE
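For readers unfamiliar with the trick, the lazy Fisher-Yates step can be sketched roughly as follows. This is a minimal standalone sketch, not the PR's actual implementation; the `sparse_fisher_yates` helper and the toy LCG closure are made up for illustration. The idea is to shuffle a *virtual* array `0..len` while recording only the slots disturbed by swaps in a `HashMap`, so space stays proportional to the sample count:

```rust
use std::collections::HashMap;

/// Sample `amount` distinct indices from the virtual array `0..len` with a
/// sparse ("lazy") Fisher-Yates shuffle: instead of materializing the array,
/// only the slots disturbed by swaps are recorded in a map, keeping space
/// proportional to `amount` rather than `len`.
/// (Hypothetical helper, not this PR's actual API.)
fn sparse_fisher_yates(
    len: usize,
    amount: usize,
    mut rand_below: impl FnMut(usize) -> usize,
) -> Vec<usize> {
    let mut swapped: HashMap<usize, usize> = HashMap::new();
    let mut sample = Vec::with_capacity(amount);
    for i in 0..amount {
        // Classic Fisher-Yates: pick a position `j` uniformly from `i..len`.
        let j = i + rand_below(len - i);
        // The value currently sitting at slot `j` (defaults to `j` itself).
        let picked = *swapped.get(&j).unwrap_or(&j);
        // "Swap": slot `j` now holds whatever slot `i` held; slot `i` is
        // never looked at again, so it does not need to be updated.
        let at_i = *swapped.get(&i).unwrap_or(&i);
        swapped.insert(j, at_i);
        sample.push(picked);
    }
    sample
}

fn main() {
    // Deterministic toy LCG standing in for the real RNG.
    let mut state: u64 = 0x853c_49e6_748f_ea9b;
    let mut rand_below = |m: usize| {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((state >> 33) as usize) % m
    };

    // Sample 5 distinct indices out of a huge space...
    let range = 0u64..1_000_000_000;
    let indices = sparse_fisher_yates(1_000_000_000, 5, &mut rand_below);
    // ...and materialize the values only afterwards. `Range::nth` is O(1)
    // here thanks to the `Step`-based specialization in std.
    let values: Vec<u64> = indices
        .iter()
        .map(|&i| range.clone().nth(i).unwrap())
        .collect();
    println!("{values:?}");
}
```

Note that slot `i` never needs to be written back, since the loop never revisits it; with `amount == len` the output is a full permutation of `0..len`.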
Feel free to suggest better API names! Same for the actual location of the code: should it be part of `crate::sample`?