@regexident regexident commented Jul 7, 2025

Motivation: Right now (afaict) proptest does not provide a way to efficiently (in terms of both CPU and RAM) sample a small number of values, without replacement, from a much, much larger sampling space.

To give a concrete motivating example where such a strategy would be very useful: generating random graphs with unique(!) nodes and unique(!) edges, where the sampling space of possible source->target edge endpoints (in a graph without parallel edges) grows quadratically, i.e. O(n * n), where n is the number of nodes.

There is of course prop::sample::subsequence(), but it requires an actual concrete, slice-like sampling source to be passed in, which becomes unusable rather quickly for graphs with a couple of thousand nodes. Alternatively there is prop::collection::hash_set()/::btree_set(), but those tend to degrade in performance in high-saturation scenarios (i.e. where the sampling count is close or equal to the sampling space's size).


Often, however, one finds oneself in the lucky situation where sample values can be derived from their indices into a non-materialized collection.

So what if we could exploit this and sample from a Range<usize> of such indices and then materialize the actual sample values only after the sampling, thus avoiding the need for a pre-materialized vec of sample values altogether (in exchange for a relatively small computational overhead per sample)?
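To illustrate what "deriving a value from its index" could look like for the graph example, here is a minimal sketch. The function name `edge_from_index` is hypothetical and not part of this PR, and it assumes directed edges with self-loops allowed, so that the index space is exactly 0..n * n:

```rust
/// Decode a flat edge index into a `(source, target)` pair.
///
/// Assumption for this sketch: a directed graph with `node_count` nodes,
/// self-loops allowed, no parallel edges. Every index in
/// `0..node_count * node_count` then maps to exactly one distinct edge.
fn edge_from_index(index: usize, node_count: usize) -> (usize, usize) {
    debug_assert!(index < node_count * node_count);
    // Row-major decoding: the quotient is the source, the remainder the target.
    (index / node_count, index % node_count)
}
```

Since distinct indices decode to distinct edges, sampling unique indices is enough to get unique edges, and no vec of all n * n candidate edges ever has to exist.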

This is the gap this PR aims to fill: it implements a memory-efficient strategy for sampling subsets of values from Range<T> by making use of a lazy Fisher-Yates shuffling step. As a result, the strategy's space and time complexities grow linearly with the effective sample count rather than with the sampling space's size.
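For readers unfamiliar with the idea, here is a rough standalone sketch of a lazy (sparse) Fisher-Yates sampler. It is not the code from this PR: the names `SplitMix64` and `sample_indices` are made up for illustration, the tiny PRNG merely stands in for proptest's own RNG, and the modulo reduction is slightly biased. The point is only that the "array" being shuffled is virtual, with a map remembering the few slots that have been displaced:

```rust
use std::collections::HashMap;

/// Tiny deterministic PRNG (SplitMix64), used here only to keep the sketch
/// dependency-free. A real strategy would draw from the test runner's RNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Value in `low..high` (slightly biased by the modulo; fine for a sketch).
    fn below(&mut self, low: usize, high: usize) -> usize {
        low + (self.next_u64() % ((high - low) as u64)) as usize
    }
}

/// Sample `count` distinct indices from `0..space` without replacement.
///
/// Conceptually this runs the first `count` steps of a Fisher-Yates shuffle
/// over the array `[0, 1, .., space - 1]`, but the array is never built:
/// `displaced` only records slots whose value differs from their own index.
fn sample_indices(space: usize, count: usize, rng: &mut SplitMix64) -> Vec<usize> {
    assert!(count <= space);
    let mut displaced: HashMap<usize, usize> = HashMap::with_capacity(count);
    let mut picked = Vec::with_capacity(count);
    for i in 0..count {
        // Pick a slot from the not-yet-fixed suffix `i..space`.
        let j = rng.below(i, space);
        // Read both virtual slots; an absent entry means "still holds its index".
        let value_at_j = displaced.get(&j).copied().unwrap_or(j);
        // Slot `i` is never read again after this step, so drop its entry.
        let value_at_i = displaced.remove(&i).unwrap_or(i);
        // Swap: slot `j` now holds what slot `i` held; slot `i` is the output.
        displaced.insert(j, value_at_i);
        picked.push(value_at_j);
    }
    picked
}
```

The map never holds more than `count` entries, so memory and time scale with the number of samples, even when `space` is in the billions. Full saturation (`count == space`) degenerates into an ordinary full shuffle instead of the retry loop a set-based approach would need.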


The use of range.nth(n) looks like O(n), but ends up being O(1) in practice, thanks to specialization:
https://godbolt.org/z/azWYqKEoE
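As a quick sanity check independent of the godbolt link, the following sketch (the helper name `nth_in_range` is made up for illustration) jumps across roughly half of the usize value space. As far as I can tell this returns immediately on current standard library versions, which would not be feasible if nth() actually stepped through the range element by element:

```rust
use std::ops::Range;

/// Return the `n`-th element of `range`, or `None` if `n` is out of bounds.
/// For `Range<usize>` this is expected to be computed arithmetically
/// (roughly `start + n`, followed by a bounds check), not by iterating.
fn nth_in_range(mut range: Range<usize>, n: usize) -> Option<usize> {
    range.nth(n)
}
```

For example, `nth_in_range(10..usize::MAX, usize::MAX / 2)` yields `Some(10 + usize::MAX / 2)`, while `nth_in_range(0..5, 5)` yields `None`.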


Feel free to suggest better API names! Same for the actual location of the code. Should it be part of crate::sample?
