Memory-efficient strategy for sampling subsets from ranges without replacement #586
Motivation: Right now (afaict) `proptest` does not provide a way to efficiently (both CPU- and RAM-wise) sample a small number of values from a much, much larger sampling space without replacement.

A concrete example where such a strategy would be very useful: generating random graphs with unique(!) nodes and unique(!) edges, where the sampling space of possible `source->target` edge endpoints (in a graph without parallel edges) grows super-linearly with `O(n * n)`, where `n` is the number of nodes.

There is of course `prop::sample::subsequence()`, but it requires an actual concrete, slice-like sampling source to be passed in, which becomes unusable rather quickly for graphs with more than a couple of thousand nodes. Alternatively, there are `prop::collection::hash_set()`/`::btree_set()`, but those tend to degrade in performance in high-saturation scenarios (i.e. where the sample count is close or equal to the size of the sampling space).

Oftentimes, however, one finds oneself in the lucky situation where the sample values can be derived from their indices into a non-materialized collection.
So what if we could exploit this, sample from a `Range<usize>` of such indices, and only materialize the actual sample values after the sampling, thus avoiding the need for a pre-materialized vec of sample values altogether (in exchange for a relatively small computational overhead per sample)?

This gap is what this PR aims to fill: it implements a memory-efficient strategy for sampling subsets of values from a `Range<T>` by making use of a lazy Fisher-Yates shuffling step. As a result, the strategy's space/time complexity grows linearly with the effective sample count, rather than with the size of the sampling space.

The use of `range.nth(n)` looks like `O(n)`, but ends up being `O(1)` in practice, thanks to specialization: https://godbolt.org/z/azWYqKEoE
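For readers unfamiliar with the trick, the lazy Fisher-Yates step can be sketched roughly as follows. This is a minimal standalone sketch, not the PR's actual implementation; the `sparse_fisher_yates` helper and the toy LCG closure are made up for illustration. The idea is to shuffle a *virtual* array `0..len` while recording only the slots disturbed by swaps in a `HashMap`, so space stays proportional to the sample count:

```rust
use std::collections::HashMap;

/// Sample `amount` distinct indices from the virtual array `0..len` with a
/// sparse ("lazy") Fisher-Yates shuffle: instead of materializing the array,
/// only the slots disturbed by swaps are recorded in a map, keeping space
/// proportional to `amount` rather than `len`.
/// (Hypothetical helper, not this PR's actual API.)
fn sparse_fisher_yates(
    len: usize,
    amount: usize,
    mut rand_below: impl FnMut(usize) -> usize,
) -> Vec<usize> {
    let mut swapped: HashMap<usize, usize> = HashMap::new();
    let mut sample = Vec::with_capacity(amount);
    for i in 0..amount {
        // Classic Fisher-Yates: pick a position `j` uniformly from `i..len`.
        let j = i + rand_below(len - i);
        // The value currently sitting at slot `j` (defaults to `j` itself).
        let picked = *swapped.get(&j).unwrap_or(&j);
        // "Swap": slot `j` now holds whatever slot `i` held; slot `i` is
        // never looked at again, so it does not need to be updated.
        let at_i = *swapped.get(&i).unwrap_or(&i);
        swapped.insert(j, at_i);
        sample.push(picked);
    }
    sample
}

fn main() {
    // Deterministic toy LCG standing in for the real RNG.
    let mut state: u64 = 0x853c_49e6_748f_ea9b;
    let mut rand_below = |m: usize| {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((state >> 33) as usize) % m
    };

    // Sample 5 distinct indices out of a huge space...
    let range = 0u64..1_000_000_000;
    let indices = sparse_fisher_yates(1_000_000_000, 5, &mut rand_below);
    // ...and materialize the values only afterwards. `Range::nth` is O(1)
    // here thanks to the `Step`-based specialization in std.
    let values: Vec<u64> = indices
        .iter()
        .map(|&i| range.clone().nth(i).unwrap())
        .collect();
    println!("{values:?}");
}
```

Note that slot `i` never needs to be written back, since the loop never revisits it; with `amount == len` the output is a full permutation of `0..len`.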
Feel free to suggest better API names! Same for the actual location of the code: should it be part of `crate::sample`?