Replies: 1 comment
Hi. This is an interesting use case. It should be doable, but I'd need to dig into the details, which will take at least two weeks before I have the time. A workaround would be to set everything up as if you'd be rerunning everything from scratch, but then detect if the results have already been computed and return early (e.g. with a …).

It's not clear to me how you decide whether something needs to be re-computed or not. It sounds like you know this information upfront. For instance, say your list of input data.frames is called ….

Rather technical detouring comment: it would be nice if ….
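The early-return workaround mentioned above could be sketched roughly like this (a minimal sketch; the function names `estimate_A_cached`, `estimate_A`, and the `cache` directory are hypothetical, not from the thread):

```r
# Hypothetical sketch: compute as if rerunning from scratch, but
# return early when a cached result for this organism already exists.
estimate_A_cached <- function(df, organism, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE)
  f <- file.path(cache_dir, paste0(organism, ".rds"))
  if (file.exists(f)) {
    return(readRDS(f))       # already computed: return early
  }
  res <- estimate_A(df)      # estimate_A() is assumed to exist
  saveRDS(res, f)            # persist so the next run can skip it
  res
}
```

With this pattern the full pipeline is always launched the same way, and only the expensive per-organism work is skipped when a cached file is found.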
I'm using `future_map` from the `furrr` package to parallelize a function I made to estimate a value `A` (via bootstrapping) across a list of data frames (`x`), where each data frame corresponds to one organism. I set a single seed with `future_options`; however, I ran into the problem that when I ran this same script on a subset of that list of data frames (meaning not all organisms were included), the estimated `A` for any given organism was different from the `A` obtained for that same organism when I ran it on the full list.

So I thought I could generate a list of seeds and assign one seed to each organism, so that if I subset `x` I can subset the seed list the same way and always get the same results per organism regardless of the order/number of organisms in the data frame list. I tried this with the second method you showed for generating valid `.Random.seed` sequences: I generated a list of valid seeds whose length is the length of the unfiltered `x`, and each entry of the seed list gets assigned to an organism in a data frame. So when I want to run my function on a subset of organisms, I just filter that data frame and take the filtered seed list. But it didn't work, i.e. the `A` value for Organism1 was different from the `A` value for Organism1 in the filtered `x` run.

Does anyone have any idea how to approach this problem? I hope I made myself clear; it could also be that I'm just misunderstanding how the pre-generated seeds work. Would like to know what you think!
I should also note that when I repeatedly run the same `x` (filtered or non-filtered, as long as it stays the same), the `A` values stay the same, which I suppose is expected. Another note: the reason I'm subsetting is to reduce the number of hypothesis tests done inside my function, but that doesn't have anything to do with estimating `A`.