[help] Memory-efficient filtering of large dynamic branching targets within a single pipeline #1549

uhkeller · 2025-11-17T19:41:29Z

Help

I understand and agree to https://books.ropensci.org/targets/help.html.

Description

I'm trying to find a way to efficiently retrieve a subset of results from a large target in a pipeline without loading the entire target into memory. Specifically, I have a target that produces list columns holding a lot of data, and I want to select specific elements based on criteria computed in a much smaller summary target. Additional downstream targets should then work on the filtered subset.

It's straightforward to do this using tar_read() outside of the pipeline. But that approach would mean splitting the project into multiple targets projects, which is inconvenient. Another approach might be switching to static branching, but that would become unwieldy with my actual data.

Is there any way to achieve this within a single targets pipeline?

Here's a toy example to clarify what I'm after:

library(targets)

tar_script(
  {
    library(targets)
    library(tarchetypes)

    tar_option_set(packages = c("dplyr"))

    list(
      tar_group_by(
        data,
        tibble(group = rep(letters[1:10], each = 10), x = rnorm(100)),
        group,
        iteration = "group"
      ),
      tar_target(
        huge_results,
        data |>
          summarise(group = group[1], really_big = list(x^2)),
        pattern = map(data)
      ),
      tar_target(
        tiny_summaries,
        huge_results |>
          mutate(really_small = mean(unlist(really_big))) |> 
          select(group, really_small),
        pattern = map(huge_results)
      )
    )
  },
  ask = FALSE
)

tar_make(reporter = "silent")

# Efficiently retrieve subset of huge_results outside of the pipeline
select_index <- order(tar_read(tiny_summaries)$really_small)[1:2]
tar_read(huge_results, select_index)
#> # A tibble: 2 × 2
#>   group really_big
#>   <chr> <list>    
#> 1 c     <dbl [10]>
#> 2 e     <dbl [10]>

Created on 2025-11-17 with reprex v2.1.1

noamross · 2025-11-20T00:26:05Z

Maintainer

This is a bit fragile, but retrieval = "none" leaves loading the targets fully up to the user. With tar_branches(), you can look up the branch names for each index number and then load the objects manually. If you are using different formats, you will want something other than readRDS() to load the data:

library(targets)

tar_script(
  {
    library(targets)
    library(tarchetypes)

    tar_option_set(packages = c("dplyr"))

    list(
      tar_group_by(
        data,
        tibble(group = rep(letters[1:10], each = 10), x = rnorm(100)),
        group,
        iteration = "group"
      ),
      tar_target(
        huge_results,
        data |>
          summarise(group = group[1], really_big = list(x^2)),
        pattern = map(data)
      ),
      tar_target(
        tiny_summaries,
        huge_results |>
          mutate(really_small = mean(unlist(really_big))) |>
          select(group, really_small),
        pattern = map(huge_results)
      ),
      tar_target(
        select_index,
        order(tiny_summaries$really_small)[1:2]
      ),
      tar_target(
        huge_subset,
        {
          index <- readRDS(tar_path_target("select_index"))
          objects <- tar_branches(huge_results)$huge_results[index]
          lapply(objects, function(x) {
            # Using this internal because I can't figure out how to deal with `tar_path_target()`'s non-standard evaluation
            readRDS(tar_runtime_object()$meta$get_record(x)$path)
          }) |>
            bind_rows()
        },
        retrieval = "none"
      )
    )
  },
  ask = FALSE
)

tar_make()

tar_read(huge_subset)
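
On the remark about formats: a minimal sketch of a format-aware reader that could stand in for the bare readRDS() call above. The helper name read_branch_file() and its format argument are hypothetical and not part of targets. The arrow readers are only assumed to match what the corresponding targets storage formats write, so treat this as a starting point rather than a tested recipe.

```r
# Hypothetical helper (not part of targets): pick a reader that matches the
# storage format of the branch file being loaded by hand.
read_branch_file <- function(path, format = "rds") {
  switch(
    format,
    rds = readRDS(path),
    # Assumption: files stored with format = "parquet" or "feather" can be
    # read back directly with the arrow readers.
    parquet = arrow::read_parquet(path),
    feather = arrow::read_feather(path),
    # Unnamed last argument: fallback for any format without a reader
    stop("No reader defined for format '", format, "'", call. = FALSE)
  )
}

# Usage with the default format, on a temporary file
tmp <- tempfile(fileext = ".rds")
saveRDS(data.frame(group = "a", x = 1:3), tmp)
res <- read_branch_file(tmp)
```

Inside huge_subset, the lapply() body would then call read_branch_file() with the path of each branch and the format used by huge_results.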

uhkeller · 2025-11-20T07:02:24Z

Author

Edit: The fix for the problem described below is to simply not quote select_index in the call to tar_path_target()!

Thank you! It turns out, though, that when starting from a clean slate the solution is even more fragile than I expected:

> tar_make()
+ data dispatched                           
✔ data completed [9ms, 1.07 kB]
+ huge_results declared [10 branches]
✔ huge_results completed [5ms, 2.44 kB]                     
+ tiny_summaries declared [10 branches]                     
✔ tiny_summaries completed [13ms, 1.62 kB]                    
+ huge_subset dispatched                                      
✖ huge_subset errored                                         
✖ errored pipeline [261ms, 21 completed, 0 skipped]           
Warning messages:
1: cannot open compressed file '_targets/objects/select_index', probable reason 'No such file or directory'

The issue is that targets can't discover the dependency of huge_subset on select_index, so the pipeline only runs when select_index is already present.
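
A minimal sketch of the fix mentioned in the edit at the top of this reply, assuming the rest of the pipeline stays unchanged. Referring to select_index as a bare symbol lets the static code analysis in targets register it as an upstream dependency, so it gets built before huge_subset runs. The branch-loading part is copied from the previous reply as-is, including its use of the internal tar_runtime_object(). The variable name huge_subset_target is only there for illustration.

```r
library(targets)

# Replacement definition for the `huge_subset` target in the pipeline above.
huge_subset_target <- tar_target(
  huge_subset,
  {
    # Bare symbol instead of a string: the symbol in the command is what
    # makes `select_index` show up as a dependency of this target.
    index <- readRDS(tar_path_target(select_index))
    objects <- tar_branches(huge_results)$huge_results[index]
    lapply(objects, function(x) {
      # Unchanged from the previous reply (relies on a targets internal)
      readRDS(tar_runtime_object()$meta$get_record(x)$path)
    }) |>
      bind_rows()
  },
  # Dependencies are still tracked, but loading them is left to the command
  retrieval = "none"
)
```

In the pipeline list, this definition would take the place of the original huge_subset entry.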

I've come up with my own workaround, which uses a branched file target for huge_results. Besides not relying on targets internals, it has the advantage that assembling huge_subset_processed doesn't even require loading all the filtered results into memory at once. The big disadvantage is that I'm duplicating the functionality of targets in a way, and the file targets introduce additional complexity. I have to admit I don't quite understand why huge_subset_files can't be a file target itself. If I make it one, then huge_subset_processed, huge_subset_files_files, and huge_subset_files are forever outdated.

library(targets)

tar_script(
  {
    library(targets)
    library(tarchetypes)

    huge_results_folder <- "_targets/user/huge_results"
    dir.create(huge_results_folder, showWarnings = FALSE, recursive = TRUE)

    tar_option_set(packages = c("dplyr"))

    list(
      tar_group_by(
        data,
        tibble(group = rep(letters[1:10], each = 10), x = rnorm(100)),
        group,
        iteration = "group"
      ),
      tar_target(
        huge_results_files,
        {
          # file.path() adds the separator so the file lands inside huge_results_folder
          path <- file.path(huge_results_folder, paste0(data$tar_group[1], ".rds"))
          res <-
            data |>
            summarise(group = group[1], really_big = list(x^2))
          saveRDS(res, path)
          path
        },
        format = "file",
        pattern = map(data)
      ),
      tar_target(
        huge_results_cleanup,
        list.files(huge_results_folder, full.names = TRUE) |>
          setdiff(huge_results_files) |>
          unlink()
      ),
      tar_target(
        tiny_summaries,
        huge_results_files |>
          readRDS() |>
          mutate(really_small = mean(unlist(really_big))) |>
          select(group, really_small),
        pattern = map(huge_results_files)
      ),
      tar_target(
        huge_subset_files,
        {
          huge_results_files[order(tiny_summaries$really_small)[1:2]]
        }
      ),
      tar_target(
        huge_subset_processed,
        readRDS(huge_subset_files) |>
          mutate(sum_really_big = sum(unlist(really_big)), .keep = "unused"),
        pattern = map(huge_subset_files)
      )
    )
  },
  ask = FALSE
)

tar_destroy(ask = FALSE)
tar_make()
tar_read(huge_subset_processed)

I really wish there was a clean, proper way to achieve this.
