Replies: 10 comments 1 reply
-
Side note for this: if you have more than the main process writing to the database, you will need to deal with access contention. DuckDB supports either multiple read-only readers or a single writable connection; you cannot have one connection write while any other connection is reading. See the DuckDB documentation on Concurrency.
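The single-writer constraint can be seen directly with the duckdb R package (a minimal sketch; the file and table names here are illustrative, not anything from targets):

```r
library(DBI)

db <- tempfile(fileext = ".duckdb")

# Only one writable connection may exist at a time.
con_w <- dbConnect(duckdb::duckdb(), dbdir = db)
dbWriteTable(con_w, "meta", data.frame(target = "x", seconds = 1.2))
dbDisconnect(con_w, shutdown = TRUE)

# Once the writer has disconnected, multiple read-only connections are fine.
con_r1 <- dbConnect(duckdb::duckdb(), dbdir = db, read_only = TRUE)
con_r2 <- dbConnect(duckdb::duckdb(), dbdir = db, read_only = TRUE)
res <- dbGetQuery(con_r1, "SELECT * FROM meta")
dbDisconnect(con_r1, shutdown = TRUE)
dbDisconnect(con_r2, shutdown = TRUE)
res
```

Attempting to open a second writable connection (or a writer while a reader is open, from another process) is where the contention shows up.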
-
Thanks for pointing that out. This may be a deal-breaker for metadata. Redis seems like another good option if the profiling data shows a speedup, though I don't think I'd make Redis the default because it can be a burden to install.
-
I have considered DuckDB for many larger-scale projects where having a local quasi-permanent store would be good, but the concurrency issue is a deal-breaker, and not one they seem eager to remedy, unfortunately. Redis would be easy enough for many, but I agree that, easy as it is, running a Redis instance "just for this" might be more than some people prefer. If you're considering Redis but don't want the server overhead (and perhaps would like on-disk persistence), rlite might be worth a look. I haven't benchmarked it or verified its concurrency behavior (other than finding seppo0010/rlite#13 (comment)). Admittedly, it hasn't seen commits in many years; I don't know whether that means it's awesome-stable or not. richfitz is the author/maintainer (also maintains …).
-
I have been profiling example pipelines, and the bottleneck seems to be reopening the metadata file on every append.
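A quick way to see the cost of reopening a file for every append, versus holding one connection open for the whole loop (a hypothetical microbenchmark; the file format and names are illustrative, not targets internals):

```r
path <- tempfile()
n <- 5000L

# Open, append one record, close -- repeated for every write.
t_reopen <- system.time(
  for (i in seq_len(n)) {
    con <- file(path, open = "at")
    writeLines(sprintf("target_%d|done", i), con)
    close(con)
  }
)

# One connection held open for the entire loop.
t_persist <- system.time({
  con <- file(path, open = "at")
  for (i in seq_len(n)) writeLines(sprintf("target_%d|done", i), con)
  close(con)
})

c(reopen = t_reopen[["elapsed"]], persist = t_persist[["elapsed"]])
```

On most systems the open/close overhead dominates the reopen variant, which matches the persistent-connection speedup reported below.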
-
Maintaining a persistent connection seems to reduce execution time from around 60 seconds down to around 23 seconds on an M2 Mac in the following 10000-target pipeline:

```r
library(targets)
tar_option_set(
  controller = crew::crew_controller_local(workers = 25L)
)
list(
  tar_target(datasets, seq_len(1e4), memory = "persistent"),
  tar_target(models, datasets, pattern = map(datasets), retrieval = "main")
)
```
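For anyone wanting to reproduce a timing like the one above on a smaller scale, a run might be set up with `tar_dir()` and `tar_script()` (both real targets helpers for self-contained examples); the pipeline here is a trivial stand-in, not the benchmark pipeline itself:

```r
library(targets)

elapsed <- NULL
tar_dir({                                        # run everything in a temp directory
  tar_script(list(tar_target(x, seq_len(100))))  # write a minimal _targets.R
  elapsed <<- system.time(tar_make())[["elapsed"]]  # time a cold run
})
elapsed
```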
-
There's still a ~30% bottleneck in …
-
Negligible improvement with …
-
This only shows up when targets complete instantaneously. Moving to a database may be necessary, but on reflection, it is a bit extreme. Converting to a discussion.
-
Since you had considered Redis, have you looked at …?
-
I'm encountering a hard-to-track bug while running a targets pipeline on a Shiny server hosted inside Azure App Service. Moving the metadata file to a proper database might be a solution.
-
`targets` uses simple text files for metadata (pipe-separated values). These files can get large in pipelines with many targets (#1390), and appending to them creates overhead in `tar_make()`. Maybe `targets` can instead use a DuckDB database for metadata.
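For context, the append-style pipe-separated metadata pattern described above can be sketched like this (illustrative only; the function and field names are made up, not targets internals):

```r
meta_path <- tempfile(fileext = ".txt")

# Hypothetical helper: append one pipe-separated metadata record per target.
append_meta <- function(path, name, seconds) {
  line <- paste(name, format(seconds), sep = "|")
  cat(line, "\n", file = path, sep = "", append = TRUE)
}

append_meta(meta_path, "dataset_1", 0.42)
append_meta(meta_path, "model_1", 1.7)
lines <- readLines(meta_path)
lines
```

Each completed target adds one line, so both the file size and the per-append cost grow with the number of targets, which is the overhead the discussion is about.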