Description
First, thank you for the pins package. On my team at the World Bank, pins has become central to our pipelines. Because we rely on it heavily, the behavior below is creating a lot of unnecessary version churn and downstream instability for us.
Summary
pins appears to decide “has this changed?” by hashing the file bytes (via digest::digest()), not the object. With qs, two saves of the same object can produce slightly different bytes, so the file hash changes and pins creates unnecessary new versions. Using secretbase::siphash13 yields stable file hashes; using qs2 (deterministic settings) also helps. I think an opt-in stable hashing path could be a good solution.
Why it happens
- digest::digest(file) is sensitive to byte-level variations in serialized output.
- qs may change bytes across saves despite identical content.
- siphash13(file=…) is stable in these cases; qs2 also enables deterministic serialization.
Reprex (simulates pins’ file-hash check)
# Same object, saved twice with qs
x <- list(a=1:3, b=list(y="z"))
f1 <- tempfile(fileext=".qs")
f2 <- tempfile(fileext=".qs")
qs::qsave(x, f1, check_hash=TRUE)
qs::qsave(x, f2, check_hash=TRUE)
# File-level digests differ → would look like "changed" to pins
digest::digest(f1, algo="xxhash64")
#> [1] "54db9bb61f3033fa"
digest::digest(f2, algo="xxhash64")
#> [1] "65c7307f0cebca90"
# Stable alternative: siphash13 (file-level stable for identical content)
secretbase::siphash13(file=f1)
#> [1] "06df1cf71318cbac"
secretbase::siphash13(file=f2)
#> [1] "06df1cf71318cbac"
# Objects are identical
waldo::compare(qs::qread(f1), qs::qread(f2))
#> ✔ No differences
Created on 2025-10-27 with reprex v2.1.1
(I also tested qs2::qs_save(..., shuffle = FALSE): object-level hashes match deterministically; file-byte digest can still differ, while siphash13 remains stable.)
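One detail worth double-checking when reproducing the comparison: digest::digest() has file = FALSE by default, so a path passed as the first argument is hashed as a character string, while file = TRUE hashes the bytes on disk. The sketch below illustrates the difference using two byte-identical files written with base R (it does not use qs, and it only assumes the digest package is installed), so it shows the mechanics rather than qs's actual behavior.

```r
# Minimal sketch: path-string hash vs. file-content hash with digest.
# The files are written with writeBin() so their bytes are identical by construction.
x <- list(a = 1:3, b = list(y = "z"))
bytes <- serialize(x, connection = NULL)

f1 <- tempfile(fileext = ".bin")
f2 <- tempfile(fileext = ".bin")
writeBin(bytes, f1)
writeBin(bytes, f2)

# Default file = FALSE: the path string itself is serialized and hashed.
path_hashes <- c(
  digest::digest(f1, algo = "xxhash64"),
  digest::digest(f2, algo = "xxhash64")
)

# file = TRUE: the contents of the file are hashed.
file_hashes <- c(
  digest::digest(f1, algo = "xxhash64", file = TRUE),
  digest::digest(f2, algo = "xxhash64", file = TRUE)
)

# The two tempfile() paths differ, so the path-string hashes differ.
print(path_hashes[1] == path_hashes[2])
# The bytes on disk are identical, so the content hashes match.
print(file_hashes[1] == file_hashes[2])
```

Running the same two calls with file = TRUE on the .qs files from the reprex would show whether the on-disk bytes really differ between saves.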
Proposal (any of the following)
- A. Allow custom hash function (e.g., options(pins.hash_fn = ...) or a board arg hash_fn=) so users can choose secretbase::siphash13 or an object-level hasher.
- B. Built-in stable option: expose hash = c("digest", "siphash13", "object"); keep current default for compatibility.
- C. Serializer profile for qs2 with deterministic settings (e.g., shuffle = FALSE), optionally combined with B.
- D. Object-level hashing (deserialize then hash) as an opt-in for maximal reproducibility.
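To make options A, B, and D more concrete, here is a rough sketch of what an opt-in hashing switch could look like. Everything in it is hypothetical: pin_file_hash(), its method and reader arguments, and the pins.hash_fn option are illustrative names from this proposal, not existing pins API. The only real calls are digest::digest() and secretbase::siphash13(file = ...).

```r
# Hypothetical helper (not part of pins): pick a hashing strategy for a pinned file.
pin_file_hash <- function(path,
                          method = c("digest", "siphash13", "object"),
                          reader = readRDS) {
  # Option A: a user-supplied function set via a hypothetical option takes priority.
  custom <- getOption("pins.hash_fn", default = NULL)
  if (is.function(custom)) {
    return(custom(path))
  }

  # Option B: choose among built-in strategies; "digest" stays the default.
  method <- match.arg(method)
  switch(method,
    # Hash the bytes on disk with digest.
    digest = digest::digest(path, algo = "xxhash64", file = TRUE),
    # Hash the bytes on disk with secretbase.
    siphash13 = secretbase::siphash13(file = path),
    # Option D: deserialize with `reader`, then hash the R object itself.
    object = digest::digest(reader(path), algo = "xxhash64")
  )
}

# Usage: two saves of the same object compared at the object level.
x <- list(a = 1:3, b = list(y = "z"))
f1 <- tempfile(fileext = ".rds")
f2 <- tempfile(fileext = ".rds")
saveRDS(x, f1)
saveRDS(x, f2)

same_object <- identical(
  pin_file_hash(f1, method = "object"),
  pin_file_hash(f2, method = "object")
)
print(same_object)
```

The object-level path costs a full read of the file on every comparison, which is why it is framed as opt-in; for qs or qs2 files the reader argument would be swapped for qs::qread or the corresponding qs2 reader.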
I think any of these options would avoid spurious versions, storage churn, and non-reproducible pipelines when using pins as a content-addressed cache. Thank you so much for your support.