Stable hashing for pins: file-byte hashes cause false versioning with qs; siphash13/qs2 fix it #882

@randrescastaneda

First, thank you for the pins package. On my team at the World Bank, pins has become central to our pipelines. Because we rely on it heavily, the behavior below is creating a lot of unnecessary version churn and downstream instability for us.

Summary
pins appears to decide “has this changed?” by hashing the file bytes (via digest::digest()), not the object. With qs, two saves of the same object can produce slightly different bytes, so the file hash changes and pins creates unnecessary new versions. Using secretbase::siphash13 yields stable file hashes; using qs2 (deterministic settings) also helps. I think an opt-in stable hashing path could be a good solution.

Why it happens

  • digest::digest(file) is sensitive to byte-level variations in serialized output.
  • qs may change bytes across saves despite identical content.
  • siphash13(file=…) is stable in these cases; qs2 also enables deterministic serialization.

Reprex (simulates pins’ file-hash check)

# Same object, saved twice with qs
x <- list(a=1:3, b=list(y="z"))
f1 <- tempfile(fileext=".qs") 
f2 <- tempfile(fileext=".qs")
qs::qsave(x, f1, check_hash=TRUE) 
qs::qsave(x, f2, check_hash=TRUE)

# File-level digests differ → would look like "changed" to pins
digest::digest(f1, algo="xxhash64")
#> [1] "54db9bb61f3033fa"
digest::digest(f2, algo="xxhash64")
#> [1] "65c7307f0cebca90"

# Stable alternative: siphash13 (file-level stable for identical content)
secretbase::siphash13(file=f1)
#> [1] "06df1cf71318cbac"
secretbase::siphash13(file=f2)
#> [1] "06df1cf71318cbac"

# Objects are identical
waldo::compare(qs::qread(f1), qs::qread(f2))
#> ✔ No differences

Created on 2025-10-27 with reprex v2.1.1

(I also tested qs2::qs_save(..., shuffle = FALSE): object-level hashes match deterministically; file-byte digest can still differ, while siphash13 remains stable.)
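One detail worth double-checking in the reprex: digest::digest() only reads a file when file = TRUE. Called with just a path, it serializes and hashes the path string itself, so two different temp paths give different digests even when the files are byte-identical. A minimal sketch of how the contents could be compared directly, using two temp files written with writeBin() as an illustrative stand-in for the .qs files (the names g1, g2, and payload are hypothetical):

```r
# Two temp files with byte-identical contents (stand-in for the .qs files)
payload <- as.raw(1:16)
g1 <- tempfile(fileext = ".bin")
g2 <- tempfile(fileext = ".bin")
writeBin(payload, g1)
writeBin(payload, g2)

# Without file = TRUE, digest() hashes the serialized path string,
# so the result depends on the file name, not on the file contents
path_hashes_equal <- digest::digest(g1, algo = "xxhash64") ==
  digest::digest(g2, algo = "xxhash64")

# With file = TRUE, digest() hashes the bytes on disk
file_hashes_equal <- digest::digest(g1, algo = "xxhash64", file = TRUE) ==
  digest::digest(g2, algo = "xxhash64", file = TRUE)

# Byte-for-byte comparison with base R, independent of any hash function
bytes_equal <- identical(
  readBin(g1, "raw", n = file.size(g1)),
  readBin(g2, "raw", n = file.size(g2))
)

print(c(path = path_hashes_equal, file = file_hashes_equal, bytes = bytes_equal))
```

Running the same three checks on f1 and f2 from the reprex would show whether the qs output really differs at the byte level or whether the differing digests come from the paths.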

Proposal (any of the following)

  • A. Allow custom hash function (e.g., options(pins.hash_fn = ...) or a board arg hash_fn=) so users can choose secretbase::siphash13 or an object-level hasher.
  • B. Built-in stable option: expose hash = c("digest", "siphash13", "object"); keep current default for compatibility.
  • C. Serializer profile for qs2 with deterministic settings (e.g., shuffle = FALSE), optionally combined with B.
  • D. Object-level hashing (deserialize then hash) as an opt-in for maximal reproducibility.
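To make options A and D concrete, here is a rough sketch of what an opt-in hash hook might look like. Everything in it is hypothetical: hash_pin_file(), object_hasher(), and the pins.hash_fn option are illustrative names, not part of pins, and the default branch is only my assumption of a file-byte hash. saveRDS() with and without compression stands in for a serializer that writes different bytes for the same object.

```r
# Hypothetical sketch; none of these names exist in pins.
hash_pin_file <- function(path, hash_fn = getOption("pins.hash_fn")) {
  if (is.null(hash_fn)) {
    # Assumed default: hash the bytes on disk
    return(digest::digest(path, algo = "xxhash64", file = TRUE))
  }
  stopifnot(is.function(hash_fn))
  hash_fn(path)
}

# Option D: build an object-level hasher from a reader function
# (deserialize the file, then hash the resulting R object)
object_hasher <- function(reader) {
  function(path) digest::digest(reader(path), algo = "xxhash64")
}

# Same object written twice with different on-disk bytes
x <- list(a = 1:3, b = list(y = "z"))
f1 <- tempfile(fileext = ".rds")
f2 <- tempfile(fileext = ".rds")
saveRDS(x, f1)                    # gzip-compressed
saveRDS(x, f2, compress = FALSE)  # uncompressed

# Default path: file-byte hashes
byte_hashes_match <- hash_pin_file(f1) == hash_pin_file(f2)

# Opt-in path: object-level hashes via the hypothetical option
old <- options(pins.hash_fn = object_hasher(readRDS))
object_hashes_match <- hash_pin_file(f1) == hash_pin_file(f2)
options(old)  # restore previous options

print(c(bytes = byte_hashes_match, object = object_hashes_match))
```

The trade-off is that an object-level hasher has to read and deserialize the file, which costs more than hashing bytes, so it seems better suited as an opt-in than as a default.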

I think any of these options would avoid spurious versions, storage churn, and non-reproducible pipelines when using pins as a content-addressed cache. Thank you so much for your support.
