restore mypy cache.db copy-back atomicity #23081

Open
cburroughs wants to merge 3 commits into pantsbuild:main from cburroughs:csb/more-atomic-power
@cburroughs
Contributor

@cburroughs cburroughs commented Feb 6, 2026

There is a long chain here where c45f855 introduced the cp-in/mv-out pattern, 86f17a3 added device id checking so that said operation was atomic, and then cf96911 simplified and improved portability. However, two issues are present or regressed after all that:

  • The point of cp+mv is that mv is only atomic on the same file system. So it needs to be "cp from fs A to fs B, then mv on B". This had it reversed: "cp from fs A(sandbox) to fs A(sandbox), then mv to B(named_cache)" which isn't atomic.
  • ln does not overwrite hardlinks, so ln "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB" could only succeed on the first run, which makes it a pretty niche optimization.

This resolves both issues by always trying ln(sandbox->named)+mv first, and then falling back to cp(sandbox->named)+mv.

A few notes on alternatives or constraints:

  • ln -f isn't atomic, so that won't help.
  • mktemp makes a file, so we can't use that for the hardlink path since there would already be a file in the way.
  • I think using the pid ($$) is fine since we are not adversarial and nothing portable is worth the extra complexity.
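The ordering described above can be sketched as follows. This is a minimal illustration, not the script in the PR: it assumes plain `ln`, `cp`, `mv`, and `mktemp` on `PATH` instead of the templated `{ln.path}`-style binaries, `copy_back` is a hypothetical helper name, and both directories live under one temp dir, so the hardlink branch is the one exercised here.

```shell
#!/bin/sh
# Hypothetical stand-ins for the sandbox and the named cache directories.
WORK=$(mktemp -d) || exit 1
mkdir "$WORK/sandbox" "$WORK/named_cache"
SANDBOX_CACHE_DB="$WORK/sandbox/cache.db"
NAMED_CACHE_DB="$WORK/named_cache/cache.db"

# copy_back is an illustrative helper, not the code from the PR.
copy_back() {
  # Pick an unused name next to the destination so the final mv
  # stays within one filesystem.
  tmp=$(mktemp -u "$NAMED_CACHE_DB.tmp.XXXXXX") || return 1
  # Prefer a hardlink; fall back to a copy (e.g. across filesystems).
  ln "$SANDBOX_CACHE_DB" "$tmp" 2>/dev/null ||
    cp "$SANDBOX_CACHE_DB" "$tmp" || return 1
  # A rename within one filesystem atomically replaces the destination.
  mv "$tmp" "$NAMED_CACHE_DB"
}

echo "first run" > "$SANDBOX_CACHE_DB"
copy_back

# Simulate a later run with a fresh sandbox file (a new inode).
rm "$SANDBOX_CACHE_DB"
echo "second run" > "$SANDBOX_CACHE_DB"
copy_back

cat "$NAMED_CACHE_DB"
```

The second call succeeds because the link is made at a fresh temporary name and only the `mv` touches the existing `cache.db`.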

@cburroughs cburroughs marked this pull request as ready for review February 6, 2026 19:22
@cburroughs cburroughs self-assigned this Feb 6, 2026
if [ $EXIT_CODE -le 1 ]; then
if ! {ln.path} "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB" > /dev/null 2>&1; then
TMP_CACHE=$({mktemp.path} "$SANDBOX_CACHE_DB.tmp.XXXXXX")
if {ln.path} "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB.$$.tmp" > /dev/null 2>&1; then

mktemp makes a file, so we can't use that for the hardlink path since there would already be a file in the way.

What if you used

Suggested change
if {ln.path} "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB.$$.tmp" > /dev/null 2>&1; then
if {ln.path} "$SANDBOX_CACHE_DB" $({mktemp.path} -u "$NAMED_CACHE_DB.tmp.XXXXXX") > /dev/null 2>&1; then
‘-u’
‘--dry-run’
     Generate a temporary name that does not name an existing file,
     without changing the file system contents.

The mktemp docs call this "unsafe": "Using the output of this command to create a new file is inherently unsafe, as there is a window of time between generating the name and using it where another process can create an object by the same name. [...] the attacker can create an appropriately named symbolic link, such that when the script then opens a handle to what it thought was an unused file, it is instead modifying an existing file". That seems fine here, though, for a few reasons:

  • ln will fail if another process creates that path first, so it won't modify any existing file
  • $$ suffers from the exact same "problem", so this is no worse
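A small sketch of the two behaviors this suggestion relies on, assuming plain `ln` and `mktemp` on `PATH`; the file names and the `LN_OVERWROTE`, `NAME_EXISTS`, and `LINKED` variables are made up for illustration.

```shell
#!/bin/sh
WORK=$(mktemp -d) || exit 1
echo "new" > "$WORK/src"
echo "old" > "$WORK/dst"

# ln without -f refuses to replace an existing path, so "dst" is untouched.
if ln "$WORK/src" "$WORK/dst" 2>/dev/null; then
  LN_OVERWROTE=yes
else
  LN_OVERWROTE=no
fi

# mktemp -u only prints an unused name; it creates nothing on disk.
CANDIDATE=$(mktemp -u "$WORK/dst.tmp.XXXXXX") || exit 1
if [ -e "$CANDIDATE" ]; then NAME_EXISTS=yes; else NAME_EXISTS=no; fi

# Because the name is free, the hardlink can be created there.
LINKED=no
ln "$WORK/src" "$CANDIDATE" && LINKED=yes

echo "ln overwrote existing file: $LN_OVERWROTE"
echo "mktemp -u created a file:   $NAME_EXISTS"
echo "hardlink at temp name:      $LINKED"
```

If another process did create `$CANDIDATE` in the window between the two commands, the final `ln` would fail in the same way the first one does, rather than modify that file.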

Are there any cases where the same pants process might try writing the mypy cache twice? If so, that'd be another reason not to use $$ for the hardlink.

Contributor Author

Yeah, I think mktemp --dry-run would be fine. I thought having two tmp paths might be confusing, and the uniqueness benefit over PIDs, which are already unique at execution time, seemed marginal, but it could go either way. (The mv on the next line needs to know the value too, so it can't just be inlined as in the literal suggestion.)

Are there any cases where the same pants process might try writing the mypy cache twice?

I think in this case the "process" in question would be the shell running this script (aka: __mypy_runner.sh).

Comment on lines 318 to 320
TMP_CACHE=$({mktemp.path} "$NAMED_CACHE_DB.tmp.XXXXXX")
{cp.path} "$SANDBOX_CACHE_DB" "$TMP_CACHE" > /dev/null 2>&1
{mv.path} "$TMP_CACHE" "$NAMED_CACHE_DB" > /dev/null 2>&1

It doesn't look like set -e is enabled. Should that be enabled; or should these three lines be joined by && operators? Otherwise $TMP_CACHE could be an empty string.

Example from the mktemp docs:

          $ file=$(mktemp -q) && {
          >   # Safe to use $file only within this block.
          >   echo ... > "$file"
          >   rm "$file"
          > }

Speaking of rm "$file", should there be cleanup of $TMP_CACHE if the mv fails?
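One possible shape for the `&&` chaining, as a sketch only: it assumes plain `mktemp`, `cp`, and `mv` on `PATH` instead of the templated `{cp.path}`-style binaries, uses made-up paths under a temp dir, and the cleanup step at the end is an illustrative option, not something the PR does.

```shell
#!/bin/sh
WORK=$(mktemp -d) || exit 1
SANDBOX_CACHE_DB="$WORK/sandbox.db"
NAMED_CACHE_DB="$WORK/named.db"
echo "cache" > "$SANDBOX_CACHE_DB"

# Each step runs only if the previous one succeeded, so an empty
# $TMP_CACHE from a failed mktemp is never passed to cp or mv.
TMP_CACHE=$(mktemp "$NAMED_CACHE_DB.tmp.XXXXXX") &&
  cp "$SANDBOX_CACHE_DB" "$TMP_CACHE" &&
  mv "$TMP_CACHE" "$NAMED_CACHE_DB"
STATUS=$?

# Optional cleanup (an assumption, not part of the PR): remove a
# leftover temp file if any step after mktemp failed.
if [ "$STATUS" -ne 0 ] && [ -n "$TMP_CACHE" ]; then
  rm -f "$TMP_CACHE"
fi

echo "copy-back status: $STATUS"
```

Capturing `$?` right after the chain keeps the failure handling in one place instead of adding a conditional per command.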

Contributor Author

hmmm, I think that's right.

I was trying not to add even more conditionals. Will think about approaches.

Contributor

Should I hold off on reviewing?

Thanks for the comments @jmoldow !

@cburroughs
Contributor Author

Noodled for a while, thanks for patience.

  • I found a satisfactory way to do mktemp -u, so I switched to that. I'm still doubtful this matters in practice vs $$, but there is no execution time downside and it is more consistent.
  • The chaining was a good catch and bug! Added.
  • This is a cache, so I'm okay with not adding complexity to rm a bad tmp file. This would be a lot more yarn to pull on, but I'd probably prefer much more verbose output before adding that.

@cburroughs cburroughs requested review from jmoldow and removed request for jmoldow February 18, 2026 21:59

3 participants