restore mypy cache.db copy-back atomicity by cburroughs · Pull Request #23081 · pantsbuild/pants

cburroughs · 2026-02-06T18:01:22Z

There is a long chain here where c45f855 introduced the cp-in/mv-out pattern, 86f17a3 added device id checking so that said operation was atomic, and then cf96911 simplified and improved portability. However, two issues are present or regressed after all that:

* The point of cp+mv is that mv is only atomic on the same file system. So it needs to be "cp from fs A to fs B, then mv on B". This had it reversed: "cp from fs A(sandbox) to fs A(sandbox), then mv to B(named_cache)" which isn't atomic.
* ln does not overwrite hardlinks, so `ln "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB"` could only succeed on the first run, which makes it a pretty niche optimization.
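
To make the second point concrete, here is a minimal sketch, assuming hypothetical scratch paths under a throwaway directory rather than the real sandbox and named cache. It only shows that a plain `ln` succeeds while the destination is absent and fails once it exists.

```shell
#!/bin/sh
# Hypothetical scratch files standing in for the sandbox and named cache DBs.
WORK=$(mktemp -d)
echo "first" > "$WORK/sandbox.db"
echo "second" > "$WORK/sandbox2.db"

# First run: the destination does not exist yet, so the hard link is created.
if ln "$WORK/sandbox.db" "$WORK/named.db" 2>/dev/null; then FIRST=ok; else FIRST=failed; fi

# Later run: the destination already exists, so plain ln refuses to replace it.
if ln "$WORK/sandbox2.db" "$WORK/named.db" 2>/dev/null; then SECOND=ok; else SECOND=failed; fi

echo "first=$FIRST second=$SECOND"
```

Linking to a fresh temporary name next to the destination and then renaming over it avoids the "destination exists" failure.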

This resolves both issues by always trying ln(sandbox->named)+mv first, and then falling back to cp(sandbox->named)+mv.

A few notes on alternatives or constraints:

* `ln -f` isn't atomic, so that won't help.
* `mktemp` makes a file, so we can't use that for the hardlink path since there would already be a file in the way.
* I think using the pid (`$$`) is fine since we are not adversarial and nothing portable is worth the extra complexity.
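
The overall shape described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `publish_cache` and the scratch paths are hypothetical names, and error output is simply discarded.

```shell
#!/bin/sh
# Sketch only: publish_cache and its arguments are hypothetical names.
publish_cache() {
    src=$1
    dst=$2
    # Temp name sits next to dst, so the final mv stays on one file system.
    tmp="$dst.$$.tmp"
    # Prefer a hard link (no data copy); fall back to a copy when linking fails,
    # for example because src and dst live on different file systems.
    if ln "$src" "$tmp" 2>/dev/null || cp "$src" "$tmp" 2>/dev/null; then
        mv "$tmp" "$dst"
    else
        return 1
    fi
}

WORK=$(mktemp -d)
mkdir "$WORK/sandbox" "$WORK/named"

# Two separate source files stand in for two separate sandbox runs.
echo "v1" > "$WORK/sandbox/run1.db"
echo "v2" > "$WORK/sandbox/run2.db"
publish_cache "$WORK/sandbox/run1.db" "$WORK/named/cache.db"
publish_cache "$WORK/sandbox/run2.db" "$WORK/named/cache.db"

RESULT=$(cat "$WORK/named/cache.db")
echo "named cache now holds: $RESULT"
```

The second call succeeds even though `cache.db` already exists, because the link or copy targets a new temporary name and only the rename touches the final path.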


jmoldow · 2026-02-06T19:22:07Z

src/python/pants/backend/python/typecheck/mypy/rules.py

                            if [ $EXIT_CODE -le 1 ]; then
-                                if ! {ln.path} "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB" > /dev/null 2>&1; then
-                                    TMP_CACHE=$({mktemp.path} "$SANDBOX_CACHE_DB.tmp.XXXXXX")
+                                if {ln.path} "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB.$$.tmp" > /dev/null 2>&1; then


> mktemp makes a file, so we can't use that for the hardlink path since there would already be a file in the way.

What if you used

Suggested change

-    if {ln.path} "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB.$$.tmp" > /dev/null 2>&1; then
+    if {ln.path} "$SANDBOX_CACHE_DB" $({mktemp.path} -u "$NAMED_CACHE_DB.tmp.XXXXXX") > /dev/null 2>&1; then

> ‘-u’ ‘--dry-run’ Generate a temporary name that does not name an existing file, without changing the file system contents.

The mktemp docs say this is "unsafe" because "Using the output of this command to create a new file is inherently unsafe, as there is a window of time between generating the name and using it where another process can create an object by the same name. [...] the attacker can create an appropriately named symbolic link, such that when the script then opens a handle to what it thought was an unused file, it is instead modifying an existing file". But that seems fine here for a few reasons:

* ln will fail if another process creates that path first, so it won't modify any existing file
* $$ suffers from the exact same "problem", so this is no worse

Are there any cases where the same pants process might try writing the mypy cache twice? If so, that'd be another reason not to use $$ for the hardlink.

Yeah, I think mktemp --dry-run would be fine. I thought having two tmp paths might be confusing, and the uniqueness benefit over PIDs, which are already unique at execution time, seemed marginal, but I could go either way. (The mv on the next line needs to know the value too, so it can't just be inlined as in the literal suggestion.)

> Are there any cases where the same pants process might try writing the mypy cache twice?

I think in this case the "process" in question would be the shell running this script (aka: __mypy_runner.sh).
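
A rough sketch of how the `mktemp -u` name could be captured once and reused by both `ln` and `mv`, as the parenthetical above requires. The paths and the `TMP_LINK` variable are hypothetical stand-ins, not names from the PR.

```shell
#!/bin/sh
# Hypothetical scratch files standing in for the sandbox and named cache DBs.
WORK=$(mktemp -d)
SANDBOX_DB="$WORK/sandbox.db"
NAMED_DB="$WORK/named.db"
echo "data" > "$SANDBOX_DB"

# -u only prints an unused name and creates nothing, so ln has no file in its way.
# Capturing the name in a variable lets the following mv refer to the same path.
TMP_LINK=$(mktemp -u "$NAMED_DB.tmp.XXXXXX") &&
    ln "$SANDBOX_DB" "$TMP_LINK" &&
    mv "$TMP_LINK" "$NAMED_DB"

echo "published: $(cat "$NAMED_DB")"
```

If another process did create the same name in the window between `mktemp -u` and `ln`, the `ln` would fail rather than overwrite, and the chain would stop there.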

jmoldow · 2026-02-06T19:29:40Z

src/python/pants/backend/python/typecheck/mypy/rules.py

+                                    TMP_CACHE=$({mktemp.path} "$NAMED_CACHE_DB.tmp.XXXXXX")
                                    {cp.path} "$SANDBOX_CACHE_DB" "$TMP_CACHE" > /dev/null 2>&1
                                    {mv.path} "$TMP_CACHE" "$NAMED_CACHE_DB" > /dev/null 2>&1


It doesn't look like set -e is enabled. Should it be enabled, or should these three lines be joined by && operators? Otherwise, if mktemp fails, $TMP_CACHE could be an empty string when cp and mv run.

Example from the mktemp docs:

    $ file=$(mktemp -q) && {
    >   # Safe to use $file only within this block.
    >   echo ... > "$file"
    >   rm "$file"
    > }

Speaking of rm "$file", should there be cleanup of $TMP_CACHE if the mv fails?
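
One possible shape for the chaining and cleanup being asked about, as a sketch under assumptions: the scratch paths and `STATUS` variable are hypothetical, and the cleanup branch is just one option rather than anything the PR adopts.

```shell
#!/bin/sh
# Hypothetical scratch files standing in for the sandbox and named cache DBs.
WORK=$(mktemp -d)
SANDBOX_DB="$WORK/sandbox.db"
NAMED_DB="$WORK/named.db"
echo "data" > "$SANDBOX_DB"

# Each step runs only if the previous one succeeded, so an empty
# $TMP_CACHE from a failed mktemp never reaches cp or mv.
if TMP_CACHE=$(mktemp "$NAMED_DB.tmp.XXXXXX") &&
    cp "$SANDBOX_DB" "$TMP_CACHE" &&
    mv "$TMP_CACHE" "$NAMED_DB"; then
    STATUS=0
else
    STATUS=1
    # Optional cleanup: remove a leftover temp file if one was created.
    [ -n "$TMP_CACHE" ] && rm -f "$TMP_CACHE"
fi

echo "status=$STATUS"
```

The cleanup costs one extra conditional, which is the trade-off against keeping the script simple for what is only a cache.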

hmmm, I think that's right.

I was trying not to add even more conditionals. Will think about approaches.

Should I hold off on reviewing?

Thanks for the comments @jmoldow !

cburroughs · 2026-02-14T03:07:38Z

Noodled for a while, thanks for patience.

* I found a satisfactory way to use mktemp -u, so I switched to that. I'm still doubtful this matters in practice vs $$, but there is no execution-time downside and it is more consistent.
* The chaining was a good catch and a bug! Added.
* This is a cache, so I'm okay with not adding complexity to rm a bad tmp file. This would be a lot more yarn to pull on, but I'd probably prefer much more verbose output before adding that.

cburroughs marked this pull request as ready for review February 6, 2026 19:22

cburroughs requested review from benjyw, sureshjoshi and thejcannon February 6, 2026 19:22

cburroughs self-assigned this Feb 6, 2026

jmoldow reviewed Feb 6, 2026


cburroughs added 2 commits February 13, 2026 21:49

review feedback

b9f968b

Merge remote-tracking branch 'upstream/main' into csb/more-atomic-power

f25d239

cburroughs requested review from jmoldow and removed request for jmoldow February 18, 2026 21:59

cburroughs wants to merge 3 commits into pantsbuild:main from cburroughs:csb/more-atomic-power
