restore mypy cache.db copy-back atomicity#23081
restore mypy cache.db copy-back atomicity#23081cburroughs wants to merge 3 commits intopantsbuild:mainfrom
Conversation
There is a long chain here where c45f855 introduced the cp-in/mv-out pattern, 86f17a3 added device id checking so that said operation was atomic, and then cf96911 simplified and improved portability. However, two issues are present or regressed after all that: * The point of cp+mv is that mv is only atomic on the same file system. So it needs to be "cp from fs A to fs B, then mv on B". This had it reversed: "cp from fs A(sandbox) to fs A(sandbox), then mv to B(named_cache)" which isn't atomic. * ln does not overwrite hardlinks, so `ln "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB"` could only succeed on the first run, which makes it a pretty niche optimization. This resolves both issues by always trying ln(sandbox->named)+mv first, and then falling back to cp(sandbox->named)+mv. A few notes on alternatives or constraints: * `ln -f` isn't atomic, so that won't help. * `mktemp` makes a file, so we can't use that for the hardlink path since there would already be a file in the way. * I think using the pid (`$$`) is fine since we are not adversarial and nothing protable is worth the extra complexity.
| if [ $EXIT_CODE -le 1 ]; then | ||
| if ! {ln.path} "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB" > /dev/null 2>&1; then | ||
| TMP_CACHE=$({mktemp.path} "$SANDBOX_CACHE_DB.tmp.XXXXXX") | ||
| if {ln.path} "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB.$$.tmp" > /dev/null 2>&1; then |
There was a problem hiding this comment.
mktempmakes a file, so we can't use that for the hardlink path since there would already be a file in the way.
What if you used
| if {ln.path} "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB.$$.tmp" > /dev/null 2>&1; then | |
| if {ln.path} "$SANDBOX_CACHE_DB" $({mktemp.path} -u "$NAMED_CACHE_DB.tmp.XXXXXX") > /dev/null 2>&1; then |
‘-u’
‘--dry-run’
Generate a temporary name that does not name an existing file,
without changing the file system contents.
The mktemp docs say this is "unsafe", because "Using the output of this command to create a new file is inherently unsafe, as there is a window of time between generating the name and using it where another process can create an object by the same name. [...] the attacker can create an appropriately named symbolic link, such that when the script then opens a handle to what it thought was an unused file, it is instead modifying an existing file" But, that seems fine here for a few reasons:
lnwill fail if another process creates that path first, so it won't modify any existing file$$suffers from the exact same "problem", so this is no worse
There was a problem hiding this comment.
Are there any cases where the same pants process might try writing the mypy cache twice? If so, that'd be another reason not to use $$ for the hardlink.
There was a problem hiding this comment.
Yeah I think mktemp --dry-run would be fine. I thought having two tmp paths might be confusing and the uniqueness benefits for already unique at execution time PIDs was marginal, but could go either way. (The mv on the next line needs to know the value too, so it can't just be inline like in the literal suggestion.)
Are there any cases where the same pants process might try writing the mypy cache twice?
I think in this case the "process" in question would be the shell running this script (aka: __mypy_runner.sh).
| TMP_CACHE=$({mktemp.path} "$NAMED_CACHE_DB.tmp.XXXXXX") | ||
| {cp.path} "$SANDBOX_CACHE_DB" "$TMP_CACHE" > /dev/null 2>&1 | ||
| {mv.path} "$TMP_CACHE" "$NAMED_CACHE_DB" > /dev/null 2>&1 |
There was a problem hiding this comment.
It doesn't look like set -e is enabled. Should that be enabled; or should these three lines be joined by && operators? Otherwise $TMP_CACHE could be an empty string.
Example from the mktemp docs:
$ file=$(mktemp -q) && {
> # Safe to use $file only within this block.
> echo ... > "$file"
> rm "$file"
> }Speaking of rm "$file", should there be cleanup of $TMP_CACHE if the mv fails?
There was a problem hiding this comment.
hmmm, I think that's right.
I was trying not to add even more conditionals. Will think about approaches.
There was a problem hiding this comment.
Should I hold off on reviewing?
Thanks for the comments @jmoldow !
|
Noodled for a while, thanks for patience.
|
There is a long chain here where c45f855 introduced the cp-in/mv-out pattern, 86f17a3 added device id checking so that said operation was atomic, and then cf96911 simplified and improved portability. However, two issues are present or regressed after all that:
ln "$SANDBOX_CACHE_DB" "$NAMED_CACHE_DB"could only succeed on the first run, which makes it a pretty niche optimization.This resolves both issues by always trying ln(sandbox->named)+mv first, and then falling back to cp(sandbox->named)+mv.
A few notes on alternatives or constraints:
ln -fisn't atomic, so that won't help.mktempmakes a file, so we can't use that for the hardlink path since there would already be a file in the way.$$) is fine since we are not adversarial and nothing portable is worth the extra complexity.