Skip to content

Commit bd45e3c

Browse files
committed
[8.7.0] Add Bazel support for --rewind_lost_inputs (bazelbuild#25477)
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
1 parent 2a41e8a commit bd45e3c

16 files changed

+613
-145
lines changed

src/main/java/com/google/devtools/build/lib/remote/AbstractActionInputPrefetcher.java

Lines changed: 20 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -83,7 +83,7 @@ public abstract class AbstractActionInputPrefetcher implements ActionInputPrefet
8383
protected final Path execRoot;
8484
protected final RemoteOutputChecker remoteOutputChecker;
8585

86-
private final ActionOutputDirectoryHelper outputDirectoryHelper;
86+
protected final ActionOutputDirectoryHelper outputDirectoryHelper;
8787

8888
/** The state of a directory tracked by {@link DirectoryTracker}, as explained below. */
8989
enum DirectoryState {
@@ -129,8 +129,9 @@ private void setWritable(Path dir, DirectoryState newState) throws IOException {
129129
directoryStateMap.compute(
130130
dir,
131131
(unusedKey, oldState) -> {
132-
if (oldState == DirectoryState.TEMPORARILY_WRITABLE
133-
|| oldState == DirectoryState.PERMANENTLY_WRITABLE) {
132+
if (!forceRefetch(dir)
133+
&& (oldState == DirectoryState.TEMPORARILY_WRITABLE
134+
|| oldState == DirectoryState.PERMANENTLY_WRITABLE)) {
134135
// Already writable, but must potentially upgrade from temporary to permanent.
135136
return newState == DirectoryState.PERMANENTLY_WRITABLE ? newState : oldState;
136137
}
@@ -162,8 +163,9 @@ void setOutputPermissions(Path dir) throws IOException {
162163
directoryStateMap.compute(
163164
dir,
164165
(unusedKey, oldState) -> {
165-
if (oldState == DirectoryState.OUTPUT_PERMISSIONS
166-
|| oldState == DirectoryState.PERMANENTLY_WRITABLE) {
166+
if (!forceRefetch(dir)
167+
&& (oldState == DirectoryState.OUTPUT_PERMISSIONS
168+
|| oldState == DirectoryState.PERMANENTLY_WRITABLE)) {
167169
// Either the output permissions have already been set, or we're not changing the
168170
// permissions ever again.
169171
return oldState;
@@ -240,6 +242,12 @@ private boolean shouldDownloadFile(Path path, FileArtifactValue metadata)
240242

241243
protected abstract boolean canDownloadFile(Path path, FileArtifactValue metadata);
242244

245+
/**
246+
* If true, then all previously acquired knowledge of the file system state of this path (e.g. the
247+
* existence of tree artifact directories or previously downloaded files) must be discarded.
248+
*/
249+
protected abstract boolean forceRefetch(Path path);
250+
243251
/**
244252
* Downloads file to the given path via its metadata.
245253
*
@@ -565,15 +573,16 @@ private Completable downloadFileNoCheckRx(
565573
return Completable.error(new CacheNotFoundException(digest, execPath));
566574
}));
567575

568-
return downloadCache.executeIfNot(
576+
return downloadCache.execute(
569577
finalPath,
570578
Completable.defer(
571579
() -> {
572580
if (shouldDownloadFile(finalPath, metadata)) {
573581
return download;
574582
}
575583
return Completable.complete();
576-
}));
584+
}),
585+
forceRefetch(finalPath));
577586
}
578587

579588
private void finalizeDownload(
@@ -646,7 +655,7 @@ private void deletePartialDownload(Path path) {
646655
}
647656

648657
private Completable plantSymlink(Symlink symlink) {
649-
return downloadCache.executeIfNot(
658+
return downloadCache.execute(
650659
execRoot.getRelative(symlink.linkExecPath()),
651660
Completable.defer(
652661
() -> {
@@ -657,7 +666,8 @@ private Completable plantSymlink(Symlink symlink) {
657666
link.delete();
658667
link.createSymbolicLink(target);
659668
return Completable.complete();
660-
}));
669+
}),
670+
forceRefetch(execRoot.getRelative(symlink.linkExecPath())));
661671
}
662672

663673
public ImmutableSet<Path> downloadedFiles() {
@@ -694,7 +704,7 @@ public void finalizeAction(Action action, OutputMetadataStore outputMetadataStor
694704
}
695705

696706
var metadata = outputMetadataStore.getOutputMetadata(output);
697-
if (!metadata.isRemote()) {
707+
if (!canDownloadFile(output.getPath(), metadata)) {
698708
continue;
699709
}
700710

src/main/java/com/google/devtools/build/lib/remote/BUILD

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@ java_library(
2828
srcs = glob(
2929
["*.java"],
3030
exclude = [
31+
"ConcurrentArtifactPathTrie.java",
3132
"ExecutionStatusException.java",
3233
"ReferenceCountedChannel.java",
3334
"ChannelConnectionWithServerCapabilitiesFactory.java",
@@ -51,6 +52,7 @@ java_library(
5152
":ReferenceCountedChannel",
5253
":Retrier",
5354
":abstract_action_input_prefetcher",
55+
":concurrent_artifact_path_trie",
5456
":lease_service",
5557
":remote_output_checker",
5658
":scrubber",
@@ -208,6 +210,7 @@ java_library(
208210
name = "remote_output_checker",
209211
srcs = ["RemoteOutputChecker.java"],
210212
deps = [
213+
":concurrent_artifact_path_trie",
211214
"//src/main/java/com/google/devtools/build/lib/actions:artifacts",
212215
"//src/main/java/com/google/devtools/build/lib/actions:file_metadata",
213216
"//src/main/java/com/google/devtools/build/lib/analysis:analysis_cluster",
@@ -281,6 +284,16 @@ java_library(
281284
],
282285
)
283286

287+
java_library(
288+
name = "concurrent_artifact_path_trie",
289+
srcs = ["ConcurrentArtifactPathTrie.java"],
290+
deps = [
291+
"//src/main/java/com/google/devtools/build/lib/actions:artifacts",
292+
"//src/main/java/com/google/devtools/build/lib/vfs:pathfragment",
293+
"//third_party:guava",
294+
],
295+
)
296+
284297
java_library(
285298
name = "store",
286299
srcs = ["Store.java"],
Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,57 @@
1+
// Copyright 2026 The Bazel Authors. All rights reserved.
2+
//
3+
// Licensed under the Apache License, Version 2.0 (the "License");
4+
// you may not use this file except in compliance with the License.
5+
// You may obtain a copy of the License at
6+
//
7+
// http://www.apache.org/licenses/LICENSE-2.0
8+
//
9+
// Unless required by applicable law or agreed to in writing, software
10+
// distributed under the License is distributed on an "AS IS" BASIS,
11+
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
// See the License for the specific language governing permissions and
13+
// limitations under the License.
14+
package com.google.devtools.build.lib.remote;
15+
16+
import com.google.common.base.Preconditions;
17+
import com.google.devtools.build.lib.actions.ActionInput;
18+
import com.google.devtools.build.lib.actions.Artifact;
19+
import com.google.devtools.build.lib.vfs.PathFragment;
20+
import java.util.concurrent.ConcurrentSkipListSet;
21+
22+
/**
23+
* A specialized concurrent trie that stores paths of artifacts and allows checking whether a given
24+
* path is contained in (in the case of a tree artifact) or exactly matches (in any other case) an
25+
* artifact in the trie.
26+
*/
27+
final class ConcurrentArtifactPathTrie {
28+
// Invariant: no path in this set is a prefix of another path.
29+
private final ConcurrentSkipListSet<PathFragment> paths =
30+
new ConcurrentSkipListSet<>(PathFragment.HIERARCHICAL_COMPARATOR);
31+
32+
/**
33+
* Adds the given {@link ActionInput} to the trie.
34+
*
35+
* <p>The caller must ensure that no object's path passed to this method is a prefix of any
36+
* previously added object's path. Bazel enforces this for non-aggregate artifacts. Callers must
37+
* not pass in {@link Artifact.TreeFileArtifact}s (which have exec paths that have their parent
38+
* tree artifact's exec path as a prefix) or non-Artifact {@link ActionInput}s that violate this
39+
* invariant.
40+
*/
41+
void add(ActionInput input) {
42+
Preconditions.checkArgument(
43+
!(input instanceof Artifact.TreeFileArtifact),
44+
"TreeFileArtifacts should not be added to the trie: %s",
45+
input);
46+
paths.add(input.getExecPath());
47+
}
48+
49+
/** Checks whether the given {@link PathFragment} is contained in an artifact in the trie. */
50+
boolean contains(PathFragment execPath) {
51+
// By the invariant of this set, there is at most one prefix of execPath in the set. Since the
52+
// comparator sorts all children of a path right after the path itself, if such a prefix
53+
// exists, it must thus sort right before execPath (or be equal to it).
54+
var floorPath = paths.floor(execPath);
55+
return floorPath != null && execPath.startsWith(floorPath);
56+
}
57+
}

src/main/java/com/google/devtools/build/lib/remote/RemoteActionInputFetcher.java

Lines changed: 35 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -13,14 +13,14 @@
1313
// limitations under the License.
1414
package com.google.devtools.build.lib.remote;
1515

16-
import static com.google.common.base.Preconditions.checkArgument;
1716

1817
import build.bazel.remote.execution.v2.Digest;
1918
import build.bazel.remote.execution.v2.RequestMetadata;
2019
import com.google.common.base.Preconditions;
2120
import com.google.common.util.concurrent.ListenableFuture;
2221
import com.google.devtools.build.lib.actions.ActionExecutionMetadata;
2322
import com.google.devtools.build.lib.actions.ActionOutputDirectoryHelper;
23+
import com.google.devtools.build.lib.actions.Artifact;
2424
import com.google.devtools.build.lib.actions.FileArtifactValue;
2525
import com.google.devtools.build.lib.actions.cache.VirtualActionInput;
2626
import com.google.devtools.build.lib.events.Reporter;
@@ -31,7 +31,10 @@
3131
import com.google.devtools.build.lib.vfs.OutputPermissions;
3232
import com.google.devtools.build.lib.vfs.Path;
3333
import com.google.devtools.build.lib.vfs.PathFragment;
34+
import com.google.devtools.build.lib.vfs.Symlinks;
3435
import java.io.IOException;
36+
import java.util.Collection;
37+
import javax.annotation.Nullable;
3538

3639
/**
3740
* Stages output files that are stored remotely to the local filesystem.
@@ -44,6 +47,7 @@ public class RemoteActionInputFetcher extends AbstractActionInputPrefetcher {
4447
private final String buildRequestId;
4548
private final String commandId;
4649
private final CombinedCache combinedCache;
50+
private final ConcurrentArtifactPathTrie rewoundActionOutputs = new ConcurrentArtifactPathTrie();
4751

4852
RemoteActionInputFetcher(
4953
Reporter reporter,
@@ -79,7 +83,18 @@ protected void prefetchVirtualActionInput(VirtualActionInput input) throws IOExc
7983

8084
@Override
8185
protected boolean canDownloadFile(Path path, FileArtifactValue metadata) {
82-
return metadata.isRemote();
86+
// When action rewinding is enabled, an action that had remote metadata at some point during the
87+
// build may have been re-executed locally to regenerate lost inputs, but may then be rewound
88+
// again and thus have its (now local) outputs deleted. In this case, we need to download the
89+
// outputs again, even if they are now considered local.
90+
return metadata.isRemote() || (forceRefetch(path) && !path.exists(Symlinks.NOFOLLOW));
91+
}
92+
93+
@Override
94+
protected boolean forceRefetch(Path path) {
95+
// Caches for download operations and output directory creation need to be disregarded for the
96+
// outputs of rewound actions as they may have been deleted after they were first created.
97+
return path.startsWith(execRoot) && rewoundActionOutputs.contains(path.relativeTo(execRoot));
8398
}
8499

85100
@Override
@@ -92,7 +107,6 @@ protected ListenableFuture<Void> doDownloadFile(
92107
Priority priority,
93108
Reason reason)
94109
throws IOException {
95-
checkArgument(metadata.isRemote(), "Cannot download file that is not a remote file.");
96110
RequestMetadata requestMetadata =
97111
TracingMetadataUtils.buildMetadata(
98112
buildRequestId,
@@ -117,4 +131,22 @@ protected ListenableFuture<Void> doDownloadFile(
117131
execPath.toString(),
118132
digest.getSizeBytes()));
119133
}
134+
135+
public void handleRewoundActionOutputs(Collection<Artifact> outputs) {
136+
// SkyframeActionExecutor#prepareForRewinding does *not* call this method because the
137+
// RemoteActionFileSystem corresponds to an ActionFileSystemType with inMemoryFileSystem() ==
138+
// true. While it is true that resetting outputDirectoryHelper isn't necessary to undo the
139+
// caching of output directory creation during action preparation, we still need to reset here
140+
// since outputDirectoryHelper is also used by AbstractActionInputPrefetcher.
141+
outputDirectoryHelper.invalidateTreeArtifactDirectoryCreation(outputs);
142+
for (Artifact output : outputs) {
143+
// Action templates have TreeFileArtifacts as outputs, which isn't supported by the trie. We
144+
// only need to track the tree artifacts themselves.
145+
if (output instanceof Artifact.TreeFileArtifact) {
146+
rewoundActionOutputs.add(output.getParent());
147+
} else {
148+
rewoundActionOutputs.add(output);
149+
}
150+
}
151+
}
120152
}

src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java

Lines changed: 32 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -62,6 +62,8 @@
6262
import com.google.common.eventbus.Subscribe;
6363
import com.google.common.util.concurrent.Futures;
6464
import com.google.common.util.concurrent.ListenableFuture;
65+
import com.google.common.util.concurrent.ListeningExecutorService;
66+
import com.google.common.util.concurrent.MoreExecutors;
6567
import com.google.common.util.concurrent.SettableFuture;
6668
import com.google.devtools.build.lib.actions.ActionInput;
6769
import com.google.devtools.build.lib.actions.Artifact;
@@ -156,8 +158,8 @@
156158
import java.util.concurrent.CompletableFuture;
157159
import java.util.concurrent.CompletionException;
158160
import java.util.concurrent.ConcurrentLinkedQueue;
161+
import java.util.concurrent.CountDownLatch;
159162
import java.util.concurrent.ExecutionException;
160-
import java.util.concurrent.ExecutorService;
161163
import java.util.concurrent.Executors;
162164
import java.util.concurrent.Phaser;
163165
import java.util.concurrent.Semaphore;
@@ -194,9 +196,10 @@ public class RemoteExecutionService {
194196
private final Set<String> reportedErrors = new HashSet<>();
195197

196198
@SuppressWarnings("AllowVirtualThreads")
197-
private final ExecutorService backgroundTaskExecutor =
198-
Executors.newThreadPerTaskExecutor(
199-
Thread.ofVirtual().name("remote-execution-bg-", 0).factory());
199+
private final ListeningExecutorService backgroundTaskExecutor =
200+
MoreExecutors.listeningDecorator(
201+
Executors.newThreadPerTaskExecutor(
202+
Thread.ofVirtual().name("remote-execution-bg-", 0).factory()));
200203

201204
private final AtomicBoolean shutdown = new AtomicBoolean(false);
202205
private final AtomicBoolean buildInterrupted = new AtomicBoolean(false);
@@ -1873,16 +1876,31 @@ public void uploadOutputs(
18731876

18741877
if (remoteOptions.remoteCacheAsync
18751878
&& !action.getSpawn().getResourceOwner().mayModifySpawnOutputsAfterExecution()) {
1876-
backgroundTaskExecutor.execute(
1877-
() -> {
1878-
try {
1879-
doUploadOutputs(action, spawnResult, onUploadComplete);
1880-
} catch (ExecException e) {
1881-
reportUploadError(e);
1882-
} catch (InterruptedException ignored) {
1883-
// ThreadPerTaskExecutor does not care about interrupt status.
1884-
}
1885-
});
1879+
var uploadDone = new CountDownLatch(1);
1880+
var future =
1881+
backgroundTaskExecutor.submit(
1882+
() -> {
1883+
try {
1884+
doUploadOutputs(action, spawnResult, onUploadComplete);
1885+
} catch (ExecException e) {
1886+
reportUploadError(e);
1887+
} catch (InterruptedException ignored) {
1888+
// ThreadPerTaskExecutor does not care about interrupt status.
1889+
} finally {
1890+
uploadDone.countDown();
1891+
}
1892+
});
1893+
1894+
if (outputService instanceof RemoteOutputService remoteOutputService
1895+
&& remoteOutputService.getRewoundActionSynchronizer()
1896+
instanceof RemoteRewoundActionSynchronizer remoteRewoundActionSynchronizer) {
1897+
remoteRewoundActionSynchronizer.registerOutputUploadTask(
1898+
action.getRemoteActionExecutionContext().getSpawnOwner(),
1899+
() -> {
1900+
future.cancel(true);
1901+
uploadDone.await();
1902+
});
1903+
}
18861904
} else {
18871905
doUploadOutputs(action, spawnResult, onUploadComplete);
18881906
}

src/main/java/com/google/devtools/build/lib/remote/RemoteModule.java

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -504,7 +504,10 @@ public void beforeCommand(CommandEnvironment env) throws AbruptExitException {
504504
bazelOutputServiceChannel,
505505
lastBuildId);
506506
} else {
507-
outputService = new RemoteOutputService(env.getDirectories());
507+
outputService =
508+
new RemoteOutputService(
509+
env.getDirectories(),
510+
buildRequestOptions != null && buildRequestOptions.rewindLostInputs);
508511
}
509512

510513
if ((enableHttpCache || enableDiskCache) && !enableGrpcCache) {

0 commit comments

Comments
 (0)