Skip to content

Commit 043100c

Browse files
committed
Add Bazel support for --rewind_lost_inputs (bazelbuild#25477)
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
1 parent 8e677ae commit 043100c

18 files changed

+646
-146
lines changed

src/main/java/com/google/devtools/build/lib/remote/AbstractActionInputPrefetcher.java

Lines changed: 20 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -78,7 +78,7 @@ public abstract class AbstractActionInputPrefetcher implements ActionInputPrefet
7878
protected final Path execRoot;
7979
protected final RemoteOutputChecker remoteOutputChecker;
8080

81-
@Nullable private final ActionOutputDirectoryHelper outputDirectoryHelper;
81+
@Nullable protected final ActionOutputDirectoryHelper outputDirectoryHelper;
8282

8383
/** The state of a directory tracked by {@link DirectoryTracker}, as explained below. */
8484
enum DirectoryState {
@@ -140,8 +140,9 @@ private void setWritable(Path dir, DirectoryState newState) throws IOException {
140140
directoryStateMap.compute(
141141
dir,
142142
(unusedKey, oldState) -> {
143-
if (oldState == DirectoryState.TEMPORARILY_WRITABLE
144-
|| oldState == DirectoryState.PERMANENTLY_WRITABLE) {
143+
if (!forceRefetch(dir)
144+
&& (oldState == DirectoryState.TEMPORARILY_WRITABLE
145+
|| oldState == DirectoryState.PERMANENTLY_WRITABLE)) {
145146
// Already writable, but must potentially upgrade from temporary to permanent.
146147
return newState == DirectoryState.PERMANENTLY_WRITABLE ? newState : oldState;
147148
}
@@ -177,8 +178,9 @@ void setOutputPermissions(Path dir) throws IOException {
177178
directoryStateMap.compute(
178179
dir,
179180
(unusedKey, oldState) -> {
180-
if (oldState == DirectoryState.OUTPUT_PERMISSIONS
181-
|| oldState == DirectoryState.PERMANENTLY_WRITABLE) {
181+
if (!forceRefetch(dir)
182+
&& (oldState == DirectoryState.OUTPUT_PERMISSIONS
183+
|| oldState == DirectoryState.PERMANENTLY_WRITABLE)) {
182184
// Either the output permissions have already been set, or we're not changing the
183185
// permissions ever again.
184186
return oldState;
@@ -256,6 +258,12 @@ private static boolean shouldDownloadFile(Path path, FileArtifactValue metadata)
256258

257259
protected abstract boolean canDownloadFile(Path path, FileArtifactValue metadata);
258260

261+
/**
262+
* If true, then all previously acquired knowledge of the file system state of this path (e.g. the
263+
* existence of tree artifact directories or previously downloaded files) must be discarded.
264+
*/
265+
protected abstract boolean forceRefetch(Path path);
266+
259267
/**
260268
* Downloads file to the given path via its metadata.
261269
*
@@ -605,15 +613,16 @@ private Completable downloadFileNoCheckRx(
605613
alreadyDeleted.set(true);
606614
}));
607615

608-
return downloadCache.executeIfNot(
616+
return downloadCache.execute(
609617
finalPath,
610618
Completable.defer(
611619
() -> {
612620
if (shouldDownloadFile(finalPath, metadata)) {
613621
return download;
614622
}
615623
return Completable.complete();
616-
}));
624+
}),
625+
forceRefetch(finalPath));
617626
}
618627

619628
private void finalizeDownload(
@@ -690,7 +699,7 @@ private static void deletePartialDownload(Path path) {
690699
}
691700

692701
private Completable plantSymlink(Symlink symlink) {
693-
return downloadCache.executeIfNot(
702+
return downloadCache.execute(
694703
symlink.linkPath(),
695704
Completable.defer(
696705
() -> {
@@ -699,7 +708,8 @@ private Completable plantSymlink(Symlink symlink) {
699708
symlink.linkPath().delete();
700709
symlink.linkPath().createSymbolicLink(symlink.targetPath());
701710
return Completable.complete();
702-
}));
711+
}),
712+
forceRefetch(symlink.linkPath));
703713
}
704714

705715
public ImmutableSet<Path> downloadedFiles() {
@@ -736,7 +746,7 @@ public void finalizeAction(Action action, OutputMetadataStore outputMetadataStor
736746
}
737747

738748
var metadata = outputMetadataStore.getOutputMetadata(output);
739-
if (!metadata.isRemote()) {
749+
if (!canDownloadFile(output.getPath(), metadata)) {
740750
continue;
741751
}
742752

src/main/java/com/google/devtools/build/lib/remote/BUILD

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -212,6 +212,7 @@ java_library(
212212
name = "remote_output_checker",
213213
srcs = ["RemoteOutputChecker.java"],
214214
deps = [
215+
":concurrent_artifact_path_trie",
215216
"//src/main/java/com/google/devtools/build/lib/actions:artifacts",
216217
"//src/main/java/com/google/devtools/build/lib/actions:file_metadata",
217218
"//src/main/java/com/google/devtools/build/lib/analysis:analysis_cluster",
@@ -283,6 +284,32 @@ java_library(
283284
],
284285
)
285286

287+
java_library(
288+
name = "concurrent_artifact_path_trie",
289+
srcs = ["ConcurrentArtifactPathTrie.java"],
290+
deps = [
291+
"//src/main/java/com/google/devtools/build/lib/actions:artifacts",
292+
"//src/main/java/com/google/devtools/build/lib/vfs:pathfragment",
293+
"//third_party:guava",
294+
],
295+
)
296+
297+
java_library(
298+
name = "remote_rewound_action_synchronizer",
299+
srcs = ["RemoteRewoundActionSynchronizer.java"],
300+
deps = [
301+
":remote_action_input_fetcher",
302+
"//src/main/java/com/google/devtools/build/lib/actions",
303+
"//src/main/java/com/google/devtools/build/lib/actions:action_lookup_data",
304+
"//src/main/java/com/google/devtools/build/lib/actions:artifacts",
305+
"//src/main/java/com/google/devtools/build/lib/profiler",
306+
"//src/main/java/com/google/devtools/build/lib/vfs:output_service",
307+
"//third_party:caffeine",
308+
"//third_party:guava",
309+
"//third_party:jsr305",
310+
],
311+
)
312+
286313
java_library(
287314
name = "store",
288315
srcs = ["Store.java"],
@@ -293,13 +320,16 @@ java_library(
293320
srcs = ["RemoteImportantOutputHandler.java"],
294321
deps = [
295322
":remote_output_checker",
323+
":remote_rewound_action_synchronizer",
296324
"//src/main/java/com/google/devtools/build/lib/actions",
297325
"//src/main/java/com/google/devtools/build/lib/actions:artifacts",
298326
"//src/main/java/com/google/devtools/build/lib/actions:file_metadata",
299327
"//src/main/java/com/google/devtools/build/lib/actions:important_output_handler",
328+
"//src/main/java/com/google/devtools/build/lib/profiler",
300329
"//src/main/java/com/google/devtools/build/lib/remote/common:bulk_transfer_exception",
301330
"//src/main/java/com/google/devtools/build/lib/remote/util",
302331
"//src/main/java/com/google/devtools/build/lib/vfs",
332+
"//src/main/java/com/google/devtools/build/lib/vfs:output_service",
303333
"//src/main/java/com/google/devtools/build/lib/vfs:pathfragment",
304334
"//src/main/java/com/google/devtools/build/skyframe",
305335
"//src/main/protobuf:failure_details_java_proto",
Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,57 @@
1+
// Copyright 2026 The Bazel Authors. All rights reserved.
2+
//
3+
// Licensed under the Apache License, Version 2.0 (the "License");
4+
// you may not use this file except in compliance with the License.
5+
// You may obtain a copy of the License at
6+
//
7+
// http://www.apache.org/licenses/LICENSE-2.0
8+
//
9+
// Unless required by applicable law or agreed to in writing, software
10+
// distributed under the License is distributed on an "AS IS" BASIS,
11+
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
// See the License for the specific language governing permissions and
13+
// limitations under the License.
14+
package com.google.devtools.build.lib.remote;
15+
16+
import com.google.common.base.Preconditions;
17+
import com.google.devtools.build.lib.actions.ActionInput;
18+
import com.google.devtools.build.lib.actions.Artifact;
19+
import com.google.devtools.build.lib.vfs.PathFragment;
20+
import java.util.concurrent.ConcurrentSkipListSet;
21+
22+
/**
23+
* A specialized concurrent trie that stores paths of artifacts and allows checking whether a given
24+
* path is contained in (in the case of a tree artifact) or exactly matches (in any other case) an
25+
* artifact in the trie.
26+
*/
27+
final class ConcurrentArtifactPathTrie {
28+
// Invariant: no path in this set is a prefix of another path.
29+
private final ConcurrentSkipListSet<PathFragment> paths =
30+
new ConcurrentSkipListSet<>(PathFragment.HIERARCHICAL_COMPARATOR);
31+
32+
/**
33+
* Adds the given {@link ActionInput} to the trie.
34+
*
35+
* <p>The caller must ensure that no object's path passed to this method is a prefix of any
36+
* previously added object's path. Bazel enforces this for non-aggregate artifacts. Callers must
37+
* not pass in {@link Artifact.TreeFileArtifact}s (which have exec paths that have their parent
38+
* tree artifact's exec path as a prefix) or non-Artifact {@link ActionInput}s that violate this
39+
* invariant.
40+
*/
41+
void add(ActionInput input) {
42+
Preconditions.checkArgument(
43+
!(input instanceof Artifact.TreeFileArtifact),
44+
"TreeFileArtifacts should not be added to the trie: %s",
45+
input);
46+
paths.add(input.getExecPath());
47+
}
48+
49+
/** Checks whether the given {@link PathFragment} is contained in an artifact in the trie. */
50+
boolean contains(PathFragment execPath) {
51+
// By the invariant of this set, there is at most one prefix of execPath in the set. Since the
52+
// comparator sorts all children of a path right after the path itself, if such a prefix
53+
// exists, it must thus sort right before execPath (or be equal to it).
54+
var floorPath = paths.floor(execPath);
55+
return floorPath != null && execPath.startsWith(floorPath);
56+
}
57+
}

src/main/java/com/google/devtools/build/lib/remote/RemoteActionInputFetcher.java

Lines changed: 34 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,6 @@
1313
// limitations under the License.
1414
package com.google.devtools.build.lib.remote;
1515

16-
import static com.google.common.base.Preconditions.checkArgument;
1716
import static com.google.common.util.concurrent.Futures.immediateFailedFuture;
1817
import static com.google.common.util.concurrent.MoreExecutors.directExecutor;
1918

@@ -25,6 +24,7 @@
2524
import com.google.devtools.build.lib.actions.ActionExecutionMetadata;
2625
import com.google.devtools.build.lib.actions.ActionInput;
2726
import com.google.devtools.build.lib.actions.ActionOutputDirectoryHelper;
27+
import com.google.devtools.build.lib.actions.Artifact;
2828
import com.google.devtools.build.lib.actions.FileArtifactValue;
2929
import com.google.devtools.build.lib.actions.cache.VirtualActionInput;
3030
import com.google.devtools.build.lib.events.Reporter;
@@ -35,7 +35,9 @@
3535
import com.google.devtools.build.lib.util.TempPathGenerator;
3636
import com.google.devtools.build.lib.vfs.OutputPermissions;
3737
import com.google.devtools.build.lib.vfs.Path;
38+
import com.google.devtools.build.lib.vfs.Symlinks;
3839
import java.io.IOException;
40+
import java.util.Collection;
3941
import javax.annotation.Nullable;
4042

4143
/**
@@ -50,6 +52,7 @@ public class RemoteActionInputFetcher extends AbstractActionInputPrefetcher {
5052
private final String buildRequestId;
5153
private final String commandId;
5254
private final CombinedCache combinedCache;
55+
private final ConcurrentArtifactPathTrie rewoundActionOutputs = new ConcurrentArtifactPathTrie();
5356

5457
RemoteActionInputFetcher(
5558
Reporter reporter,
@@ -80,7 +83,18 @@ protected void prefetchVirtualActionInput(VirtualActionInput input) throws IOExc
8083

8184
@Override
8285
protected boolean canDownloadFile(Path path, FileArtifactValue metadata) {
83-
return metadata.isRemote();
86+
// When action rewinding is enabled, an action that had remote metadata at some point during the
87+
// build may have been re-executed locally to regenerate lost inputs, but may then be rewound
88+
// again and thus have its (now local) outputs deleted. In this case, we need to download the
89+
// outputs again, even if they are now considered local.
90+
return metadata.isRemote() || (forceRefetch(path) && !path.exists(Symlinks.NOFOLLOW));
91+
}
92+
93+
@Override
94+
protected boolean forceRefetch(Path path) {
95+
// Caches for download operations and output directory creation need to be disregarded for the
96+
// outputs of rewound actions as they may have been deleted after they were first created.
97+
return path.startsWith(execRoot) && rewoundActionOutputs.contains(path.relativeTo(execRoot));
8498
}
8599

86100
@Override
@@ -93,7 +107,6 @@ protected ListenableFuture<Void> doDownloadFile(
93107
Priority priority,
94108
Reason reason)
95109
throws IOException {
96-
checkArgument(metadata.isRemote(), "Cannot download file that is not a remote file.");
97110
RequestMetadata requestMetadata =
98111
TracingMetadataUtils.buildMetadata(
99112
buildRequestId,
@@ -140,4 +153,22 @@ protected ListenableFuture<Void> doDownloadFile(
140153
}),
141154
directExecutor());
142155
}
156+
157+
public void handleRewoundActionOutputs(Collection<Artifact> outputs) {
158+
// SkyframeActionExecutor#prepareForRewinding does *not* call this method because the
159+
// RemoteActionFileSystem corresponds to an ActionFileSystemType with inMemoryFileSystem() ==
160+
// true. While it is true that resetting outputDirectoryHelper isn't necessary to undo the
161+
// caching of output directory creation during action preparation, we still need to reset here
162+
// since outputDirectoryHelper is also used by AbstractActionInputPrefetcher.
163+
outputDirectoryHelper.invalidateTreeArtifactDirectoryCreation(outputs);
164+
for (Artifact output : outputs) {
165+
// Action templates have TreeFileArtifacts as outputs, which isn't supported by the trie. We
166+
// only need to track the tree artifacts themselves.
167+
if (output instanceof Artifact.TreeFileArtifact) {
168+
rewoundActionOutputs.add(output.getParent());
169+
} else {
170+
rewoundActionOutputs.add(output);
171+
}
172+
}
173+
}
143174
}

src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java

Lines changed: 32 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -64,6 +64,8 @@
6464
import com.google.common.eventbus.Subscribe;
6565
import com.google.common.util.concurrent.Futures;
6666
import com.google.common.util.concurrent.ListenableFuture;
67+
import com.google.common.util.concurrent.ListeningExecutorService;
68+
import com.google.common.util.concurrent.MoreExecutors;
6769
import com.google.common.util.concurrent.SettableFuture;
6870
import com.google.devtools.build.lib.actions.ActionInput;
6971
import com.google.devtools.build.lib.actions.Artifact;
@@ -151,8 +153,8 @@
151153
import java.util.Set;
152154
import java.util.TreeSet;
153155
import java.util.concurrent.CancellationException;
156+
import java.util.concurrent.CountDownLatch;
154157
import java.util.concurrent.ExecutionException;
155-
import java.util.concurrent.ExecutorService;
156158
import java.util.concurrent.Executors;
157159
import java.util.concurrent.Phaser;
158160
import java.util.concurrent.Semaphore;
@@ -192,9 +194,10 @@ public class RemoteExecutionService {
192194
private final Set<String> reportedErrors = new HashSet<>();
193195

194196
@SuppressWarnings("AllowVirtualThreads")
195-
private final ExecutorService backgroundTaskExecutor =
196-
Executors.newThreadPerTaskExecutor(
197-
Thread.ofVirtual().name("remote-execution-bg-", 0).factory());
197+
private final ListeningExecutorService backgroundTaskExecutor =
198+
MoreExecutors.listeningDecorator(
199+
Executors.newThreadPerTaskExecutor(
200+
Thread.ofVirtual().name("remote-execution-bg-", 0).factory()));
198201

199202
private final AtomicBoolean shutdown = new AtomicBoolean(false);
200203
private final AtomicBoolean buildInterrupted = new AtomicBoolean(false);
@@ -1761,16 +1764,31 @@ public void uploadOutputs(
17611764

17621765
if (remoteOptions.remoteCacheAsync
17631766
&& !action.getSpawn().getResourceOwner().mayModifySpawnOutputsAfterExecution()) {
1764-
backgroundTaskExecutor.execute(
1765-
() -> {
1766-
try {
1767-
doUploadOutputs(action, spawnResult, onUploadComplete);
1768-
} catch (ExecException e) {
1769-
reportUploadError(e);
1770-
} catch (InterruptedException ignored) {
1771-
// ThreadPerTaskExecutor does not care about interrupt status.
1772-
}
1773-
});
1767+
var uploadDone = new CountDownLatch(1);
1768+
var future =
1769+
backgroundTaskExecutor.submit(
1770+
() -> {
1771+
try {
1772+
doUploadOutputs(action, spawnResult, onUploadComplete);
1773+
} catch (ExecException e) {
1774+
reportUploadError(e);
1775+
} catch (InterruptedException ignored) {
1776+
// ThreadPerTaskExecutor does not care about interrupt status.
1777+
} finally {
1778+
uploadDone.countDown();
1779+
}
1780+
});
1781+
1782+
if (outputService instanceof RemoteOutputService remoteOutputService
1783+
&& remoteOutputService.getRewoundActionSynchronizer()
1784+
instanceof RemoteRewoundActionSynchronizer remoteRewoundActionSynchronizer) {
1785+
remoteRewoundActionSynchronizer.registerOutputUploadTask(
1786+
action.getRemoteActionExecutionContext().getSpawnOwner(),
1787+
() -> {
1788+
future.cancel(true);
1789+
uploadDone.await();
1790+
});
1791+
}
17741792
} else {
17751793
doUploadOutputs(action, spawnResult, onUploadComplete);
17761794
}

0 commit comments

Comments
 (0)