Add Bazel support for `--rewind_lost_inputs` by fmeum · Pull Request #25477 · bazelbuild/bazel

fmeum · 2025-03-05T17:26:40Z

Background

As of #25396, action rewinding (controlled by --rewind_lost_inputs) and build rewinding (controlled by --experimental_remote_cache_eviction_retries) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if --jobs=1, as discovered in #25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:

* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

Changes

This PR adds Bazel support for --rewind_lost_inputs with arbitrary --jobs values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new RewoundActionSynchronizer interface to SkyframeActionExecutor that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (--remote_cache_async).
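
A minimal sketch of how such a try-with-resources scope could look, assuming a single coarse read/write lock. The names `RewindScopeSketch`, `Scope`, and `CoarseSynchronizer` are hypothetical and chosen for illustration only; this is not the `RewoundActionSynchronizer` interface added by the PR.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RewindScopeSketch {
  /** Scope released by try-with-resources; close() is narrowed to throw nothing. */
  interface Scope extends AutoCloseable {
    @Override
    void close();
  }

  /** Hypothetical, simplified synchronizer (assumption, not Bazel's actual class). */
  static final class CoarseSynchronizer {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /**
     * Wraps preparation and execution of one action. A rewound action deletes and
     * recreates its outputs, so it enters exclusively; all other actions share access.
     */
    Scope enter(boolean rewound) {
      Lock l = rewound ? lock.writeLock() : lock.readLock();
      l.lock();
      return l::unlock;
    }

    int activeReaders() {
      return lock.getReadLockCount();
    }

    boolean writerActive() {
      return lock.isWriteLocked();
    }
  }

  public static void main(String[] args) {
    CoarseSynchronizer sync = new CoarseSynchronizer();
    try (Scope consumer = sync.enter(false)) {
      // Upstream outputs can be read here: no rewound action can be deleting them.
      System.out.println("readers=" + sync.activeReaders());
    }
    try (Scope producer = sync.enter(true)) {
      // Stale outputs would be deleted and the action re-executed here;
      // consumers block in enter(false) until this scope is closed.
      System.out.println("writer=" + sync.writerActive());
    }
  }
}
```

The scope is tied to the block, so the lock is released even if preparation or execution throws.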

The synchronization scheme relies on a single ReadWriteLock that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in RemoteRewoundActionSynchronizer for details as well as a proof that this scheme is free of deadlocks.
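
The inflation step can be illustrated with a heavily simplified sketch. It assumes string action names as lock keys and omits several things the real scheme has to handle (read locks a rewound action needs on its own inputs, async uploads, and the deadlock-freedom argument). `InflatingLockSketch` and all of its members are hypothetical names, not code from `RemoteRewoundActionSynchronizer`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class InflatingLockSketch {
  interface Scope extends AutoCloseable {
    @Override
    void close();
  }

  // Only ever read-locked until the first rewind, so the common case stays cheap.
  private final ReentrantReadWriteLock coarse = new ReentrantReadWriteLock();
  // One lock per producing action, used only after inflation (assumed keying).
  private final ConcurrentHashMap<String, ReentrantReadWriteLock> fine =
      new ConcurrentHashMap<>();
  private volatile boolean inflated = false;

  boolean isInflated() {
    return inflated;
  }

  private ReentrantReadWriteLock lockFor(String action) {
    return fine.computeIfAbsent(action, k -> new ReentrantReadWriteLock());
  }

  /** A non-rewound action that consumes outputs of the given producing actions. */
  Scope enterNormal(List<String> inputProducers) {
    if (!inflated) {
      coarse.readLock().lock();
      if (!inflated) {
        return coarse.readLock()::unlock; // still in the cheap coarse phase
      }
      coarse.readLock().unlock(); // inflation happened in between; fall through
    }
    List<Lock> held = new ArrayList<>();
    for (String producer : inputProducers) {
      Lock read = lockFor(producer).readLock();
      read.lock();
      held.add(read);
    }
    return () -> held.forEach(Lock::unlock);
  }

  /** A rewound action that is about to delete and recreate its outputs. */
  Scope enterRewound(String action) {
    if (!inflated) {
      // Taking the write lock waits for every action still in the coarse phase.
      coarse.writeLock().lock();
      try {
        inflated = true;
      } finally {
        coarse.writeLock().unlock();
      }
    }
    Lock write = lockFor(action).writeLock();
    write.lock(); // blocks only consumers of this action's outputs
    return write::unlock;
  }

  public static void main(String[] args) {
    InflatingLockSketch sync = new InflatingLockSketch();
    try (Scope s = sync.enterNormal(List.of("compile"))) {
      System.out.println("inflated before rewind: " + sync.isInflated());
    }
    try (Scope s = sync.enterRewound("compile")) {
      System.out.println("inflated during rewind: " + sync.isInflated());
    }
  }
}
```

In this sketch, an action unrelated to `compile` can run while `compile` is rewound, which is the concurrency the per-action locks are meant to preserve.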

Subsumes the previously reviewed #25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10):

bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled

Fixes #26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

fmeum · 2025-03-14T18:09:46Z

@justinhorvitz @coeuvre This is the first (and hopefully final) PR I have planned that is specific to action rewinding (not necessary for build rewinding) - it should be all that remains to get Bazel to support action rewinding. If time permits, it would be great if you could already take a first look and let me know whether it could have a chance to be accepted. If so, I would then work out the todos.

justinhorvitz · 2025-03-19T20:23:54Z

I'm very impressed that you figured out how to get the synchronization right. That alone is quite a feat. But I'm thinking about how we can avoid the complexity altogether:

* Is there a way to translate a failure to read an input that was deleted into a lost input exception?
* Why does rewinding work for blaze without this complexity? It's because our action filesystem has no dependency on reading from the output base. When it needs to open an input stream, it does so by reading the blob from remote storage.

Do either of those thoughts resonate with you?

fmeum · 2025-03-20T18:25:39Z

Before I worked on this PR, I actually tried to make the output handling of Bazel's spawn strategies atomic. This runs into a few Bazel-specific complexities:

* Bazel supports BwoB with a remote or disk cache and local execution, which means that it's not only the remote strategy that can lose outputs, it's all the strategies. This includes standalone and unsandboxed worker execution, which execute arbitrary binaries directly on the real exec root. These Spawns may read truncated inputs, non-atomically modify their outputs, outright fail if their outputs exist before they run, etc.
* On Windows, you can't delete a file that is currently open for reading, which makes surgically removing outputs while another action is consuming them even more challenging.

The explicit synchronization scheme also has an advantage compared to the Blaze approach in that it prevents "input tearing", that is, an action consuming outputs from multiple different (re-)executions of another action. Input tearing makes the effects of flakiness in builds even worse, and Bazel builds already tend to be more flaky on average (simply because most companies don't have a "Bazel/Blaze team" and hermetic C++ toolchains are more difficult to procure).

If we wanted to avoid extra complexity, we could limit action rewinding to builds that only use sandboxed execution strategies. That would still require somewhat subtle work to make all these strategies atomic in how they handle their input, while not supporting the default Javac strategy (unsandboxed multiplex worker). It would also mean that we can't enable --rewind_lost_inputs by default, which is unfortunate for a flag that has a quite substantial but also pretty hidden usability impact.

I'm happy to provide more context on Bazel use cases and challenges and am also very much open to other approaches - this is just the best I could manage so far after weighing these pros and cons.

justinhorvitz · 2025-03-26T02:39:16Z

> If we wanted to avoid extra complexity, we could limit action rewinding to builds that only use sandboxed execution strategies

We're actively looking to enable rewinding for internal builds that use mixed execution strategies (remote, persistent worker, local). @ericfelly is thinking about some sort of local storage for inputs that is separate from the output tree to avoid the deletion race concerns.

fmeum · 2025-03-26T14:43:53Z

> We're actively looking to enable rewinding for internal builds that use mixed execution strategies (remote, persistent worker, local). @ericfelly is thinking about some sort of local storage for inputs that is separate from the output tree to avoid the deletion race concerns.

This sounds very interesting. Can you share more details about the use case and/or approach?

More specifically, are you planning to support:

* local executions that discover lost inputs (so that they trigger rewinding of the actions that produce these inputs, but the rewound actions are still all assumed to be remote)
* local executions whose outputs are lost (so that they may be rewound)

justinhorvitz · 2025-04-01T02:02:40Z

Outputs of local actions should never be lost, so we're only planning a solution for the former.

I'm not going to dive into reviewing this PR any further right now unless some bazel stakeholders decide that this is the direction they want to go. Just for reference, my team's priorities are to support google-internal use cases, and I try my best to review PRs on a best-effort basis. In this case, I would want someone more bazel-oriented to make a decision on the direction.

fmeum · 2025-05-21T13:18:49Z

src/main/java/com/google/devtools/build/lib/remote/RemoteOutputService.java

+   */
+  final class RemoteRewoundActionSynchronizer implements RewoundActionSynchronizer {
+    // A single coarse lock is used to synchronize rewound actions (writers) and both rewound and
+    // non-rewound actions (readers) as long as no rewound action has attempted to prepare for its


@coeuvre Regarding the optimization potential we discussed privately: If we can be sure that a certain action writes its outputs atomically and doesn't need them to be deleted (say, if we know the action runs with remote execution, where we can arrange for this), it would not need to acquire the write lock. If all actions are of this type (as they would be at Google), the fine-grained lock would never be inflated.
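
For illustration, the property "writes its outputs atomically and doesn't need them to be deleted" could look like the following on a local filesystem. This is a generic sketch with hypothetical names (`AtomicOutputSketch`, `writeAtomically`), not how Bazel or remote execution stages outputs. It relies on `Files.move` with `ATOMIC_MOVE`, whose behavior when the target already exists is implementation specific according to the JDK documentation (on POSIX systems the rename typically replaces the target).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicOutputSketch {
  /**
   * Writes the content to a temporary file next to the target and then renames it
   * over the target, so readers see either the old or the new file, never a partial one.
   */
  static void writeAtomically(Path target, String content) throws IOException {
    Path dir = target.toAbsolutePath().getParent();
    Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
    try {
      Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
      Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    } finally {
      // No-op after a successful move; cleans up if writing or moving failed.
      Files.deleteIfExists(tmp);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("atomic-output");
    Path out = dir.resolve("out.txt");
    writeAtomically(out, "first execution");
    writeAtomically(out, "re-execution after rewind");
    System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
  }
}
```

An action with this property never exposes a missing or truncated output to a concurrent reader, which is why it would not need exclusive access during re-execution.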

coeuvre

LGTM. ~~Please rebase and I will do the import.~~ Importing.

fmeum · 2026-03-09T10:16:13Z

@coeuvre Thanks! I updated the PR description to include a RELNOTES line, could you update it in your imported CL?

fmeum · 2026-03-09T10:24:35Z

@bazel-io fork 9.1.0

fmeum · 2026-03-09T10:24:42Z

@bazel-io fork 8.7.0

coeuvre · 2026-03-09T10:54:49Z

Import done. Fixed all internal tests. Sent out for internal review.

src/main/java/com/google/devtools/build/lib/remote/RemoteRewoundActionSynchronizer.java

As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:

* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.

________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):

```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132

(cherry picked from commit 464eacb)


fmeum force-pushed the action-rewinding-simple-concurrency branch 5 times, most recently from 12473dc to be8fae5 Compare March 10, 2025 08:30

fmeum force-pushed the action-rewinding-simple-concurrency branch 7 times, most recently from 2ce580a to c11ea62 Compare March 14, 2025 14:41

fmeum changed the title ~~Fix action rewinding races with Bazel's filesystems~~ Add support for --rewind_lost_inputs to Bazel Mar 14, 2025

fmeum changed the title ~~Add support for --rewind_lost_inputs to Bazel~~ Add Bazel support for --rewind_lost_inputs Mar 14, 2025

fmeum requested review from coeuvre and justinhorvitz March 14, 2025 18:10

fmeum force-pushed the action-rewinding-simple-concurrency branch from 4b2906c to dc02bd8 Compare March 18, 2025 17:40

fmeum force-pushed the action-rewinding-simple-concurrency branch 3 times, most recently from 6cfe0cc to 7f78996 Compare May 20, 2025 16:13

meisterT requested a review from ericfelly October 16, 2025 07:15

fmeum force-pushed the action-rewinding-simple-concurrency branch 2 times, most recently from acc911a to 90740d1 Compare December 14, 2025 11:13

fmeum requested a review from coeuvre March 6, 2026 16:15

fmeum added 6 commits March 8, 2026 11:12

Run Bazel's RewindingTest with remote execution

88d3582

This enables future work on making action rewinding work in Bazel with `--jobs` values larger than 1, which is much more challenging for standalone execution due to actions directly operating on the exec root.

Support action rewinding in Bazel

ec726d1

Address comment by moving classes around

28ba587

Replace custom ReentrantReadWriteLock with StampedLock

18bba2e

Add comments

4533b36

Rebase

6eaed0f

fmeum force-pushed the action-rewinding-simple-concurrency branch from b091cb0 to 6eaed0f Compare March 8, 2026 14:43

coeuvre approved these changes Mar 9, 2026

This was referenced Mar 9, 2026

[9.1.0] Add Bazel support for --rewind_lost_inputs #28926

Closed

[8.7.0] Add Bazel support for --rewind_lost_inputs #28927

Open

tjgq requested changes Mar 11, 2026

src/main/java/com/google/devtools/build/lib/remote/RemoteRewoundActionSynchronizer.java

copybara-service bot closed this in 464eacb Mar 11, 2026

github-actions bot removed the awaiting-review label Mar 11, 2026

fmeum deleted the action-rewinding-simple-concurrency branch March 11, 2026 19:44

fmeum mentioned this pull request Mar 11, 2026

[9.1.0] Add Bazel support for --rewind_lost_inputs #28958

Merged

fmeum mentioned this pull request Mar 12, 2026

[8.7.0] Add Bazel support for --rewind_lost_inputs #28971

Open

fmeum wants to merge 6 commits into bazelbuild:master from fmeum:action-rewinding-simple-concurrency
