Add Bazel support for --rewind_lost_inputs#25477
Add Bazel support for --rewind_lost_inputs#25477fmeum wants to merge 6 commits intobazelbuild:masterfrom
--rewind_lost_inputs#25477Conversation
12473dc to
be8fae5
Compare
2ce580a to
c11ea62
Compare
--rewind_lost_inputs to Bazel
--rewind_lost_inputs to Bazel--rewind_lost_inputs
|
@justinhorvitz @coeuvre This is the first (and hopefully final) PR I have planned that is specific to action rewinding (not necessary for build rewinding) - it should be all that remains to get Bazel to support action rewinding. If time permits, it would be great if you could already take a first look and let me know whether it could have a chance to be accepted. If so, I would then work out the todos. |
4b2906c to
dc02bd8
Compare
|
I'm very impressed that you figured out how to get the synchronization right. That alone is quite a feat. But I'm thinking about how we can avoid the complexity altogether:
Do either of those thoughts resonate with you? |
|
Before I worked on this PR, I actually tried to make the output handling of Bazel's spawn strategies atomic. This runs into a few Bazel-specific complexities:
The explicit synchronization scheme also has an advantage compared to the Blaze approach in that it prevents "input tearing", that is, an action consuming outputs from multiple different (re-)executions of another action. This makes the effects of flakiness in builds even worse and Bazel builds already tend to be more flaky on average (simply because most companies don't have a "Bazel/Blaze team" and hermetic C++ toolchains are more difficult to procure). If we wanted to avoid extra complexity, we could limit action rewinding to builds that only use sandboxed execution strategies. That would still require somewhat subtle work to make all these strategies atomic in how they handle their input, while not supporting the default Javac strategy (unsandboxed multiplex worker). It would also mean that we can't enable I'm happy to provide more context on Bazel use cases and challenges and am also very much open to other approaches - this is just the best I could manage so far after weighing these pros and cons. |
We're actively looking to enable rewinding for internal builds that use mixed execution strategies (remote, persistent worker, local). @ericfelly is thinking about some sort of local storage for inputs that is separate from the output tree to avoid the deletion race concerns. |
This sounds very interesting. Can you share more details about the use case and/or approach? More specifically, are you planning to support:
|
|
Outputs of local actions should never be lost, so we're only planning a solution for the former. I'm not going to dive into reviewing this PR any further right now unless some bazel stakeholders decide that this is the direction they want to go. Just for reference, my team's priorities are to support google-internal use cases, and I try my best to review PRs on a best-effort basis. In this case, I would want someone more bazel-oriented to make a decision on the direction. |
6cfe0cc to
7f78996
Compare
| */ | ||
| final class RemoteRewoundActionSynchronizer implements RewoundActionSynchronizer { | ||
| // A single coarse lock is used to synchronize rewound actions (writers) and both rewound and | ||
| // non-rewound actions (readers) as long as no rewound action has attempted to prepare for its |
There was a problem hiding this comment.
@coeuvre Regarding the optimization potential we discussed privately: If we can be sure that a certain action writes its outputs atomically and doesn't need them to be deleted (say, if we know the action runs with remote execution, where we can arrange for this), it would not need to acquire the write lock. If all actions are of this type (as they would be at Google), the fine-grained lock would never be inflated.
acc911a to
90740d1
Compare
This enables future work on making action rewinding work in Bazel with `--jobs` values larger than 1, which is much more challenging for standalone execution due to actions directly operating on the exec root.
b091cb0 to
6eaed0f
Compare
|
@coeuvre Thanks! I updated the PR description to include a RELNOTES line, could you update it in your imported CL? |
|
@bazel-io fork 9.1.0 |
|
@bazel-io fork 8.7.0 |
|
Import done. Fixed all internal tests. Sent out for internal review. |
src/main/java/com/google/devtools/build/lib/remote/RemoteRewoundActionSynchronizer.java
Show resolved
Hide resolved
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
As of #25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in #25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed #25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes #26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes #25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
Background
As of #25396, action rewinding (controlled by
--rewind_lost_inputs) and build rewinding (controlled by--experimental_remote_cache_eviction_retries) are equally effective at recovering lost inputs.However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if
--jobs=1, as discovered in #25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:Changes
This PR adds Bazel support for
--rewind_lost_inputswith arbitrary--jobsvalues by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.Synchronization is achieved by adding try-with-resources scopes backed by a new
RewoundActionSynchronizerinterface toSkyframeActionExecutorthat wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (--remote_cache_async).The synchronization scheme relies on a single
ReadWriteLockthat is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment inRemoteRewoundActionSynchronizerfor details as well as a proof that this scheme is free of deadlocks.Subsumes the previously reviewed #25412, which couldn't be merged due to the lack of synchronization.
Tested for races manually by running the following command (also with
ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10):Fixes #26657
RELNOTES: Bazel now has experimental support for
--rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.