[8.7.0] Add Bazel support for --rewind_lost_inputs #28971
Open
fmeum wants to merge 3 commits intobazelbuild:release-8.7.0from
Open
[8.7.0] Add Bazel support for --rewind_lost_inputs #28971fmeum wants to merge 3 commits intobazelbuild:release-8.7.0from
--rewind_lost_inputs #28971fmeum wants to merge 3 commits intobazelbuild:release-8.7.0from
Conversation
f2ff9ba to
f3f28f2
Compare
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues: * When a lost input is detected, the progress of actions running concurrently isn't lost. * Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability. * Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely. * Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues. This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically. Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`). The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks. ________ Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization. Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`): ``` bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled ``` Fixes bazelbuild#26657 RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions. Closes bazelbuild#25477. PiperOrigin-RevId: 882050264 Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 (cherry picked from commit 464eacb)
This ensures that the journal file is not kept open after a build, which has been observed to cause `RewindingTest` to fail on Windows, which disallows deleting open files (in this case during test case cleanup). Since the journal is written out at most every 3 seconds for all current usages of the `PersistentMap` class, the overhead of the additional `open` is negligible. Also include small tweaks to `RewindingTest` so that it can be enabled on Windows. In particular, the order of `SymlinkAction` and `SourceManifestAction` being rewound doesn't seem to be fixed and can differ on Windows. Closes bazelbuild#28108. PiperOrigin-RevId: 855242645 Change-Id: Ic7434139b290c6b0e2061977f747e890c3c5ece6 (cherry picked from commit 2be693e)
f3f28f2 to
0c16de2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
As of #25396, action rewinding (controlled by
--rewind_lost_inputs) and build rewinding (controlled by--experimental_remote_cache_eviction_retries) are equally effective at recovering lost inputs.However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if
--jobs=1, as discovered in #25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:Changes
This PR adds Bazel support for
--rewind_lost_inputswith arbitrary--jobsvalues by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.Synchronization is achieved by adding try-with-resources scopes backed by a new
RewoundActionSynchronizerinterface toSkyframeActionExecutorthat wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (--remote_cache_async).The synchronization scheme relies on a single
ReadWriteLockthat is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment inRemoteRewoundActionSynchronizerfor details as well as a proof that this scheme is free of deadlocks.Subsumes the previously reviewed #25412, which couldn't be merged due to the lack of synchronization.
Tested for races manually by running the following command (also with
ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10):Fixes #26657
RELNOTES: Bazel now has experimental support for
--rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.Includes the following cherry-picked changes:
\nbefore asserting on content, which is necessary for tests to pass on Windows