-
Notifications
You must be signed in to change notification settings - Fork 13.8k
Move coroutine upvars into locals for better memory economy #135527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Move coroutine upvars into locals for better memory economy #135527
Conversation
Some changes occurred in compiler/rustc_codegen_cranelift cc @bjorn3 Some changes occurred to the CTFE / Miri interpreter cc @rust-lang/miri Some changes occurred to MIR optimizations cc @rust-lang/wg-mir-opt Some changes occurred to the CTFE machinery cc @rust-lang/wg-const-eval |
This comment has been minimized.
This comment has been minimized.
☔ The latest upstream changes (presumably #135715) made this pull request unmergeable. Please resolve the merge conflicts. |
I don't think this needs a reviewer? |
3e6a399
to
9603ad6
Compare
cc @Darksonn @tmandry @eholk @rust-lang/wg-async Ding here is reworking the layout of coroutines to try to reduce their memory footprint (and that of What do people think? |
This comment has been minimized.
This comment has been minimized.
For anyone searching for a description of what this PR changes, it's summarized at the top of compiler/rustc_mir_transform/src/coroutine/relocate_upvars.rs. |
//! The reason is that it is possible that coroutine layout may change and the source memory location of | ||
//! an upvar may not necessarily be mapped exactly to the same place as in the `Unresumed` state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we decide the offsets of upvars in Unresumed
in the same place as we decide the offset of saved locals? Couldn't we then "backpropagate" the field offsets for each upvar's local as the offset for the corresponding upvar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for reviewing! I had a backlog of things due to sickness.
True indeed. This statement is completely voided by the work in the second commit. I will reword this section in the following way.
By enabling the feature gate coroutine_new_layout
the field offsets of the upvars in Unresumed
state are further exactly placed in the same place as their corresponding saved locals, which is guaranteed by the alternative coroutine layout calculator that enters in effect. <... quote the relevant comment/file/etc. ...>
I don't personally have any means of performance testing this at the moment. It would be much easier if it landed behind a feature gate. |
☔ The latest upstream changes (presumably #135318) made this pull request unmergeable. Please resolve the merge conflicts. |
Cc @arielb1 who was also investigated this
…On Wed, Jan 29, 2025, at 7:56 PM, Tyler Mandry wrote:
***@***.**** commented on this pull request.
In compiler/rustc_mir_transform/src/coroutine/relocate_upvars.rs <#135527 (comment)>:
> +//! The reason is that it is possible that coroutine layout may change and the source memory location of
+//! an upvar may not necessarily be mapped exactly to the same place as in the `Unresumed` state.
Don't we decide the offsets of upvars in `Unresumed` in the same place as we decide the offset of saved locals? Couldn't we then "backpropagate" the field offsets for each upvar's local as the offset for the corresponding upvar?
—
Reply to this email directly, view it on GitHub <#135527 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABF4ZTFDPQDUNGH5L6MGSL2NF2CHAVCNFSM6AAAAABVG4UUZ2VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDKOBSGY4TKMZUHA>.
You are receiving this because you are on a team that was mentioned.Message ID: ***@***.***>
|
I think it is fair to land with a feature gate so that we can get to play with it. The PR has temporarily disabled the check on the feature gate. However, given that coroutine layout data is keyed individually by their |
9603ad6
to
3a1e04a
Compare
This comment has been minimized.
This comment has been minimized.
3a1e04a
to
61d4bbd
Compare
This comment has been minimized.
This comment has been minimized.
Would this be better as a |
Are there any issues if only one crate activates it but others do not? if there are no issues, a feature gate seems ok (and easier to use ^^) |
☔ The latest upstream changes (presumably #137030) made this pull request unmergeable. Please resolve the merge conflicts. |
A feature doesn't allow turning it on for the whole build, you'd have to fork every single crate that uses async. A -Z flag would be better IMO. |
Agreed on a If my understanding is correct, we shouldn't expect any regression from this approach (only upside), but since we currently rely on later passes eliding copies there might be some regression. We could be more aggressive in eliding the copies ourselves, but maybe this is hard. |
Thanks for looking into this! I will have time this week to clean this up a bit and I will ask rustbot to set it to ready-for-review. |
61d4bbd
to
0ff7e65
Compare
Some changes occurred in compiler/rustc_codegen_ssa |
@craterbot run p=1 mode=build-and-test crates=https://crater-reports.s3.amazonaws.com/pr-135527-3/retry-regressed-list.txt |
👌 Experiment ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more |
🚧 Experiment ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more |
🎉 Experiment
Footnotes
|
@craterbot run p=1 mode=build-and-test crates=https://crater-reports.s3.amazonaws.com/pr-135527-4/retry-regressed-list.txt |
👌 Experiment ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more |
🚧 Experiment ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more |
🎉 Experiment
Footnotes
|
I am checking some of the crates: their tests are somewhat flaky and they are not using |
... and treat coroutine upvar captures as saved locals as well. This allows the liveness analysis to determine which captures are truly saved across a yield point and which are initially used but discarded at first yield points. In the event that upvar captures are promoted, most certainly because a coroutine suspends at least once, the slots in the promotion prefix shall be reused. This means that the copies emitted in the upvar relocation MIR pass will eventually elided and eliminated in the codegen phase, hence no additional runtime cost is realised. Additional MIR dumps are inserted so that it is easier to inspect the bodies of async closures, including those that captures the state by-value. Debug information is updated to point at the correct location for upvars in borrow checking errors and final debuginfo. A language change that this patch enables is now actually reverted, so that lifetimes on relocated upvars are invariant with the upvars outside of the coroutine body. We are deferring the language change to a later discussion. Co-authored-by: Dario Nieuwenhuis <[email protected]> Signed-off-by: Xiangfei Ding <[email protected]>
073de09
to
f135989
Compare
This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
@rustbot ready
r? oli-obk Maybe cc @RalfJung as well. I am not sure if the changes to MIR would interfere with the aggressive optimisation project. Since the last review, the MIR transformation pass is completely in |
|
Tagging @oli-obk for review because I was advised that for MIR related changes I should best approach you as well. |
I don't know much about coroutine / closure lowering and don't have the time to dig into it right now. Apart from Oli, I am not sure who our main experts for that are (other than @eddyb who's not really around any more)... @compiler-errors @cjgillot @lcnr ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't read everything yet. These are my first comments.
self.show_mutating_upvar(tcx, id.expect_local(), the_place_err, err); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this change land independently?
|
||
/// This map `A -> B` allows later MIR passes, error reporters | ||
/// and layout calculator to relate saved locals `A` sourced from upvars | ||
/// and locals `B` that upvars are moved into. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate a bit the comment? What exactly is relocated_upvars[A]
?
storage_conflicts, | ||
relocated_upvars, | ||
pack: PackCoroutineLayout::CapturesOnly, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you simplify this function? You don't need to iterate over upvar_tys
or upvar_saved_locals
when there is only one entry.
#[inline] | ||
fn prefix_tys(self) -> &'tcx List<Ty<'tcx>> { | ||
self.upvar_tys() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To ease review and git blame
, do you mind splitting this change into its own commit?
if !self.0 { | ||
debug!("relocate upvar is set to no-op"); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this test happen outside the call to run
?
&self, | ||
tcx: TyCtxt<'tcx>, | ||
body: &mut Body<'tcx>, | ||
local_upvar_map: &mut IndexVec<FieldIdx, Option<Local>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the Option
?
enlarged_storage_conflicts | ||
} else { | ||
storage_conflicts | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this happen inside locals_live_across_suspend_points
?
// EMIT_MIR_FOR_EACH_PANIC_STRATEGY | ||
|
||
// EMIT_MIR coroutine_relocate_upvars.main-{closure#0}.RelocateUpvars.before.mir | ||
// EMIT_MIR coroutine_relocate_upvars.main-{closure#0}.RelocateUpvars.after.mir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea! Could we have RelocateUpvars.diff
only?
|| { | ||
x = String::new(); | ||
yield; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind adding a case where _1
contains some upvars by value?
test_async_drop([AsyncInt(1), AsyncInt(2)], 104).await; | ||
test_async_drop((AsyncInt(3), AsyncInt(4)), 120).await; | ||
test_async_drop(5, 16).await; | ||
test_async_drop(Int(0), [16, 24][cfg!(classic) as usize]).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the classic
case regressing?
let mut new_decl = LocalDecl::new(ty, span); | ||
if immutable { | ||
new_decl = new_decl.immutable(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still useful?
Default, | ||
Decodable, | ||
Encodable | ||
)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: split in two derive
calls.
span, | ||
Some((source_info.span, from_awaited_ty)), | ||
)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still useful?
patch.apply(body); | ||
|
||
// Manually patch so that prologue is the new entry-point | ||
let preds = body.basic_blocks.predecessors()[START_BLOCK].clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, @tmiasko pointed that the start block cannot have incoming edges. So we can do simpler and just add the statements to it.
Replace #127522
Related to #62958
The problem statement
#62958 demonstrates two problems. One is that upvars are always unconditionally promoted to prefix data fields of the state machine; the other is that the opportunity to achieve a more compact data layout is lost because captured upvars are not subjected to liveness analysis, in the sense that the memory space at one point occupied by upvars is never reclaimed and made available for other saved data across certain yield points, even when they are dead at those suspension locations.
The second problem is better demonstrated with this code snippet.
The difficulty lies with the fact that captured upvars do not receive their own locals inside a coroutine body. If we can assign locals to them somehow, we can run the layout scheme as usual and the optimisation on the data layout comes into effect out of the box in most cases.
Proposed changes
This is an initial work to improve memory economy of coroutine and
async
futures, by reducing the unnecessary of promotion of captured upvars into state prefix. In a nutshell, this patch works along the idea in this comment and this comment.The patch contains the following changes.
RelocateUpvar
MIR pass that inserts a MIR gadget, through which captured values by coroutine orasync
bodies or closures are moved into the inner MIR locals. This opens opportunities to subject the captured upvars to the same liveness analysis right before theStateTransform
rewrites and determine which are the necessary ones to be stored in the coroutine state during suspension.prefix
data regions of coroutine states. Instead, they are moved into theUnresumed
state, or by convention the first variants of the state ADTs.prefix
after all, we further arrange the coroutine state data layout, so that their offsets in theUnresumed
state coincide with their memory slots after promotion. This means that during codegen, the additional moves introduced by theRelocateUpvar
gadget are actually elided. The relevant change is implemented inrustc_abi
.Unresumed
variant.-Z pack-coroutine-layout=captures-only
. The default ispack-coroutine-layout=no
, so that we keep the layout aligned with the stable.Other than upvars, the coroutine state data layout scheme remains largely the same.
yanked
# Design decisionsWhy does this patch not perform relocation as part of the
StateTransform
pass?This idea is explored in #120168 already back in 2023. The conclusion then was that it does not interact well with MIR dataflow analysis. It requires
StateTransform
pass to assign a virtual "MIR local" to each upvars at the beginning. Apparently this created difficulty in reviewing the piece as soon as we overload this hugeStateTransform
pass with this additional renumbering work. The idea has always been that it is better to perform the renumbering in its own pass, to keepStateTransform
simple.This patch has gone further to carry out the re-write as early as possible, so that the passes in between can perform rewrites as per current MIR local semantics and optimisation rules.
Further optimisation to be implemented behind a feature gate
Point 4 mentions that any local to be saved across suspensions will be promoted whenever they are alive across two or more yield locations. We would like to run an experiment behind a feature gate on improvements of the layout scheme. For ease of reviewing, it is better to drop this part of work from this PR. Nevertheless, the idea runs along the implementation in #127522 and we intend to propose a second PR just for that.
Old PR description
Good day, this PR is related to #127522 and it is made easier to the public to test out a new coroutine/`async` state machine directly.Prepare the compiler for tests
For starter, you may build the compiler as prescribed in the
rustc-dev-guide
instruction. If a test in the docker container is desirable, you may build this compiler withsrc/ci/docker/run.sh dist-x86_64-linux --dev
forx86_64
and package the compiler with../x dist
to produce the artifacts inobj/dist-x86_64-linux/build/dist
. This Dockerfile gets you a working Rust builder image which allows you to build your Rust applications inbookworm
.The state of performance
So far with this patch, I have been studying the performance impact on the cases of
tokio
's single- and multi-threaded runtime, as well as a simpleaxum
HTTP service. As far as I can see, I can find a change in performance characteristics that are statistically significant, one-sidedp = 0.05
.This time, I would like to call for pooling in your valuable assessments and thoughts on this patch. I kindly request experiments from you and hopefully you can provide regression cases with
perf record -e cycles:u,instructions:u,cache-misses:u
reports.Thank you all so much! 🙇