dingxiangfei2009
Contributor

@dingxiangfei2009 dingxiangfei2009 commented Jan 15, 2025

Replace #127522
Related to #62958

The problem statement

#62958 demonstrates two problems. First, upvars are unconditionally promoted to prefix data fields of the state machine. Second, captured upvars are not subjected to liveness analysis, so the opportunity for a more compact data layout is lost: the memory occupied by upvars is never reclaimed and made available for other saved data across yield points, even when the upvars are dead at those suspension locations.

The second problem is better demonstrated with this code snippet.

async fn work(another_fut: impl Future) {
    let _ = another_fut.await;
    // now `another_fut` is consumed
    let next_fut = async { .. };
    next_fut.await;
}

// `work`'s layout needs to reserve space for both `another_fut` and `next_fut`, while there is a clear missed opportunity
// to overlap the memory for `another_fut` and `next_fut` for better memory economy.

The difficulty lies with the fact that captured upvars do not receive their own locals inside a coroutine body. If we can assign locals to them somehow, we can run the layout scheme as usual and the optimisation on the data layout comes into effect out of the box in most cases.
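
To observe the effect described above, one can compare the size of the captured future with the size of the future returned by `work`. The following is only a minimal sketch: `big` and `measure` are hypothetical helpers that are not part of this PR, and the exact byte counts depend on the compiler version and on whether the new layout is enabled.

```rust
use std::future::{ready, Future};
use std::hint::black_box;
use std::mem::size_of_val;

/// Hypothetical future that keeps a 512-byte buffer alive across an await.
async fn big() {
    let buf = [0u8; 512];
    ready(()).await;
    // Using `buf` after the await forces it into the saved state.
    black_box(&buf);
}

async fn work(another_fut: impl Future) {
    let _ = another_fut.await;
    // now `another_fut` is consumed
    let next_fut = big();
    next_fut.await;
}

/// Returns (size of the captured future, size of the future made by `work`).
fn measure() -> (usize, usize) {
    let inner = big();
    let inner_size = size_of_val(&inner);
    let outer = work(inner);
    (inner_size, size_of_val(&outer))
}
```

If the second number is close to twice the first, the slots of `another_fut` and `next_fut` are not being overlapped. A result close to the first number would suggest that the space is being reused.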

Proposed changes

This is initial work to improve the memory economy of coroutines and async futures by reducing the unnecessary promotion of captured upvars into the state prefix. In a nutshell, this patch follows the idea in this comment and this comment.

The patch contains the following changes.

  1. Introduction of a RelocateUpvar MIR pass that inserts a MIR gadget, through which values captured by coroutine bodies, async bodies or closures are moved into inner MIR locals. This makes the captured upvars subject to the same liveness analysis, run right before the StateTransform rewrites, which determines which of them must be stored in the coroutine state during suspension.
  2. With this gadget, we no longer have to keep all upvars in the so-called prefix data regions of coroutine states. Instead, they are moved into the Unresumed state, by convention the first variant of the state ADT.
  3. In addition, when some upvars are eventually used across more than one suspension point and are therefore promoted into the prefix after all, we arrange the coroutine state data layout so that their offsets in the Unresumed state coincide with their memory slots after promotion. This means that during codegen, the additional moves introduced by the RelocateUpvar gadget are actually elided. The relevant change is implemented in rustc_abi.
  4. We then have to translate direct field accesses to the upvars into accesses behind the Unresumed variant.
  5. We have to update diagnostics so that they are better informed about captured values and make more sense in view of this change.
  6. As requested by the review comments, the relocation only applies behind an unstable compiler flag -Z pack-coroutine-layout=captures-only. The default is pack-coroutine-layout=no, so that the layout stays aligned with stable.
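
The difference between keeping an upvar in the prefix and moving it into the Unresumed variant can be modelled with ordinary Rust types. This is a conceptual sketch only: `PrefixLayout`, `PrefixState` and `RelocatedLayout` are hypothetical names, and the real coroutine layout is computed by the compiler rather than written as an enum like this.

```rust
use std::mem::size_of;

/// Hypothetical model of the current scheme: the captured upvar lives in a
/// prefix that exists in every state, next to the state-specific data.
#[allow(dead_code)]
struct PrefixLayout {
    upvar: [u8; 64],
    state: PrefixState,
}

#[allow(dead_code)]
enum PrefixState {
    Unresumed,
    Suspend0 { saved: [u8; 64] },
    Returned,
}

/// Hypothetical model of the relocated scheme: the upvar only occupies space
/// in `Unresumed`, so `Suspend0` can reuse the same bytes for its own data.
#[allow(dead_code)]
enum RelocatedLayout {
    Unresumed { upvar: [u8; 64] },
    Suspend0 { saved: [u8; 64] },
    Returned,
}

/// Returns (size with the upvar in the prefix, size with the upvar relocated).
fn layout_sizes() -> (usize, usize) {
    (size_of::<PrefixLayout>(), size_of::<RelocatedLayout>())
}
```

In this model the prefix version has to hold both 64-byte arrays at once, while the relocated version only needs room for the larger variant plus a discriminant.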

Other than upvars, the coroutine state data layout scheme remains largely the same.

Design decisions (yanked)

Why does this patch not perform relocation as part of the StateTransform pass?

This idea was already explored in #120168 back in 2023. The conclusion then was that it does not interact well with MIR dataflow analysis. It requires the StateTransform pass to assign a virtual "MIR local" to each upvar at the beginning. This made the change difficult to review, because the already large StateTransform pass was overloaded with the additional renumbering work. The idea has always been that it is better to perform the renumbering in its own pass, to keep StateTransform simple.

This patch goes further and carries out the rewrite as early as possible, so that the passes in between can perform rewrites as per the current MIR local semantics and optimisation rules.

Further optimisation to be implemented behind a feature gate

Point 4 mentions that any local to be saved across suspensions is promoted whenever it is live across two or more yield locations. We would like to run an experiment behind a feature gate on improvements to the layout scheme. For ease of review, it is better to drop this part of the work from this PR. Nevertheless, the idea follows the implementation in #127522 and we intend to propose a second PR just for that.

Old PR description

Good day, this PR is related to #127522 and makes it easier for the public to test out a new coroutine/`async` state machine directly.

Prepare the compiler for tests

For starters, you may build the compiler as prescribed in the rustc-dev-guide instructions. If a test in a Docker container is desirable, you may build this compiler with src/ci/docker/run.sh dist-x86_64-linux --dev for x86_64 and package the compiler with ../x dist to produce the artifacts in obj/dist-x86_64-linux/build/dist. This Dockerfile gets you a working Rust builder image which allows you to build your Rust applications in bookworm.

The state of performance

So far with this patch, I have been studying the performance impact on tokio's single- and multi-threaded runtimes, as well as on a simple axum HTTP service. As far as I can see, I cannot find a change in performance characteristics that is statistically significant at one-sided p = 0.05.

This time, I would like to call for your valuable assessments and thoughts on this patch. I kindly request that you run experiments and, where possible, provide regression cases with perf record -e cycles:u,instructions:u,cache-misses:u reports.

Thank you all so much! 🙇

@rustbot
Collaborator

rustbot commented Jan 15, 2025

r? @BoxyUwU

rustbot has assigned @BoxyUwU.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 15, 2025
@rustbot
Collaborator

rustbot commented Jan 15, 2025

Some changes occurred in compiler/rustc_codegen_cranelift

cc @bjorn3

Some changes occurred to the CTFE / Miri interpreter

cc @rust-lang/miri

Some changes occurred to MIR optimizations

cc @rust-lang/wg-mir-opt

Some changes occurred to the CTFE machinery

cc @rust-lang/wg-const-eval

@bors
Collaborator

bors commented Jan 19, 2025

☔ The latest upstream changes (presumably #135715) made this pull request unmergeable. Please resolve the merge conflicts.

@BoxyUwU BoxyUwU removed their assignment Jan 28, 2025
@BoxyUwU
Member

BoxyUwU commented Jan 28, 2025

I don't think this needs a reviewer?

@dingxiangfei2009 dingxiangfei2009 force-pushed the move-upvars-to-locals-for-tests branch from 3e6a399 to 9603ad6 Compare January 28, 2025 23:17
@traviscross
Contributor

cc @Darksonn @tmandry @eholk @rust-lang/wg-async

Ding here is reworking the layout of coroutines to try to reduce their memory footprint (and that of Futures). He's curious to find out whether this introduces any performance or other regressions. In his own testing, he's not been able to find any, but he's interested in more data and experience here to help inform whether this is a worthwhile change.

What do people think?

@tmandry
Member

tmandry commented Jan 30, 2025

For anyone searching for a description of what this PR changes, it's summarized at the top of compiler/rustc_mir_transform/src/coroutine/relocate_upvars.rs.

Comment on lines 20 to 21
//! The reason is that it is possible that coroutine layout may change and the source memory location of
//! an upvar may not necessarily be mapped exactly to the same place as in the `Unresumed` state.
Member

Don't we decide the offsets of upvars in Unresumed in the same place as we decide the offset of saved locals? Couldn't we then "backpropagate" the field offsets for each upvar's local as the offset for the corresponding upvar?

Contributor Author

Thank you for reviewing! I had a backlog of things due to sickness.

True indeed. This statement is completely voided by the work in the second commit. I will reword this section in the following way.


By enabling the feature gate coroutine_new_layout, the field offsets of the upvars in the Unresumed state are placed exactly where their corresponding saved locals are, which is guaranteed by the alternative coroutine layout calculator that then takes effect. <... quote the relevant comment/file/etc. ...>

@tmandry
Member

tmandry commented Jan 30, 2025

I don't personally have any means of performance testing this at the moment. It would be much easier if it landed behind a feature gate.

@bors
Collaborator

bors commented Jan 31, 2025

☔ The latest upstream changes (presumably #135318) made this pull request unmergeable. Please resolve the merge conflicts.

@nikomatsakis
Contributor

nikomatsakis commented Feb 1, 2025 via email

@dingxiangfei2009
Contributor Author

@tmandry

I don't personally have any means of performance testing this at the moment. It would be much easier if it landed behind a feature gate.

I think it is fair to land with a feature gate so that we can get to play with it. The PR has temporarily disabled the check on the feature gate. However, given that coroutine layout data is keyed individually by their DefId, I think it is still safe to allow code to link to each other even when the feature gate status varies among the crates.

@dingxiangfei2009 dingxiangfei2009 force-pushed the move-upvars-to-locals-for-tests branch from 9603ad6 to 3a1e04a Compare February 9, 2025 20:19
@dingxiangfei2009 dingxiangfei2009 force-pushed the move-upvars-to-locals-for-tests branch from 3a1e04a to 61d4bbd Compare February 9, 2025 23:12
@eholk
Contributor

eholk commented Feb 12, 2025

I don't personally have any means of performance testing this at the moment. It would be much easier if it landed behind a feature gate.

Would this be better as a #[feature(...)] gate, or as -Z new_coroutine_layout? I think the compiler flag feels like a better fit for something like this.

@oli-obk
Contributor

oli-obk commented Feb 13, 2025

Are there any issues if only one crate activates it but others do not? If there are no issues, a feature gate seems ok (and easier to use ^^)

@bors
Collaborator

bors commented Feb 14, 2025

☔ The latest upstream changes (presumably #137030) made this pull request unmergeable. Please resolve the merge conflicts.

@Dirbaio
Contributor

Dirbaio commented Feb 15, 2025

A feature doesn't allow turning it on for the whole build, you'd have to fork every single crate that uses async. A -Z flag would be better IMO.

@tmandry
Member

tmandry commented Feb 18, 2025

Agreed on a -Z flag being better for testing for the reason @Dirbaio gave.

If my understanding is correct, we shouldn't expect any regression from this approach (only upside), but since we currently rely on later passes eliding copies there might be some regression. We could be more aggressive in eliding the copies ourselves, but maybe this is hard.

@dingxiangfei2009
Contributor Author

Thanks for looking into this!

I will have time this week to clean this up a bit and I will ask rustbot to set it to ready-for-review.

@dingxiangfei2009 dingxiangfei2009 force-pushed the move-upvars-to-locals-for-tests branch from 61d4bbd to 0ff7e65 Compare March 9, 2025 22:29
@rustbot
Collaborator

rustbot commented Mar 9, 2025

Some changes occurred in compiler/rustc_codegen_ssa

cc @WaffleLapkin

@traviscross
Contributor

@craterbot
Collaborator

👌 Experiment pr-135527-4 created and queued.
🤖 Automatically detected try build 7c7ac84
🔍 You can check out the queue and this experiment's details.

ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem. Learn more

@craterbot craterbot added S-waiting-on-crater Status: Waiting on a crater run to be completed. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 5, 2025
@craterbot
Collaborator

🚧 Experiment pr-135527-4 is now running

@craterbot
Collaborator

🎉 Experiment pr-135527-4 is completed!
📊 11 regressed and 0 fixed (29 total)
📊 14 spurious results on the retry-regressed-list.txt, consider a retry [1] if this is a significant amount.
📰 Open the summary report.

⚠️ If you notice any spurious failure please add them to the denylist!
Footnotes

  1. re-run the experiment with crates=https://crater-reports.s3.amazonaws.com/pr-135527-4/retry-regressed-list.txt

@craterbot craterbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-crater Status: Waiting on a crater run to be completed. labels Oct 5, 2025
@traviscross
Contributor

@craterbot run p=1 mode=build-and-test crates=https://crater-reports.s3.amazonaws.com/pr-135527-4/retry-regressed-list.txt

@craterbot
Collaborator

👌 Experiment pr-135527-5 created and queued.
🤖 Automatically detected try build 7c7ac84
🔍 You can check out the queue and this experiment's details.

@craterbot craterbot added S-waiting-on-crater Status: Waiting on a crater run to be completed. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 5, 2025
@craterbot
Collaborator

🚧 Experiment pr-135527-5 is now running

@craterbot
Collaborator

🎉 Experiment pr-135527-5 is completed!
📊 10 regressed and 1 fixed (25 total)
📊 13 spurious results on the retry-regressed-list.txt, consider a retry [1] if this is a significant amount.
📰 Open the summary report.

⚠️ If you notice any spurious failure please add them to the denylist!
Footnotes

  1. re-run the experiment with crates=https://crater-reports.s3.amazonaws.com/pr-135527-5/retry-regressed-list.txt

@craterbot craterbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-crater Status: Waiting on a crater run to be completed. labels Oct 5, 2025
@dingxiangfei2009
Contributor Author

I am checking some of the crates: their tests are somewhat flaky and they are not using async anyway.

... and treat coroutine upvar captures as saved locals as well.

This allows the liveness analysis to determine which captures are truly
saved across a yield point and which are initially used but discarded at
first yield points.

In the event that upvar captures are promoted, most certainly because
a coroutine suspends at least once, the slots in the promotion prefix
shall be reused. This means that the copies emitted in the upvar
relocation MIR pass will eventually be elided in the codegen
phase, hence no additional runtime cost is incurred.

Additional MIR dumps are inserted so that it is easier to inspect the
bodies of async closures, including those that capture the state
by-value.

Debug information is updated to point at the correct location for upvars
in borrow checking errors and final debuginfo.

A language change that this patch enables is now actually reverted,
so that lifetimes on relocated upvars are invariant with the upvars outside
of the coroutine body.
We are deferring the language change to a later discussion.

Co-authored-by: Dario Nieuwenhuis <[email protected]>

Signed-off-by: Xiangfei Ding <[email protected]>
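
The distinction the commit message draws between captures that are truly saved across a yield point and captures that are only used before the first yield can be probed from user code. The sketch below is an illustration under assumptions: `used_before_yield`, `used_after_yield` and `future_sizes` are hypothetical names, and the sizes reported depend on the compiler and on the layout flag in use.

```rust
use std::future::ready;
use std::hint::black_box;
use std::mem::size_of_val;

const N: usize = 1024;

/// The captured buffer is only read before the first suspension point.
async fn used_before_yield(buf: [u8; N]) -> usize {
    let len = black_box(&buf).len();
    ready(()).await;
    len
}

/// The captured buffer is read after the suspension point, so it has to be
/// kept in the coroutine state while the future is suspended.
async fn used_after_yield(buf: [u8; N]) -> usize {
    ready(()).await;
    black_box(&buf).len()
}

/// Returns the sizes of the two futures, in the order declared above.
fn future_sizes() -> (usize, usize) {
    let before = used_before_yield([0; N]);
    let after = used_after_yield([0; N]);
    (size_of_val(&before), size_of_val(&after))
}
```

Both futures must hold the buffer at least in their initial state, so neither can be smaller than `N` bytes. Any gap between the two numbers hints at how much extra space the saved copy costs under the layout in use.
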
@dingxiangfei2009 dingxiangfei2009 force-pushed the move-upvars-to-locals-for-tests branch from 073de09 to f135989 Compare October 8, 2025 14:55
@rustbot
Collaborator

rustbot commented Oct 8, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@dingxiangfei2009
Contributor Author

@rustbot ready

  • I fixed a test that has to do with how internal field types of a coroutine state machine should be treated in the eyes of the borrow checker; basically the relocated upvars are actually external

r? oli-obk

Maybe cc @RalfJung as well. I am not sure if the changes to MIR would interfere with the aggressive optimisation project. Since the last review, the MIR transformation pass is completely in StateTransform itself. We do not need to track information across any other MIR optimisation pass anymore.

@rustbot rustbot assigned oli-obk and unassigned BoxyUwU Oct 8, 2025
@rustbot
Collaborator

rustbot commented Oct 8, 2025

oli-obk is not on the review rotation at the moment.
They may take a while to respond.

@dingxiangfei2009
Contributor Author

Tagging @oli-obk for review because I was advised that for MIR related changes I should best approach you as well.

@cjgillot cjgillot self-assigned this Oct 8, 2025
@RalfJung
Member

RalfJung commented Oct 8, 2025

I don't know much about coroutine / closure lowering and don't have the time to dig into it right now. Apart from Oli, I am not sure who our main experts for that are (other than @eddyb who's not really around any more)... @compiler-errors @cjgillot @lcnr ?

Contributor

@cjgillot cjgillot left a comment


I haven't read everything yet. These are my first comments.

self.show_mutating_upvar(tcx, id.expect_local(), the_place_err, err);
}
}

Contributor

Can this change land independently?


/// This map `A -> B` allows later MIR passes, error reporters
/// and layout calculator to relate saved locals `A` sourced from upvars
/// and locals `B` that upvars are moved into.
Contributor

Could you elaborate on the comment a bit? What exactly is relocated_upvars[A]?

storage_conflicts,
relocated_upvars,
pack: PackCoroutineLayout::CapturesOnly,
}
Contributor

Could you simplify this function? You don't need to iterate over upvar_tys or upvar_saved_locals when there is only one entry.

#[inline]
fn prefix_tys(self) -> &'tcx List<Ty<'tcx>> {
self.upvar_tys()
}
Contributor

To ease review and git blame, do you mind splitting this change into its own commit?

Comment on lines +205 to +208
if !self.0 {
debug!("relocate upvar is set to no-op");
return;
}
Contributor

Should this test happen outside the call to run?

&self,
tcx: TyCtxt<'tcx>,
body: &mut Body<'tcx>,
local_upvar_map: &mut IndexVec<FieldIdx, Option<Local>>,
Contributor

Why the Option?

enlarged_storage_conflicts
} else {
storage_conflicts
};
Contributor

Could this happen inside locals_live_across_suspend_points?

// EMIT_MIR_FOR_EACH_PANIC_STRATEGY

// EMIT_MIR coroutine_relocate_upvars.main-{closure#0}.RelocateUpvars.before.mir
// EMIT_MIR coroutine_relocate_upvars.main-{closure#0}.RelocateUpvars.after.mir
Contributor

Great idea! Could we have RelocateUpvars.diff only?

|| {
x = String::new();
yield;
};
Contributor

Do you mind adding a case where _1 contains some upvars by value?

test_async_drop([AsyncInt(1), AsyncInt(2)], 104).await;
test_async_drop((AsyncInt(3), AsyncInt(4)), 120).await;
test_async_drop(5, 16).await;
test_async_drop(Int(0), [16, 24][cfg!(classic) as usize]).await;
Contributor

Why is the classic case regressing?

let mut new_decl = LocalDecl::new(ty, span);
if immutable {
new_decl = new_decl.immutable();
}
Contributor

Is this still useful?

Default,
Decodable,
Encodable
)]
Contributor

Nit: split in two derive calls.

span,
Some((source_info.span, from_awaited_ty)),
));
}
Contributor

Is this still useful?

patch.apply(body);

// Manually patch so that prologue is the new entry-point
let preds = body.basic_blocks.predecessors()[START_BLOCK].clone();
Contributor

Actually, @tmiasko pointed out that the start block cannot have incoming edges. So we can simplify this and just add the statements to it.
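
The simplification suggested here can be shown on a toy control-flow graph. This is a sketch under assumptions: `Block`, `prepend_prologue` and `example` are hypothetical stand-ins with statements held as plain strings, and none of this is the `rustc_middle` MIR API.

```rust
/// Toy stand-in for a MIR basic block; the real compiler types differ.
#[derive(Debug, Default)]
struct Block {
    statements: Vec<String>,
    successors: Vec<usize>,
}

const START_BLOCK: usize = 0;

/// Prepends `prologue` to the entry block in place instead of creating a new
/// entry block and redirecting predecessors to it.
fn prepend_prologue(blocks: &mut [Block], prologue: Vec<String>) {
    // Invariant relied upon: no block jumps back to the entry block, so
    // statements added here run exactly once, on entry.
    assert!(blocks.iter().all(|b| !b.successors.contains(&START_BLOCK)));
    let mut statements = prologue;
    statements.append(&mut blocks[START_BLOCK].statements);
    blocks[START_BLOCK].statements = statements;
}

/// Builds a two-block graph and returns the entry block after patching.
fn example() -> Vec<String> {
    let mut blocks = vec![
        Block { statements: vec!["_3 = use(_2)".into()], successors: vec![1] },
        Block { statements: vec!["return".into()], successors: vec![] },
    ];
    prepend_prologue(&mut blocks, vec!["_2 = move (_1.0)".into()]);
    blocks[START_BLOCK].statements.clone()
}
```

Because the entry block has no predecessors, no edges need to be rewritten and the block indices stay stable.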

Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-lang-radar Items that are on lang's radar and will need eventual work or consideration. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-clippy Relevant to the Clippy team. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.