Skip to content

Comments

lib: migrate overlay manager state explicitly#853

Closed
gjcolombo wants to merge 4 commits intomasterfrom
gjcolombo/overlay-pages-part-2
Closed

lib: migrate overlay manager state explicitly#853
gjcolombo wants to merge 4 commits intomasterfrom
gjcolombo/overlay-pages-part-2

Conversation

@gjcolombo
Copy link
Contributor

@gjcolombo gjcolombo commented Feb 10, 2025

(N.B. Stacked on #851.)

Explicitly migrate the state of a Hyper-V stack's OverlayManager instead of having the stack reapply its own overlays. The manager's OverlaySets are exported and imported more or less as-is, though with some transformations between types for serialization purposes (e.g. Box<u8; PAGE_SIZE> is serialized as Vec<u8>) or for compatibility reasons (enum MigratedOverlayKind is distinct from enum OverlayKind so that the latter can change without rendering the manager unable to recognize overlay kinds from old Propolis versions).

Importing manager state produces a set of OverlayPage registrations that the higher-level Hyper-V import logic can audit for correctness (there should be exactly as many entries as MSRs with active overlays, and they should be associated with the correct PFNs).

Tests: cargo test and PHD; examined the migration logs from an ad hoc migration and verified that the expected overlay state was migrated.

Fixes #850.

@gjcolombo gjcolombo force-pushed the gjcolombo/overlay-pages-part-2 branch from 3be5036 to 33e0de9 Compare February 11, 2025 17:32
@gjcolombo gjcolombo requested review from hawkw and pfmooney February 11, 2025 17:38
@gjcolombo gjcolombo marked this pull request as ready for review February 11, 2025 17:38
@gjcolombo gjcolombo force-pushed the gjcolombo/overlay-pages-part-1 branch from 07e3c11 to 161729b Compare February 11, 2025 22:54
@gjcolombo gjcolombo force-pushed the gjcolombo/overlay-pages-part-2 branch from 33e0de9 to 78057ac Compare February 12, 2025 00:17
Base automatically changed from gjcolombo/overlay-pages-part-1 to master February 14, 2025 18:39
@gjcolombo gjcolombo force-pushed the gjcolombo/overlay-pages-part-2 branch from 78057ac to 2e8e741 Compare February 14, 2025 18:40
@gjcolombo
Copy link
Contributor Author

The first PHD run on the most recent commit hit known test flake #811 (really need to go fix that...); I've asked for a re-run.

Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! i might wait for @pfmooney's ack before merging?

}

#[test]
fn export_import_round_trip() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take it or leave it: this type of "round trip" test always feels like kind of a good candidate for a property test where we generate a bunch of random things and assert they look the same on both sides. but, i'm not sure if it's really going to gain that much from it, and writing the proptest code to generate arbitrary migration payloads might be too much work...just a thought!

Comment on lines +580 to +594
.flat_map(|(pfn, set)| {
std::iter::once(OverlayPage {
manager: Arc::downgrade(manager),
kind: set.active,
pfn: *pfn,
})
.chain(set.pending.keys().map(|kind| {
OverlayPage {
manager: Arc::downgrade(manager),
kind: *kind,
pfn: *pfn,
}
}))
})
.collect();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

glad this worked :)

Comment on lines +740 to +744
pub struct OverlaySetV1 {
pub(super) original_contents: Vec<u8>,
pub(super) active: MigratedOverlayKind,
pub(super) pending: BTreeMap<MigratedOverlayKind, Vec<u8>>,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a high level, I'm not sure I'm wild about migrating the overlay state like this. We're essentially duplicating some state, vs what we have stored for the msr_hypercall data in HyperVEnlightenmentV1 (that a hypercall page is overlaying a given address). There is still the matter of the covered page data, but that could go along with the msr_hypercall bits, in theory (in terms of which part of the migration payload it should be attached to).

That way, it does less in the way of encoding the OverlayManager as part of the migration payload, rather than it being an implementation detail at the time. (Partially thinking about if/when the kernel vmm makes overlay pages less of a hassle.)

Now, this isn't to say that I'm fundamentally opposed to how things are structured here. If you think this is the right direction for all this, I'll defer. I'm happy to chat about trade-offs as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed some of the tradeoffs here offline. A couple of notes from that conversation:

  • The primary thing I wanted to avoid here (if we can help it) is reordering overlays across a migration. In today's code, the first overlay applied to a PFN is always made active, irrespective of its kind; subsequent overlays are then promoted in order by their kind. However, there are other ways to tackle this problem that don't require anyone to save/restore what overlay is currently active (e.g., make the overlay ordering solely a function of the kinds of overlays present for a page).

  • Once ordering concerns are out of the picture, the Hyper-V stack can simply re-create its OverlayPages during import, knowing that it doesn't have to care about creating them in any particular order. Another advantage of this is that it's not necessary to save/restore pending page contents--the stack just needs to remember the parameters it used to create a particular page so that it can create the correct contents again during import.

  • It's almost possible to avoid transferring a GPA's original contents by disabling active overlays when the Hyper-V stack is paused and re-creating them as needed when it's resumed. Then the original page contents will be copied during the RAM transfer phase. The tricky bit with this is that it means Lifecycle callers have to pause vCPUs before they pause device components (otherwise a running CPU might read a page whose overlay is held in abeyance by a paused Hyper-V manager).

I put together a small change that explores some of these ideas. I'll post it as a draft PR shortly.

@gjcolombo
Copy link
Contributor Author

#861 supersedes this. I'll close this PR once that merges and #856 is rebased.

@gjcolombo
Copy link
Contributor Author

856 is rebased, so closing this.

@gjcolombo gjcolombo closed this Feb 21, 2025
@gjcolombo gjcolombo deleted the gjcolombo/overlay-pages-part-2 branch February 25, 2025 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

lib: want better Hyper-V overlay page management

3 participants