@phip1611 phip1611 commented Nov 25, 2025

This extends the internal API with a vm_progress function and adds a vm.migration-progress HTTP endpoint, including support in ch-remote (via ch-remote migration-progress), to query the latest migration progress.

The two major prerequisites were:

  • the internal API was made async
  • the HTTP thread was made async

Now, it is possible to get information about ongoing migrations. The most interesting part is the pre-copy phase. The first version is rather coarse-grained with one update per memory iteration. More to follow.

Steps Before Merge

  • test it locally using ch-remote
  • add libvirt-tests testcase and verify everything works
  • deploy it on a node in SAP land and see if it works

@phip1611 phip1611 self-assigned this Nov 25, 2025
@phip1611 phip1611 force-pushed the poc-migration-statistics branch from e29356d to 4101b11 Compare November 28, 2025 08:46
        /// The final memory transmission info.
        memory_info: MemoryTransmissionInfo,
    },
    Failed {
Member Author

@phip1611 phip1611 Nov 28, 2025

Currently, I don't like that this phase is part of this enum. Instead, at the top level we should have:

Ongoing(InnerStateA), Cancelled(InnerStateB), Failed(InnerStateC)
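
The top-level shape proposed above could look roughly like the following sketch. The inner types `InnerStateA`, `InnerStateB`, and `InnerStateC` are placeholders taken from the comment, and their fields as well as the enum name `MigrationState` are assumptions made for illustration; none of them are types from this PR.

```rust
/// Placeholder payload for an ongoing migration (fields are illustrative).
#[derive(Clone, Debug, PartialEq)]
pub struct InnerStateA {
    /// Current phase of the migration, e.g. "pre-copy".
    pub phase: String,
}

/// Placeholder payload for a cancelled migration.
#[derive(Clone, Debug, PartialEq)]
pub struct InnerStateB;

/// Placeholder payload for a failed migration.
#[derive(Clone, Debug, PartialEq)]
pub struct InnerStateC {
    /// Stringified error.
    pub error_msg: String,
}

/// Hypothetical top-level state: the phase lives only inside `Ongoing`.
#[derive(Clone, Debug, PartialEq)]
pub enum MigrationState {
    Ongoing(InnerStateA),
    Cancelled(InnerStateB),
    Failed(InnerStateC),
}

impl MigrationState {
    /// Returns the phase, which only exists while the migration is ongoing.
    pub fn phase(&self) -> Option<&str> {
        match self {
            Self::Ongoing(inner) => Some(inner.phase.as_str()),
            Self::Cancelled(_) | Self::Failed(_) => None,
        }
    }
}
```

With this layout, a cancelled or failed migration cannot carry a phase at all, so callers do not have to handle a meaningless combination.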

pub memory_bytes_transmitted: u64,
pub memory_pages_4k_transmitted: u64,
pub memory_bytes_remaining_iteration: u64,
pub memory_bytes_remaining_total: u64,

qemu also reports the dirty rate. Do we want to report this too?

what's also missing is some sort of status (migration in-progress, migration-finished, ...)

Member Author

@phip1611 phip1611 Nov 28, 2025

qemu also reports the dirty rate. Do we want to report this too?

100%, sure. I just forgot it here.

migration in-progress, migration-finished

I have it already, but I want to refactor it to make it more prominent. See my comment #43 (review)

@phip1611 phip1611 force-pushed the poc-migration-statistics branch 5 times, most recently from f49e577 to fdf5858 Compare December 9, 2025 11:42
let clear = matches
    .subcommand_matches("migration-progress")
    .unwrap()
    .get_one::<bool>("clear")
Member Author

I think we can completely nuke that clear complexity - there is no problem with keeping the old state around. As soon as a new migration starts, the state is overwritten anyway
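
As a rough illustration of why no explicit clear step is needed, the sketch below keeps the latest snapshot in a single slot that a new migration simply overwrites. The names `ProgressSlot`, `ProgressSnapshot`, and `migration_id` are hypothetical and do not come from this PR.

```rust
use std::sync::Mutex;

/// Hypothetical, heavily simplified progress snapshot.
#[derive(Clone, Debug, PartialEq)]
pub struct ProgressSnapshot {
    /// Illustrative identifier of the migration this snapshot belongs to.
    pub migration_id: u64,
    pub finished: bool,
}

/// Holds the snapshot of the most recent migration, if there was one.
pub struct ProgressSlot {
    latest: Mutex<Option<ProgressSnapshot>>,
}

impl ProgressSlot {
    pub fn new() -> Self {
        Self { latest: Mutex::new(None) }
    }

    /// Starting a new migration overwrites whatever old state is still around.
    pub fn start_migration(&self, migration_id: u64) {
        let mut guard = self.latest.lock().expect("progress lock poisoned");
        *guard = Some(ProgressSnapshot { migration_id, finished: false });
    }

    /// Marks the current migration as finished; the state stays queryable.
    pub fn mark_finished(&self) {
        let mut guard = self.latest.lock().expect("progress lock poisoned");
        if let Some(snapshot) = guard.as_mut() {
            snapshot.finished = true;
        }
    }

    /// Returns a copy of the latest snapshot; `None` before any migration.
    pub fn latest(&self) -> Option<ProgressSnapshot> {
        self.latest.lock().expect("progress lock poisoned").clone()
    }
}
```

In this sketch, the finished state of the first migration remains readable until a second migration starts, at which point it is replaced.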

}

#[derive(Clone, Deserialize, Serialize, Debug)]
pub struct VmMigrationProgressData {
Member Author

I think we can completely nuke that clear complexity - there is no problem with keeping the old state around. As soon as a new migration starts, the state is overwritten anyway

@phip1611 phip1611 force-pushed the poc-migration-statistics branch from f612cf6 to e9a3321 Compare December 15, 2025 14:24
@phip1611 phip1611 force-pushed the poc-migration-statistics branch from e9a3321 to ecb5f45 Compare January 8, 2026 14:31
@phip1611 phip1611 changed the base branch from gardenlinux-v48 to gardenlinux January 8, 2026 14:32
@phip1611 phip1611 force-pushed the poc-migration-statistics branch 3 times, most recently from e6c80dd to fe8cd0d Compare January 12, 2026 16:30
@phip1611 phip1611 marked this pull request as ready for review January 12, 2026 16:30
@phip1611 phip1611 changed the title from "WIP XXX Migration Statistics" to "Add Support for Migration Statistics API Call" Jan 12, 2026
vm.complete_migration()?;

// Give management software a chance to fetch the migration state.
info!("Sleeping five seconds before shutting off.");
Member Author

TODO remove this before merge

info!("Sleeping five seconds before shutting off.");
// TODO right now, the http-server is single-threaded and the blocking
// start-migration API call will block other requests here.
thread::sleep(Duration::from_secs(5));
Member Author

TODO remove this before merge

@phip1611 phip1611 marked this pull request as draft January 12, 2026 16:35
@phip1611 phip1611 force-pushed the poc-migration-statistics branch from fe8cd0d to a87fe95 Compare January 12, 2026 16:41
@phip1611 phip1611 requested a review from tpressure January 12, 2026 16:42
@phip1611 phip1611 force-pushed the poc-migration-statistics branch from a87fe95 to 6d68c0c Compare January 12, 2026 16:43
/// [live-migration protocol]: super::protocol
#[derive(Clone, Debug, serde::Serialize, serde::Deserialize)]
pub struct MigrationProgressAndStatus {
/// UNIX timestamp of the start of the live-migration process.
Member Author

Please note that this structure will be public API forever; once we deploy it, it will be hard to change.

///
/// [live-migration protocol]: super::protocol
#[derive(Clone, Debug, serde::Serialize, serde::Deserialize)]
pub struct MigrationProgressAndStatus {
Member Author

Should I add a version parameter here, similar to live migration, which also has (or will have?) a version?

I think that would be reasonable.
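
One way such a version parameter could be carried is sketched below. The constant `MIGRATION_PROGRESS_FORMAT_VERSION`, the struct `VersionedProgress`, its fields, and the helper `check_version` are all assumptions for illustration; the real structure in this PR derives serde traits and has different fields.

```rust
/// Hypothetical format version of the public progress structure.
pub const MIGRATION_PROGRESS_FORMAT_VERSION: u32 = 1;

/// Simplified stand-in for the public progress structure.
#[derive(Clone, Debug, PartialEq)]
pub struct VersionedProgress {
    /// Format version, so that consumers can detect incompatible layouts.
    pub version: u32,
    /// Illustrative payload field.
    pub memory_bytes_transmitted: u64,
}

/// Consumer-side check: reject data written in an unknown format version.
pub fn check_version(progress: &VersionedProgress) -> Result<(), String> {
    if progress.version == MIGRATION_PROGRESS_FORMAT_VERSION {
        Ok(())
    } else {
        Err(format!(
            "unsupported progress format version {} (expected {})",
            progress.version, MIGRATION_PROGRESS_FORMAT_VERSION
        ))
    }
}
```

A management layer that reads the structure could run such a check first and fall back to ignoring the data instead of misinterpreting it.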

@phip1611 phip1611 force-pushed the poc-migration-statistics branch from 6d68c0c to c934e0e Compare January 13, 2026 10:36
On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
This extends the internal API for an `vm_migration` API call to query
information about an ongoing live-migration. This is a crucial feature
to enable production-ready live-migration at large-scale deployments
with corresponding monitoring.

On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
This adds the HTTP endpoint to export ongoing VM live-migration
progress.

This work was made possible because of the following fundamental
prerequisites:
- internal API was made async
- http thread was made async

This way, one can send requests to fetch the latest state without
blocking anywhere.

On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
The first version has the limitation that we populate the latest
snapshot once per memory iteration, although this is the most
interesting part by far. In a follow-up, we can make this more
fine-grained.

On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
Signed-off-by: Philipp Schuster <[email protected]>
@phip1611 phip1611 force-pushed the poc-migration-statistics branch from c934e0e to 78699eb Compare January 13, 2026 10:42
@olivereanderson olivereanderson left a comment

Thanks for your work on this!

I wish the PR description, or some other form of documentation, explained the intended use cases for this feature.

Is the main use case debugging slow/hanging live migrations, or is this intended to be used to collect metrics for every live migration which will then be used to provide statistical insights? If it is the latter, it would be good to document (somewhere) the compatibility story with standard metrics services like Prometheus. I think @snue knows a thing or two about such topics 🙂

The metrics crate seems to be the metrics analogue of log/tracing and it would be good to know why that isn't a good fit for the task(s) addressed by this PR. I am assuming that there is some context I am missing and that it is quite possible that I have some misconceptions about the problems you are trying to address.

Otherwise I would just like to say that the code looks good and it was easy to follow the implementation for the most part.

Comment on lines +45 to +49
pub memory_bytes_remaining_iteration: u64,
/// The amount of transmitted 4k pages.
pub memory_pages_4k_transmitted: u64,
/// The amount of remaining 4k pages for this iteration.
pub memory_pages_4k_remaining_iteration: u64,

Question: What about larger pages?

Member Author

@phip1611 phip1611 Jan 14, 2026

The VMM and KVM transparently split huge pages for a migration. However, I'm not sure whether this also applies to read-only pages. We should discuss this again with Thomas.

Comment on lines +86 to +115
pub enum MigrationProgressState {
    /// The migration has been cancelled.
    Cancelled {
        /// The latest memory transmission info, if any.
        memory_transmission_info: MemoryTransmissionInfo,
    },
    /// The migration has failed.
    Failed {
        /// The last memory transmission info, if any.
        memory_transmission_info: MemoryTransmissionInfo,
        /// Stringified error.
        error_msg: String,
        /// Debug-stringified error.
        error_msg_debug: String,
        // TODO this is very tricky because I need clone()
        // error: Box<dyn Error>,
    },
    /// The migration has finished successfully.
    Finished {
        /// The last memory transmission info, if any.
        memory_transmission_info: MemoryTransmissionInfo,
    },
    /// The migration is ongoing.
    Ongoing {
        phase: MigrationPhase,
        memory_transmission_info: MemoryTransmissionInfo,
        /// Percent in range `0..=100`.
        vcpu_throttle_percent: u8,
    },
}

I see that every variant has a field of type MemoryTransmissionInfo. Maybe it would make sense to take that out of the enum?
You could do that either with a wrapper struct, or just work with a tuple e.g.:

pub type MigrationProgressInfo = (MigrationProgressState, MemoryTransmissionInfo);

Member Author

Let me think about this again!
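
The wrapper-struct variant of the suggestion could be sketched as follows. The wrapper name `MigrationProgress` is hypothetical, and both `MemoryTransmissionInfo` and the enum variants are reduced to a few fields for brevity (for example, `phase` and `error_msg_debug` are left out).

```rust
/// Simplified stand-in for the PR's `MemoryTransmissionInfo`.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct MemoryTransmissionInfo {
    pub memory_bytes_transmitted: u64,
    pub memory_bytes_remaining_iteration: u64,
}

/// State enum without the shared `MemoryTransmissionInfo` field.
#[derive(Clone, Debug, PartialEq)]
pub enum MigrationProgressState {
    Cancelled,
    Failed { error_msg: String },
    Finished,
    Ongoing { vcpu_throttle_percent: u8 },
}

/// Hypothetical wrapper that stores the shared data exactly once.
#[derive(Clone, Debug, PartialEq)]
pub struct MigrationProgress {
    pub state: MigrationProgressState,
    pub memory_transmission_info: MemoryTransmissionInfo,
}

impl MigrationProgress {
    /// No match over all variants is needed anymore.
    pub fn memory_transmission_info(&self) -> MemoryTransmissionInfo {
        self.memory_transmission_info
    }

    /// Throttling only applies while the migration is ongoing.
    pub fn vcpu_throttle_percent(&self) -> Option<u8> {
        match &self.state {
            MigrationProgressState::Ongoing { vcpu_throttle_percent } => {
                Some(*vcpu_throttle_percent)
            }
            _ => None,
        }
    }
}
```

Compared to the tuple alias, a named struct keeps the field names visible at call sites, at the cost of one more public type.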

Comment on lines +118 to +137
fn memory_transmission_info(&self) -> MemoryTransmissionInfo {
    match self {
        MigrationProgressState::Cancelled {
            memory_transmission_info,
            ..
        } => *memory_transmission_info,
        MigrationProgressState::Failed {
            memory_transmission_info,
            ..
        } => *memory_transmission_info,
        MigrationProgressState::Finished {
            memory_transmission_info,
            ..
        } => *memory_transmission_info,
        MigrationProgressState::Ongoing {
            memory_transmission_info,
            ..
        } => *memory_transmission_info,
    }
}

See my comment above with regards to taking MemoryTransmissionInfo out of the enum.

Member Author

Let me think about this again! I iterated multiple times over the design. It might also make sense to move that out of the enum here, yes.

Comment on lines +149 to +155
match self {
    MigrationProgressState::Ongoing {
        vcpu_throttle_percent,
        ..
    } => Some(*vcpu_throttle_percent),
    _ => None,
}

I think if let would be more idiomatic here. Something like this:

Suggested change
- match self {
-     MigrationProgressState::Ongoing {
-         vcpu_throttle_percent,
-         ..
-     } => Some(*vcpu_throttle_percent),
-     _ => None,
- }
+ if let Self::Ongoing { vcpu_throttle_percent, .. } = self {
+     Some(*vcpu_throttle_percent)
+ } else {
+     None
+ }

fn current_unix_timestamp_ms() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()

Consider using expect instead of unwrap in this case.
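
A minimal sketch of what the `expect` variant could look like is shown below. The diff excerpt above is cut off after `.unwrap()`, so the conversion to milliseconds as a `u64` is an assumption based on the function name, and the panic messages are only suggestions.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Returns the current UNIX timestamp in milliseconds.
fn current_unix_timestamp_ms() -> u64 {
    let since_epoch = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        // `duration_since` only fails if the clock is set before 1970-01-01.
        .expect("system clock should not be set before the UNIX epoch");
    // Assumed conversion: `as_millis` yields a u128, so narrow it explicitly.
    u64::try_from(since_epoch.as_millis())
        .expect("timestamp in milliseconds should fit into a u64")
}
```

The `expect` messages document the invariant the code relies on, which makes a panic much easier to diagnose than a bare `unwrap`.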

The commit message contains:

an vm_migration API call

but it should be "a vm_migration API call".

Comment on lines +322 to +323
/// The progress of a possibly ongoing live migration.
VmMigrationProgress(Box<Option<MigrationProgressAndStatus>>),

I guess the None variant will only ever be present if the API is called before a live migration has started?

Member Author

yes

{
    Ok(info) => {
        let mut response = Response::new(Version::Http11, StatusCode::OK);
        let info_serialized = serde_json::to_string(&info).unwrap();

Nit: It would be good to either replace unwrap with expect or to add a comment explaining why it is OK to unwrap in this case.

Comment on lines +2249 to +2259
if let Some(snapshot) = lock.as_ref() {
    match snapshot.state {
        MigrationProgressState::Ongoing { .. } => {
            // if this fails, we made a programming error in our state handling
            panic!("migration already ongoing");
        }
        MigrationProgressState::Cancelled { .. } => {}
        MigrationProgressState::Failed { .. } => {}
        MigrationProgressState::Finished { .. } => {}
    }
}

Nit: Something like this might be simpler:

Suggested change
- if let Some(snapshot) = lock.as_ref() {
-     match snapshot.state {
-         MigrationProgressState::Ongoing { .. } => {
-             // if this fails, we made a programming error in our state handling
-             panic!("migration already ongoing");
-         }
-         MigrationProgressState::Cancelled { .. } => {}
-         MigrationProgressState::Failed { .. } => {}
-         MigrationProgressState::Finished { .. } => {}
-     }
- }
+ if lock.as_ref().is_some_and(|snapshot| matches!(snapshot.state, MigrationProgressState::Ongoing { .. })) {
+     // if this fails, we made a programming error in our state handling
+     panic!("migration already ongoing");
+ }
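
For reference, the `is_some_and` plus `matches!` pattern can be tried in isolation with the sketch below. The types are simplified stand-ins (the name `Snapshot` and the reduced variants are assumptions), and the check returns a `bool` instead of panicking so that it is easy to exercise.

```rust
use std::sync::Mutex;

/// Simplified stand-in for the PR's state enum.
pub enum MigrationProgressState {
    Cancelled,
    Failed,
    Finished,
    Ongoing { vcpu_throttle_percent: u8 },
}

/// Simplified stand-in for the stored progress snapshot.
pub struct Snapshot {
    pub state: MigrationProgressState,
}

/// Returns true only if a snapshot exists and its state is `Ongoing`.
pub fn migration_is_ongoing(lock: &Mutex<Option<Snapshot>>) -> bool {
    let guard = lock.lock().expect("progress lock poisoned");
    let ongoing = guard
        .as_ref()
        .is_some_and(|snapshot| matches!(snapshot.state, MigrationProgressState::Ongoing { .. }));
    ongoing
}
```

Note that `matches!` is applied to `snapshot.state`, not to the snapshot itself, and that `Option::is_some_and` requires Rust 1.70 or newer.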
