-
Notifications
You must be signed in to change notification settings - Fork 2
Add Support for Migration Statistics API Call #43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: gardenlinux
Are you sure you want to change the base?
Add Support for Migration Statistics API Call #43
Conversation
e29356d to
4101b11
Compare
| /// The final memory transmission info. | ||
| memory_info: MemoryTransmissionInfo, | ||
| }, | ||
| Failed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently I don't like that this phase is part of this enum. Instead, on the top level we should have
Ongoing(InnerStateA), Cancelled(InnerStateB), Failed(InnerStateC)
vm-migration/src/progress.rs
Outdated
| pub memory_bytes_transmitted: u64, | ||
| pub memory_pages_4k_transmitted: u64, | ||
| pub memory_bytes_remaining_iteration: u64, | ||
| pub memory_bytes_remaining_total: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qemu also reports the dirty rate. Do we want to report this too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's also missing is some sort of status (migration in-progess, migration-finished, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emu also reports the dirty rate. Do we want to report this too?
100%, sure. I just forgot it here
migration in-progess, migration-finished
I have it already, but I want to refactor it to make it more promiment. See my comment #43 (review)
f49e577 to
fdf5858
Compare
src/bin/ch-remote.rs
Outdated
| let clear = matches | ||
| .subcommand_matches("migration-progress") | ||
| .unwrap() | ||
| .get_one::<bool>("clear") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can completely nuke that clear complexity - there is no problem with keeping the old state around. As soon as a new migration starts, the state is overwritten anyway
vmm/src/api/mod.rs
Outdated
| } | ||
|
|
||
| #[derive(Clone, Deserialize, Serialize, Debug)] | ||
| pub struct VmMigrationProgressData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can completely nuke that clear complexity - there is no problem with keeping the old state around. As soon as a new migration starts, the state is overwritten anyway
f612cf6 to
e9a3321
Compare
e9a3321 to
ecb5f45
Compare
e6c80dd to
fe8cd0d
Compare
| vm.complete_migration()?; | ||
|
|
||
| // Give management software a chance to fetch the migration state. | ||
| info!("Sleeping five seconds before shutting off."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO remove this before merge
| info!("Sleeping five seconds before shutting off."); | ||
| // TODO right now, the http-server is single-threaded and the blocking | ||
| // start-migration API call will block other requests here. | ||
| thread::sleep(Duration::from_secs(5)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO remove this before merge
fe8cd0d to
a87fe95
Compare
a87fe95 to
6d68c0c
Compare
| /// [live-migration protocol]: super::protocol | ||
| #[derive(Clone, Debug, serde::Serialize, serde::Deserialize)] | ||
| pub struct MigrationProgressAndStatus { | ||
| /// UNIX timestamp of the start of the live-migration process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please note that this structure will be public API for ever and once we reploy it, it will be hard to change
| /// | ||
| /// [live-migration protocol]: super::protocol | ||
| #[derive(Clone, Debug, serde::Serialize, serde::Deserialize)] | ||
| pub struct MigrationProgressAndStatus { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I add a version parameter here? Similar to live migration which also has (or will have?) a version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be reasonable.
6d68c0c to
c934e0e
Compare
On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
This extends the internal API for an `vm_migration` API call to query information about an ongoing live-migration. This is a crucial feature to enable production-ready live-migration at large-scale deployments with corresponding monitoring. On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
This adds the HTTP endpoint to export ongoing VM live-migration progress. This work was made possible because of the following fundamental prerequisites: - internal API was made async - http thread was made async This way, one can send requests to fetch the latest state without blocking anywhere. On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
The first version has the limitation that we populate the latest snapshot once per memory iteration, although this is the most interesting part by far. In a follow-up, we can make this more fine-grained. On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected] Signed-off-by: Philipp Schuster <[email protected]>
c934e0e to
78699eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on this!
I wish the PR description, or some other form of documentation, explained the intended use cases for this feature.
Is the main use case debugging slow/hanging live migrations, or is this intended to be used to collect metrics for every live migration which will then be used to provide statistical insights? If it is the latter, it would be good to document (somewhere) the compatibility story with standard metrics services like Prometheus. I think @snue kowns a thing or two about such topics 🙂
The metrics crate seems to be the metrics analogue of log/tracing and it would be good to know why that isn't a good fit for the task(s) addressed by this PR. I am assuming that there is some context I am missing and that it is quite possible that I have some misconceptions about the problems you are trying to address.
Otherwise I would just like to say that the code looks good and it was easy to follow the implementation for the most part.
| pub memory_bytes_remaining_iteration: u64, | ||
| /// The amount of transmitted 4k pages. | ||
| pub memory_pages_4k_transmitted: u64, | ||
| /// The amount of remaining 4k pages for this iteration. | ||
| pub memory_pages_4k_remaining_iteration: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: What about larger pages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VMM + KVM transparently splits huge-pages for a migration. I'm however not sure if this also applies to read-only pages. We should discuss this again with Thomas
| pub enum MigrationProgressState { | ||
| /// The migration has been cancelled. | ||
| Cancelled { | ||
| /// The latest memory transmission info, if any. | ||
| memory_transmission_info: MemoryTransmissionInfo, | ||
| }, | ||
| /// The migration has failed. | ||
| Failed { | ||
| /// The last memory transmission info, if any. | ||
| memory_transmission_info: MemoryTransmissionInfo, | ||
| /// Stringified error. | ||
| error_msg: String, | ||
| /// Debug-stringified error. | ||
| error_msg_debug: String, | ||
| // TODO this is very tricky because I need clone() | ||
| // error: Box<dyn Error>, | ||
| }, | ||
| /// The migration has finished successfully. | ||
| Finished { | ||
| /// The last memory transmission info, if any. | ||
| memory_transmission_info: MemoryTransmissionInfo, | ||
| }, | ||
| /// The migration is ongoing. | ||
| Ongoing { | ||
| phase: MigrationPhase, | ||
| memory_transmission_info: MemoryTransmissionInfo, | ||
| /// Percent in range `0..=100`. | ||
| vcpu_throttle_percent: u8, | ||
| }, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that every variant has a field of type MemoryTransmissionInfo. Maybe it would make sense to take that out of the enum?
You could do that either with a wrapper struct, or just work with a tuple e.g.:
pub type MigrationProgressInfo = (MigrationProgressState, MigrationProgressInfo)`;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me think about this again!
| fn memory_transmission_info(&self) -> MemoryTransmissionInfo { | ||
| match self { | ||
| MigrationProgressState::Cancelled { | ||
| memory_transmission_info, | ||
| .. | ||
| } => *memory_transmission_info, | ||
| MigrationProgressState::Failed { | ||
| memory_transmission_info, | ||
| .. | ||
| } => *memory_transmission_info, | ||
| MigrationProgressState::Finished { | ||
| memory_transmission_info, | ||
| .. | ||
| } => *memory_transmission_info, | ||
| MigrationProgressState::Ongoing { | ||
| memory_transmission_info, | ||
| .. | ||
| } => *memory_transmission_info, | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above with regards to taking MemoryTransmissionInfo out of the enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me think about this again! I iterates multiple times over the design. It might also makes sense to move that out of the enum at this place, yes
| match self { | ||
| MigrationProgressState::Ongoing { | ||
| vcpu_throttle_percent, | ||
| .. | ||
| } => Some(*vcpu_throttle_percent), | ||
| _ => None, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if let would be more idiomatic here. Something like this:
| match self { | |
| MigrationProgressState::Ongoing { | |
| vcpu_throttle_percent, | |
| .. | |
| } => Some(*vcpu_throttle_percent), | |
| _ => None, | |
| } | |
| if let Self::Ongoing{vcpu_throttle_percent, ..} = self { | |
| Some(*vcpu_throttle_percent) | |
| } else { | |
| None | |
| } |
| fn current_unix_timestamp_ms() -> u64 { | ||
| SystemTime::now() | ||
| .duration_since(UNIX_EPOCH) | ||
| .unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using expect instead of unwrap in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit message contains:
an
vm_migrationAPI call
but it should be "a vm_migration API call".
| /// The progress of a possibly ongoing live migration. | ||
| VmMigrationProgress(Box<Option<MigrationProgressAndStatus>>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the None variant will only ever be present if the API is called before a live migration has started?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
| { | ||
| Ok(info) => { | ||
| let mut response = Response::new(Version::Http11, StatusCode::OK); | ||
| let info_serialized = serde_json::to_string(&info).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It would be good to either replace unwrap with expect or to add a comment explaining why it is OK to unwrap in this case.
| if let Some(snapshot) = lock.as_ref() { | ||
| match snapshot.state { | ||
| MigrationProgressState::Ongoing { .. } => { | ||
| // if this fails, we made a programming error in our state handling | ||
| panic!("migration already ongoing"); | ||
| } | ||
| MigrationProgressState::Cancelled { .. } => {} | ||
| MigrationProgressState::Failed { .. } => {} | ||
| MigrationProgressState::Finished { .. } => {} | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Something like this might be simpler:
| if let Some(snapshot) = lock.as_ref() { | |
| match snapshot.state { | |
| MigrationProgressState::Ongoing { .. } => { | |
| // if this fails, we made a programming error in our state handling | |
| panic!("migration already ongoing"); | |
| } | |
| MigrationProgressState::Cancelled { .. } => {} | |
| MigrationProgressState::Failed { .. } => {} | |
| MigrationProgressState::Finished { .. } => {} | |
| } | |
| } | |
| if lock.as_ref().is_some_and(|snapshot| matches!(snapshot, MigrationProgressState::Ongoing{..}) { | |
| // if this fails, we made a programming error in our state handling | |
| panic!("migration already ongoing"); | |
| } |
This extends the internal API with a
vm_progressfunction and adds avm.migration-progressHTTP endpoint including support inch-remoteviach-remote migration-processto query the latest migration progress.The two major pre-requisites were:
Now, it is possible to get information about ongoing migrations. The most interesting part is the pre-copy phase. The first version is rather coarse-grained with one update per memory iteration. More to follow.
Steps Before Merge
ch-remote