docs: explain cpu masking and migration behavior #429
thomas-dkmt wants to merge 1 commit into xcp-ng:master
Signed-off-by: Thomas Moraine <thomas.moraine@vates.tech>
| ## Live Migration Behavior | ||
| Live migration checks rely on the CPU features the VM was assigned at boot. Before migration, XCP-ng confirms the destination host supports all the VM’s required features. If even one is missing, the migration is blocked. |
What do you mean by "blocked"? I think the migration will fail, right?
A cross-pool migration is blocked before it is even attempted. But I think some intra-pool migrations can fail (see my comment just above), unless they're blocked too 🤔
| CPU masking is **how XCP-ng controls which CPU features are visible to virtual machines**. | ||
| Since modern CPUs offer a wide range of features — and not all hosts in a pool support the same set — exposing unsupported features could make live migration unsafe. Without masking, a VM might end up on a host that lacks the CPU capabilities it requires, leading to crashes or unpredictable behavior. |
> and not all hosts in a pool support the same set
Actually, it is recommended that they do. It's when they differ that CPU masking becomes a question.
| To prevent this, XCP-ng uses Xen to present each VM with a consistent, safe set of CPU features. From the VM's perspective, the CPU appears virtualized (it may not perfectly match the host's physical CPU). Certain features are hidden (or *masked*) to ensure compatibility across all hosts in the pool. | ||
| CPU masking isn’t about performance. **It’s a safety mechanism**. Its primary role is to enable smooth migration while preventing guest crashes or unpredictable behavior. |
| CPU masking isn’t about performance. **It’s a safety mechanism**. Its primary role is to enable smooth migration while preventing guest crashes or unpredictable behavior. | |
| CPU masking isn't about performance. **It's a safety mechanism**. Its primary role is to enable smooth migration while preventing guest crashes or unpredictable behavior. |
We don't use this kind of quote, right? There may be more of them in the document.
| :::tip | ||
| Because of this, **CPU masking can't be adjusted on the fly**. Removing features from a running VM would almost certainly cause it to crash. | ||
| Even a simple reboot won’t update the CPU features if the VM’s state is preserved. To apply any changes, you must fully shut down the VM and restart it. |
At this stage of the document, I don't understand why this mention is relevant. As a reader, I'm wondering "why would I want to change the CPU features of the running VM? Is that even possible?"
| ## Pool-level Masking | ||
| In XCP-ng, **CPU masking works at the pool level**. The pool exposes a shared set of CPU features — the lowest common denominator across all hosts. |
I, and several other developers, tend to avoid the — character in technical markdown documents. I know they're correct, more than a simple dash (-) but they more and more scream "I used AI to write or reword my document" (even if that's not the case), and I think that's not a message we want to send in our technical docs.
| When a new host joins the pool, its CPU features are compared against those of existing members: | ||
| - If the new host has a **newer CPU**, its extra features are masked to align with the pool level. | ||
| - If the new host has an **older CPU**, the pool level drops, and some features are masked for all hosts in the pool. |
This is where I expected what I think is the most important thing to know about CPU masking. I actually thought that it was the point of the document, at first (the rest being transparent and not supposed to cause any issues).
When the CPU features of an existing pool are reduced, running VMs are still using the previous set of CPU flags. Which means that we now have a host in the pool that is not suitable for receiving the running VMs during a live migration.
CCing @xcp-ng/xapi-network so that they can confirm what the actual behaviour is when attempting to migrate a previously running VM just after the pool CPU features are reduced by adding an older host to it.
AFAIK, the VM crashes. But maybe XAPI is smart and refuses to migrate to the new pool member until the features the VM was started with are reduced (by stopping and restarting the VM).
XAPI is smart and will refuse to migrate a VM that is using features the destination host doesn't have. One can override this with force=true (really not recommended because, yes, the VM will crash).
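To make the behaviour described above concrete, here is a minimal Python sketch of that pre-migration check. It is an illustration only, not XAPI's actual code: the `Host`, `RunningVM`, `missing_features`, and `check_migration` names are hypothetical, CPU features are modelled as plain sets of flag names, and the `force` argument merely mirrors the `force=true` override mentioned above.

```python
from dataclasses import dataclass
from typing import FrozenSet


class MigrationRefused(Exception):
    """Raised when the destination lacks features the running VM uses."""


@dataclass(frozen=True)
class Host:
    # Hypothetical model: a host is a name plus the CPU features it offers.
    name: str
    features: FrozenSet[str]


@dataclass(frozen=True)
class RunningVM:
    # Features are fixed when the VM starts and do not change while it runs.
    name: str
    boot_features: FrozenSet[str]


def missing_features(vm: RunningVM, host: Host) -> FrozenSet[str]:
    """Features the VM was started with that the host cannot provide."""
    return vm.boot_features - host.features


def check_migration(vm: RunningVM, destination: Host, force: bool = False) -> FrozenSet[str]:
    """Refuse the migration if any boot-time feature is missing.

    With force=True the check is bypassed and the missing features are
    returned instead, standing in for the "it will crash" case.
    """
    missing = missing_features(vm, destination)
    if missing and not force:
        raise MigrationRefused(
            f"{vm.name} cannot move to {destination.name}: "
            f"missing {', '.join(sorted(missing))}"
        )
    return missing


# Usage: a VM started before an older host joined the pool.
vm = RunningVM("vm1", frozenset({"sse4_2", "avx", "avx2"}))
older_host = Host("older-host", frozenset({"sse4_2", "avx"}))
try:
    check_migration(vm, older_host)
except MigrationRefused as exc:
    print(exc)
```

In this model, restarting the VM after the pool level has dropped would give it a smaller `boot_features` set, and the same check would then pass.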
| Since the pool level adjusts dynamically, changes in pool membership directly impact migration: | ||
| - **Adding a host with fewer features** may lower the pool baseline. Running VMs keep their original feature set but might lose the ability to migrate to the new host (though they can still move to other compatible hosts). | ||
| - **Removing a host with older features** can raise the pool baseline. New VMs will benefit immediately, but existing ones must be fully shut down and restarted to adopt the updated features. |
This doesn't look related to live migration.
| ## Migrating Across CPU Vendors | ||
| Cross-vendor VM migration — such as moving from Intel to AMD — presents unique challenges. Live migration isn’t feasible in these cases, as the CPU architectures and feature sets differ fundamentally. |
Add that it's simply not supported and that we'll refuse to move the VMs in such scenarios? (if that's true)
| This guide explains how CPU masking works in XCP-ng and its impact on virtual machine migration, particularly in environments with mixed CPU architectures or generations. | ||
| The goal is to help you understand why some migrations succeed while others fail, what compatibility guarantees XCP-ng provides at the pool level, and how to safely migrate your VMs during hardware upgrades. This applies whether you're moving across different CPU vendors or between generations of the same vendor. |
> The goal is to help you understand why some migrations succeed while others fail

I haven't found information about this in the document. As written, it tells me which migrations are allowed and which aren't. However, there may be scenarios where migrations actually fail due to CPU masking issues. If that's the case, they should be detailed.
It will be a valuable addition when finalized, but I'm uncomfortable with the current size of the document, its structure, and some inaccuracies. The forum posts are far shorter. They may lack some structure and prerequisites (it's good to explain what CPU masking is as an introduction, as you have done), but my overall feeling is that they were more to the point. I don't know if that's a consequence of AI tools (they do tend to be verbose) or of the way you think the information is best conveyed (pedagogy sometimes requires expanding on a topic), but either way I suggest that we:
- Make sure the information is accurate (the XAPI team can help us).
- Define what we want to tell users and why, and build around this.
Whether a new guide in docs/guide is the right location is also debatable. That's not the most organized place, and to me the more we can expand the thematic sections, the better. Here for example, main concepts may be relevant to docs about VMs in general, or to the "compute" side of things. But maybe the documentation structure itself is a topic for another time.
A more general note: on some topics, and I think this is one of them, reviewing a big document that raises various concerns takes a lot of time (as you can tell from how long this review took). It's not easy to review a document that is, say, a 6/10: not bad, objectively useful in some way, but not as good as it could have been with a more iterative approach. It's also hard to steer it in the right direction without being a tyrant who dismisses your hard work. We would be more efficient by starting with an intention note:
- What topic I want to address
- What are the key points
- Here's the structure I envision
- Here's where I intend to add this in the documentation
Then the XCP-ng team could review that and help you before it's written, not after. How about trying this in the future?
| This method ensures a smooth transition, maintaining compatibility and performance. | ||
| ## Practical recommendations |
This whole section is a bit awkward. It contains a big tip callout and some sort of general conclusion that brings nothing new and even looks like an introduction.
| @@ -0,0 +1,96 @@ | |||
| # CPU Masking | |||
I don't think "masking" is the word we want to use here (note that it's the word used by the user asking the question on the forum, not anyone else):
- "CPU masks" already has another meaning: hard/soft affinity, where one sets which cores the vCPUs are allowed to use. I'd rather reserve "mask" for that usage.
- The new way of doing "CPU masking" is called "CPU feature levelling": Xen/XAPI automatically calculate a level that best serves a particular purpose, so it's no longer about manually "masking" something out.
If "CPU masking" is a term of art in the VMware/Proxmox world, then I guess it could be useful to note that somewhere, along with how we (potentially) differ from their approach.
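As an illustration of the "levelling" idea, the sketch below computes a pool level as the intersection of the feature sets of all hosts, which is the "lowest common denominator" the guide describes. It is a simplified model under stated assumptions: `pool_level` and `masked_per_host` are hypothetical helper names, and features are represented as Python sets of flag names rather than the CPUID data that Xen and XAPI actually operate on.

```python
from typing import Dict, FrozenSet, Iterable


def pool_level(host_features: Dict[str, Iterable[str]]) -> FrozenSet[str]:
    """Return the features common to every host (empty for an empty pool)."""
    sets = [frozenset(features) for features in host_features.values()]
    if not sets:
        return frozenset()
    return frozenset.intersection(*sets)


def masked_per_host(host_features: Dict[str, Iterable[str]]) -> Dict[str, FrozenSet[str]]:
    """For each host, list the features hidden from guests by the pool level."""
    level = pool_level(host_features)
    return {
        name: frozenset(features) - level
        for name, features in host_features.items()
    }


# Usage: a newer host and an older host in the same pool.
hosts = {
    "host-a": {"sse4_2", "avx", "avx2"},
    "host-b": {"sse4_2", "avx"},
}
print(sorted(pool_level(hosts)))             # ['avx', 'sse4_2']
print(sorted(masked_per_host(hosts)["host-a"]))  # ['avx2']
```

Adding an older host can only shrink the intersection and removing one can only grow it, which matches the two bullet points quoted from the guide.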
This pull request enriches the XCP-ng documentation with a new guide that explains CPU masking. We define the concept, describe how it's used in XCP-ng, and give some practical advice for adding or removing hosts and migrating VMs in the context of different CPU architectures.
The guide mostly leverages information shared in this discussion on the XCP-ng forum.