KVM HA not functioning #12139

jpt1624 · 2025-11-17T22:37:07Z

problem

Hello, I am having issues with getting HA to function with my two KVM hosts. The cluster, the two hosts, and a test virtual machine each have HA enabled.

The KVM hosts have OOB management configured using ipmitool.

For testing, I have a virtual machine with an HA supported policy running on KVM-02. I power off KVM-02 abruptly to see if the virtual machine will automatically migrate over to KVM-01.

What occurs is the following:

The CloudStack management server marks KVM-02 as Disconnected:

Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} has the status [Disconnected].

KVM-02 is then set to the DOWN status (supposedly):

{"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} has the status [Down].

KVM-01 then checks connectivity with KVM-02, which also returns a status of DOWN:

Neighbouring Host {"id":43,"name":"kvm-01","type":"Routing","uuid":"025ccefd-1696-43c9-9a2c-e045968d2efa"} returned status [Down] for the investigated Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"}.

The shared storage volume mounted onto KVM-02 is checked for any recent writes:

Checking VM activity for Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} on storage pool [StoragePool {"id":48,"name":"Cloud-KVM-SSD-01","poolType":"NetworkFilesystem","uuid":"f8e97832-44a9-3031-aa1d-0acfc9e32648"}].

Host {"id":46,"name":"kvm-02","type":"Routing","uuid":"d3b323d6-e3bb-4d06-917a-75fba36a5adf"} does not have activity on storage pool [StoragePool {"id":48,"name":"Cloud-KVM-SSD-01","poolType":"NetworkFilesystem","uuid":"f8e97832-44a9-3031-aa1d-0acfc9e32648"}]

While these checks are running, the API still reports the status of KVM-02 as UP:

After about 10-15 minutes, KVM-02 progresses to the ALERT state. I am not sure why it takes so many attempts, because we have configured this condition for 5 checks:

At the ALERT state, the HA task now tries to power off KVM-02 (I assume to prevent split brain before moving the virtual machines over):

This command fails because KVM-02 is already OFF.
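
One way to make a fencing step tolerate a host that is already powered off is to query the chassis state first and treat "already off" as success. The sketch below only illustrates that idea and is not how CloudStack's OOB driver is implemented. `parse_power_state`, `ipmi`, and `fence` are hypothetical helpers, the BMC address and credentials are placeholders, and the status text ("Chassis Power is off") is assumed to match what `ipmitool chassis power status` prints on this hardware.

```python
import subprocess


def parse_power_state(status_output: str) -> str:
    """Map ipmitool's status line to 'on', 'off', or 'unknown'."""
    text = status_output.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return "unknown"


def ipmi(bmc_host: str, user: str, password: str, *args: str) -> str:
    """Run ipmitool over the lanplus interface and return its stdout."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_host,
           "-U", user, "-P", password, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def fence(bmc_host: str, user: str, password: str, run=ipmi) -> bool:
    """Return True once the host is confirmed off, powering it off if needed."""
    status_args = ("chassis", "power", "status")
    state = parse_power_state(run(bmc_host, user, password, *status_args))
    if state == "off":
        return True  # already off: the fencing goal is met, nothing to do
    if state == "on":
        run(bmc_host, user, password, "chassis", "power", "off")
        # Re-check once; real hardware may need a short wait before this reads "off".
        state = parse_power_state(run(bmc_host, user, password, *status_args))
    return state == "off"
```

The `run` parameter exists so the decision logic can be exercised without a real BMC by passing a stand-in function.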

CloudStack keeps trying to power KVM-02 off until I manually issue the OOB power up command. CloudStack's power off command then succeeds, and KVM-02 is finally marked as DOWN:

Here are the investigators configured:

versions

CloudStack: 4.22.0.0
KVM-01: 4.22.0.0
KVM-02: 4.22.0.0

The steps to reproduce the bug

Create KVM cluster
Assign two KVM hosts under cluster
Enable HA for cluster and KVM hosts
Configure OOB management for KVM hosts
Create test VM under a KVM host with HA supported policy
Assign SimpleInvestigator, PingInvestigator, and KVMInvestigator under HA investigators order.
Power off KVM host abruptly to simulate failure scenario.
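
The investigator order listed in the steps above can be pictured as a chain in which each investigator either gives a definite answer or abstains, and the first definite answer wins. The sketch below is a simplified model under that assumption, not CloudStack's actual code. The `Status` enum, the `investigate` function, and the lambda investigators are hypothetical stand-ins.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class Status(Enum):
    UP = "Up"
    DOWN = "Down"


# An investigator returns a Status, or None when it cannot decide.
Investigator = Callable[[str], Optional[Status]]


def investigate(host: str, chain: Sequence[Investigator]) -> Optional[Status]:
    """Ask each investigator in order; the first definite answer wins."""
    for check in chain:
        verdict = check(host)
        if verdict is not None:
            return verdict
    return None  # no investigator could decide


# Hypothetical stand-ins: the first two abstain, the third reports Down.
simple = lambda host: None
ping = lambda host: None
kvm = lambda host: Status.DOWN

result = investigate("kvm-02", [simple, ping, kvm])
print(result)  # Status.DOWN
```

In this model the order matters only when more than one investigator can give a definite answer for the same host.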

What to do about it?

I am not sure whether my configuration is incorrect or there is an underlying issue. The behavior is confusing. Please let me know if I can provide anything else.

Thanks!

2025-11-17T22:37:10Z

boring-cyborg[bot]
bot Nov 17, 2025

Thanks for opening your first issue here! Be sure to follow the issue template!

0 replies

DaanHoogland · 2025-11-26T14:04:38Z

Collaborator

@jpt1624, I think you are confusing host-HA (which reboots a host if needed and possible) with VM-HA (which restarts a VM when it can reliably detect that the VM is down). The crux is that when a host is “down”, it cannot be reliably determined that the VM is down. The host might merely be disconnected while the VM is still able to write to its disk. This is why a VM will not be restarted “just” because the host is not reachable. #splitbrain-scenario
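
The reasoning above can be condensed into a small decision rule: an unreachable host alone is not enough, and a restart is only safe once something proves the VM can no longer write to its disk. This is an illustrative sketch, not CloudStack's implementation. `HostEvidence` and `safe_to_restart_vm` are hypothetical names, and the two kinds of proof used here (a confirmed power-off, or an explicit report that the VM is stopped) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostEvidence:
    agent_reachable: bool      # management server can talk to the host agent
    power_confirmed_off: bool  # e.g. out-of-band management reports the chassis is off
    vm_reported_stopped: bool  # the hypervisor itself reported the VM as stopped


def safe_to_restart_vm(evidence: HostEvidence) -> bool:
    """Restart elsewhere only when the VM provably cannot touch its disk.

    agent_reachable is deliberately ignored: losing contact with the host
    says nothing about whether the VM is still running on it.
    """
    return evidence.vm_reported_stopped or evidence.power_confirmed_off


# Host unreachable, but nothing proves the VM stopped: do not restart.
print(safe_to_restart_vm(HostEvidence(False, False, False)))  # False
# Host unreachable and confirmed powered off: restart is safe.
print(safe_to_restart_vm(HostEvidence(False, True, False)))   # True
```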

7 replies

DaanHoogland Nov 27, 2025
Collaborator

@chunkyen, you can test VM-HA by destroying the VM directly at the hypervisor level.

jpt1624 Nov 27, 2025
Author

Thanks for the replies! I did test with VM HA only (Host HA disabled) for a bit after I posted this thread. I powered off a KVM host and watched to see if the VM would migrate. The host was set to DOWN and the VM attempted to restart on the second host but ran into an error due to NFSv3 locking. I'm working on updating the shared storage to NFSv4 at the moment.

Is Host HA not supposed to be used with VM HA? From what I have gathered, Host HA introduces a fencing algorithm to help VM HA safely restart VMs on another host. From my testing with VM HA, I saw the KVMInvestigator check the shared storage status to conclude that the KVM host was actually down before attempting migration. Do I even need Host HA then? I haven't had any luck getting Host HA to restart a KVM host.

DaanHoogland Nov 28, 2025
Collaborator

One of the problems with VM-HA is knowing for sure that the VM is not running. Host-HA solves part of that problem. NFS locking would be a separate problem, and I am not sure host-HA is of any help there.
As to the question "Is Host HA not supposed to be used with VM HA?":
They were not designed in conjunction. I know some operators/architects even advise against it. Any opinion @rajujith?

rajujith Nov 28, 2025
Collaborator

Yes @DaanHoogland, based on the experience, VM HA and Host HA do not work together well. For most use cases, just use VM HA (enable HA on the offering). Regarding NFS locking, either avoid locking or use NFSv4, where we haven't seen this issue.

mattmonday88 Dec 1, 2025

> Yes @DaanHoogland based on the experience, VM HA and Host HA do not work together well. For most use cases, just use VM HA ( enable HA on offering). Regarding NFS locking, either avoid locking or use NFSv4, where we haven't seen this issue.

Interesting. It seems there is no real consensus then. This article seems to indicate they are used in conjunction.

https://www.shapeblue.com/host-ha-for-kvm-hosts-in-cloudstack/

Even here - https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA - the statement is "For KVM HOST HA to work effectively it has to work in tandem with the existing VM HA framework"

We want to move forward with a configuration that is reliable and works, but the documentation and guidance here seem to contradict each other.
