KVM: return null state instead of Disconnected when investigate a host without NFS #10515
Conversation
sureshanaparti
left a comment
clgtm
@blueorangutan package
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10515      +/-   ##
============================================
- Coverage     15.17%   15.16%   -0.01%
+ Complexity    11332    11328       -4
============================================
  Files          5414     5414
  Lines        474802   474802
  Branches      57909    57909
============================================
- Hits          72028    72008      -20
- Misses       394718   394742      +24
+ Partials       8056     8052       -4
@blueorangutan package
@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12686
@blueorangutan test
@rohityadavcloud a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-12603)
kiranchavala
left a comment
LGTM. Verified the fix manually with the following steps:
- Create a CloudStack environment with 2 hosts and no NFS primary storage.
- On one of the KVM hosts, configure and enable HA.
- Add a firewall rule that drops packets on port 8250:
iptables -I OUTPUT -p tcp -m tcp --dport 8250 -j DROP
- Check the management server logs.
Before the fix, CloudStack does not run the HypervInvestigator, VMwareInvestigator, or PingInvestigator:
2025-03-06 13:36:30,022 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Investigating why host Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"} has disconnected with event PingTimeout
2025-03-06 13:36:30,023 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) checking if agent (Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}) is alive
2025-03-06 13:36:30,025 DEBUG [c.c.a.t.Request] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Sending { Cmd , MgmtId: 32986892337576, via: 1(ref-trl-8094-k-mol8-kiran-chavala-kvm1), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-03-06 13:37:10,041 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Waiting some more time because this is the current command
2025-03-06 13:37:10,041 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Waiting some more time because this is the current command
2025-03-06 13:37:10,042 WARN [c.c.a.m.AgentAttache] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Timed out on Seq 1-8864491441548689460: { Cmd , MgmtId: 32986892337576, via: 1(ref-trl-8094-k-mol8-kiran-chavala-kvm1), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-03-06 13:37:10,047 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Cancelling.
2025-03-06 13:37:10,047 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Operation timed out: Commands 8864491441548689460 to Host 1 timed out after 100
2025-03-06 13:37:10,067 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) SimpleInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:37:10,067 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) XenServerInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:37:10,083 WARN [c.c.h.KVMInvestigator] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Agent investigation was requested on host Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}, but host does not support investigation because it has no NFS storage. Skipping investigation.
2025-03-06 13:37:10,083 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) KVMInvestigator was able to determine host Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"} is in Disconnected
2025-03-06 13:37:10,083 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) The agent from host Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"} state determined is Disconnected
2025-03-06 13:37:10,083 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Agent is disconnected but the host is still up: Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"} state: Enabled
After the fix, CloudStack runs the HypervInvestigator, VMwareInvestigator, and PingInvestigator:
[root@ol8 ~]# cat /var/log/cloudstack/management/management-server.log |grep -i "logid:b39c7f05"
2025-03-06 13:08:59,485 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Investigating why host Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"} has disconnected with event PingTimeout
2025-03-06 13:08:59,485 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) checking if agent (Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}) is alive
2025-03-06 13:08:59,487 DEBUG [c.c.a.t.Request] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Sending { Cmd , MgmtId: 32987949302884, via: 2(ref-trl-8087-k-mol8-kiran-chavala-kvm2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-03-06 13:09:49,487 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Waiting some more time because this is the current command
2025-03-06 13:10:39,487 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Waiting some more time because this is the current command
2025-03-06 13:10:39,488 WARN [c.c.a.m.AgentAttache] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Timed out on Seq 2-5748563449361727501: { Cmd , MgmtId: 32987949302884, via: 2(ref-trl-8087-k-mol8-kiran-chavala-kvm2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-03-06 13:10:39,488 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Cancelling.
2025-03-06 13:10:39,489 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Operation timed out: Commands 5748563449361727501 to Host 2 timed out after 100
2025-03-06 13:10:39,491 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) SimpleInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,491 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) XenServerInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,494 WARN [c.c.h.KVMInvestigator] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Agent investigation was requested on host Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}, but host does not support investigation because it has no NFS storage. Skipping investigation.
2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) KVMInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) HypervInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) VMwareInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,495 DEBUG [c.c.h.UserVmDomRInvestigator] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) checking if agent (Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}) is alive
2025-03-06 13:10:39,496 DEBUG [c.c.h.UserVmDomRInvestigator] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) sending ping from (Host {"id":1,"name":"ol8.localdomain","type":"Routing","uuid":"c0fd498b-e0ff-433c-a68d-698a982a5f6f"}) to agent's host ip address (10.0.35.136)
2025-03-06 13:10:39,497 DEBUG [c.c.a.t.Request] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 1-728457239727181052: Sending { Cmd , MgmtId: 32987949302884, via: 1(ol8.localdomain), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"10.0.35.136","wait":"20","bypassHostMaintenance":"false"}}] }
2025-03-06 13:10:39,511 DEBUG [c.c.a.t.Request] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 1-728457239727181052: Received: { Ans: , MgmtId: 32987949302884, via: 1(ol8.localdomain), Ver: v1, Flags: 10, { Answer } }
2025-03-06 13:10:39,512 DEBUG [c.c.h.AbstractInvestigatorImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) host (10.0.35.136) has been successfully pinged, returning that host is up
2025-03-06 13:10:39,512 DEBUG [c.c.h.UserVmDomRInvestigator] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) ping from (Host {"id":1,"name":"ol8.localdomain","type":"Routing","uuid":"c0fd498b-e0ff-433c-a68d-698a982a5f6f"}) to agent's host ip address (10.0.35.136) successful, returning that agent is disconnected
2025-03-06 13:10:39,512 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) PingInvestigator was able to determine host Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"} is in Disconnected
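To compare the two runs without reading every line, the investigator outcomes can be pulled out of the log. The following is a minimal sketch, assuming the message wording seen in the logs above; the class `InvestigatorLogSummary` and its method are hypothetical helpers and not part of CloudStack, and the sample lines in `main` are abbreviated (`{...}` stands for the host JSON).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: summarizes which investigators ran and what each concluded.
final class InvestigatorLogSummary {
    // Wording copied from the HighAvailabilityManagerImpl messages in the logs above.
    private static final Pattern UNDETERMINED =
            Pattern.compile("(\\w+Investigator) unable to determine the state of the host");
    private static final Pattern DETERMINED =
            Pattern.compile("(\\w+Investigator) was able to determine host .* is in (\\w+)");

    // Returns investigator name -> "undetermined" or the reported state, in log order.
    static Map<String, String> summarize(List<String> lines) {
        Map<String, String> outcomes = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher determined = DETERMINED.matcher(line);
            if (determined.find()) {
                outcomes.put(determined.group(1), determined.group(2));
                continue;
            }
            Matcher undetermined = UNDETERMINED.matcher(line);
            if (undetermined.find()) {
                outcomes.put(undetermined.group(1), "undetermined");
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        // Abbreviated sample lines modeled on the "after fix" log.
        List<String> afterFix = List.of(
                "DEBUG [c.c.h.HighAvailabilityManagerImpl] KVMInvestigator unable to determine the state of the host. Moving on.",
                "DEBUG [c.c.h.HighAvailabilityManagerImpl] PingInvestigator was able to determine host Host {...} is in Disconnected");
        System.out.println(summarize(afterFix));
    }
}
```

Run against the "before fix" log, such a summary would end at KVMInvestigator; run against the "after fix" log, it would continue through to PingInvestigator.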
Great, thanks @kiranchavala for testing!
Description
Currently, when a KVM host has no NFS storage, KVMInvestigator reports it as Disconnected during agent/VM investigation, so the remaining investigators are never run.
This PR makes KVMInvestigator return a null state in that case, so the remaining investigators are run.
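The effect of returning null can be illustrated with a simplified model of an investigator chain. This is only a sketch under stated assumptions: the types `HostStatus`, `NamedInvestigator`, and `InvestigatorChain` are made up for illustration and are not CloudStack's actual classes. The only behavior taken from the logs is that a null result means "unable to determine, moving on" and that the first non-null result ends the investigation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative types only; these are not CloudStack's real classes.
enum HostStatus { UP, DOWN, DISCONNECTED }

final class NamedInvestigator {
    final String name;
    // Returns a status, or null when the investigator cannot determine one.
    final Function<String, HostStatus> check;

    NamedInvestigator(String name, Function<String, HostStatus> check) {
        this.name = name;
        this.check = check;
    }
}

final class InvestigatorChain {
    // Asks each investigator in order; the first non-null answer wins.
    // Names of the investigators that were consulted are appended to 'consulted'.
    static HostStatus investigate(List<NamedInvestigator> chain, String host, List<String> consulted) {
        for (NamedInvestigator investigator : chain) {
            consulted.add(investigator.name);
            HostStatus status = investigator.check.apply(host);
            if (status != null) {
                return status;
            }
        }
        return null; // nobody could determine the state
    }

    // Builds a two-step chain where the KVM step answers with the given verdict
    // for a host without NFS (assumption: DISCONNECTED before the fix, null after).
    static List<NamedInvestigator> chain(HostStatus kvmVerdictWithoutNfs) {
        return List.of(
                new NamedInvestigator("KVMInvestigator", host -> kvmVerdictWithoutNfs),
                new NamedInvestigator("PingInvestigator", host -> HostStatus.DISCONNECTED));
    }

    public static void main(String[] args) {
        List<String> before = new ArrayList<>();
        investigate(chain(HostStatus.DISCONNECTED), "kvm1", before);
        System.out.println("before fix, consulted: " + before);

        List<String> after = new ArrayList<>();
        investigate(chain(null), "kvm1", after);
        System.out.println("after fix, consulted: " + after);
    }
}
```

In this model the final status is Disconnected in both cases; what changes is that the later investigators get a chance to run instead of the KVM step short-circuiting the chain.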
How Has This Been Tested?
Below is an example of the investigation process with this PR. On the KVM host, I added a firewall rule to drop packets to port 8250 of the management server.
