-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Direct agents rebalance improvements with multiple management server nodes #10674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Direct agents rebalance improvements with multiple management server nodes #10674
Conversation
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 4.20 #10674 +/- ##
============================================
+ Coverage 16.13% 16.14% +0.01%
- Complexity 13220 13230 +10
============================================
Files 5651 5651
Lines 496740 496747 +7
Branches 60183 60184 +1
============================================
+ Hits 80148 80213 +65
+ Misses 407674 407600 -74
- Partials 8918 8934 +16
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12989 |
|
@blueorangutan test matrix |
|
@sureshanaparti a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests |
weizhouapache
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code lgtm
|
[SF] Trillian Build Failed (tid-12922) |
|
[SF] Trillian test result (tid-12923)
|
|
[SF] Trillian test result (tid-12924)
|
|
[SF] Trillian test result (tid-12925)
|
|
@blueorangutan test matrix |
|
@sureshanaparti a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests |
|
[SF] Trillian Build Failed (tid-12962) |
|
[SF] Trillian Build Failed (tid-12960) |
|
[SF] Trillian test result (tid-12961)
|
|
[SF] Trillian test result (tid-12963)
|
|
@blueorangutan test ol8 vmware-70u3 |
|
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-70u3) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-12973)
|
c9bb527 to
f398273
Compare
|
@blueorangutan package |
|
@sureshanaparti can you check failures |
|
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13157 |
|
[SF] Trillian test result (tid-13109)
|
shwstppr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code lgtm
f398273 to
bb44e18
Compare
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13211 |
|
@blueorangutan test |
|
@rohityadavcloud a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-13185)
|
…nodes Sometimes hypervisor hosts (direct agents) stuck with Disconnect state during agent rebalancing activity across multiple management server nodes. This issue was noticed during frequent restart of the management server nodes in the cluster. When there are multiple management server nodes in a cluster, if one or more nodes are shutdown/start/restart, CloudStack will rebalance the hosts among the remaining nodes or move the nodes to the newly joined management server nodes. During the rebalancing period multiple operations could happen including: - DirectAgentScan at interval of configured direct.agent.scan.interval - AgentRebalanceScan to identify and schedule rebalance agents - TransferAgentScan to transfer the host from original owner to future owner **Current Rebalance behavior** 1. For hosts that have AgentAttache && not forForward but in Disconnect state, CloudStack simply ignore these hosts without trying to ping again or update the status of the host. 2. For hosts that have AgentAttache && forForward, CloudStack removes the agent but still try to loadDirectlyConnectedHost. **Improved Rebalance behavior** During DirectAgentScan: scanDirectAgentToLoad(), identify hosts that for self-managed hosts that are in Disconnect state (disconnected after pingtimeout). 1. For hosts that have AgentAttache and is forForward, CloudStack should remove the agent 2. For hosts that have AgentAttache and is not forForward but in Disconnect state, CloudStack should try to investigate and update the status to Up if host is pingable. 3. For hosts that don't have AgentAttache, CloudStack should try to loadDirectlyConnectedHost.
bb44e18 to
cfa0120
Compare
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13257 |
|
@sureshanaparti can you add some test description for those that want to try to break the change? |
|
@blueorangutan test |
|
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-13213)
|
borisstoyanov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…nodes (apache#10674) Sometimes hypervisor hosts (direct agents) stuck with Disconnect state during agent rebalancing activity across multiple management server nodes. This issue was noticed during frequent restart of the management server nodes in the cluster. When there are multiple management server nodes in a cluster, if one or more nodes are shutdown/start/restart, CloudStack will rebalance the hosts among the remaining nodes or move the nodes to the newly joined management server nodes. During the rebalancing period multiple operations could happen including: - DirectAgentScan at interval of configured direct.agent.scan.interval - AgentRebalanceScan to identify and schedule rebalance agents - TransferAgentScan to transfer the host from original owner to future owner **Current Rebalance behavior** 1. For hosts that have AgentAttache && not forForward but in Disconnect state, CloudStack simply ignore these hosts without trying to ping again or update the status of the host. 2. For hosts that have AgentAttache && forForward, CloudStack removes the agent but still try to loadDirectlyConnectedHost. **Improved Rebalance behavior** During DirectAgentScan: scanDirectAgentToLoad(), identify hosts that for self-managed hosts that are in Disconnect state (disconnected after pingtimeout). 1. For hosts that have AgentAttache and is forForward, CloudStack should remove the agent 2. For hosts that have AgentAttache and is not forForward but in Disconnect state, CloudStack should try to investigate and update the status to Up if host is pingable. 3. For hosts that don't have AgentAttache, CloudStack should try to loadDirectlyConnectedHost.
Description
This PR improves the current direct agents rebalancing activity. Sometimes hypervisor hosts (direct agents) stuck with Disconnect state during agent rebalancing activity across multiple management server nodes. This issue was noticed during frequent restart of the management server nodes in the cluster.
When there are multiple management server nodes in a cluster, if one or more nodes are shutdown/start/restart, CloudStack will rebalance the hosts among the remaining nodes or move the nodes to the newly joined management server nodes. During the rebalancing period multiple operations could happen including:
Current Rebalance behavior
Improved Rebalance behavior
During DirectAgentScan: scanDirectAgentToLoad(), identify hosts that for self-managed hosts that are in Disconnect state (disconnected after pingtimeout).
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Restart of management server nodes in cluster and few VMware hypervisor hosts.
How did you try to break this feature and the system with this change?