Direct agents rebalance improvements with multiple management server nodes #10674

sureshanaparti · 2025-04-09T10:26:04Z

Description

This PR improves the current direct agent rebalancing activity. Hypervisor hosts (direct agents) sometimes get stuck in the Disconnect state during agent rebalancing across multiple management server nodes. The issue was noticed when the management server nodes in the cluster were restarted frequently.

When a cluster has multiple management server nodes and one or more of them are shut down, started, or restarted, CloudStack rebalances the hosts among the remaining nodes or moves hosts to the newly joined management server nodes. During the rebalancing period, several operations can run, including:

DirectAgentScan, run at the interval configured in direct.agent.scan.interval
AgentRebalanceScan, to identify agents for rebalancing and schedule them
TransferAgentScan, to transfer the host from its original owner to its future owner
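
As a rough illustration of what rebalancing hosts among management server nodes can mean, the sketch below computes how many hosts each node holds above an even share. It is only a model under stated assumptions: the class `RebalanceShareSketch`, the method `surplusPerNode`, and the even-share target are hypothetical and are not taken from this PR or from CloudStack's own planner.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model: how many hosts each node would hand over to reach an even split. */
final class RebalanceShareSketch {

    /**
     * Returns, per node, the number of hosts held above an even share
     * (ceiling of total hosts / node count). Nodes at or below the share map to 0.
     * The even-share target is an assumption made for illustration.
     */
    static Map<String, Integer> surplusPerNode(Map<String, Integer> hostsPerNode) {
        Map<String, Integer> surplus = new LinkedHashMap<>();
        int nodes = hostsPerNode.size();
        if (nodes == 0) {
            return surplus;
        }
        int total = 0;
        for (int count : hostsPerNode.values()) {
            total += count;
        }
        int share = (total + nodes - 1) / nodes; // ceiling division
        for (Map.Entry<String, Integer> entry : hostsPerNode.entrySet()) {
            surplus.put(entry.getKey(), Math.max(0, entry.getValue() - share));
        }
        return surplus;
    }

    public static void main(String[] args) {
        // A second node has just joined and owns no hosts yet.
        Map<String, Integer> load = new LinkedHashMap<>();
        load.put("ms1", 6);
        load.put("ms2", 0);
        System.out.println(surplusPerNode(load));
    }
}
```

In this model a newly joined node starts with no hosts, so the existing node shows a surplus that the rebalance and transfer scans would then move over.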

Current Rebalance behavior

For hosts that have an AgentAttache that is not forForward but are in the Disconnect state, CloudStack simply ignores them, without trying to ping them again or update their status.
For hosts that have an AgentAttache that is forForward, CloudStack removes the agent but still tries to loadDirectlyConnectedHost.

Improved Rebalance behavior
During DirectAgentScan, scanDirectAgentToLoad() identifies self-managed hosts that are in the Disconnect state (disconnected after pingtimeout).

For hosts that have an AgentAttache that is forForward, CloudStack should remove the agent.
For hosts that have an AgentAttache that is not forForward but are in the Disconnect state, CloudStack should investigate and update the status to Up if the host is pingable.
For hosts that don't have an AgentAttache, CloudStack should try to loadDirectlyConnectedHost.
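
The three cases above can be read as a small decision function. The sketch below is a minimal model of that decision, not the code changed in ClusteredAgentManagerImpl: the class `DirectAgentScanSketch`, the `Action` enum, and the `decide` method are hypothetical names, and the `NO_ACTION` result for a connected, non-forwarding attache is an assumption, since the description does not cover that case.

```java
/** Hypothetical model of the per-host decision in the improved direct agent scan. */
final class DirectAgentScanSketch {

    /** Illustrative action names that mirror the three cases in the description. */
    enum Action {
        REMOVE_AGENT,
        INVESTIGATE_AND_UPDATE_STATUS,
        LOAD_DIRECTLY_CONNECTED_HOST,
        NO_ACTION // assumption: an attached, connected host needs nothing
    }

    /**
     * @param hasAttache   whether an AgentAttache exists for the host
     * @param forForward   whether that attache is a forwarding attache
     * @param disconnected whether the host is in the Disconnect state
     */
    static Action decide(boolean hasAttache, boolean forForward, boolean disconnected) {
        if (!hasAttache) {
            return Action.LOAD_DIRECTLY_CONNECTED_HOST;
        }
        if (forForward) {
            // Unlike the previous behavior, no load is attempted after removal.
            return Action.REMOVE_AGENT;
        }
        return disconnected ? Action.INVESTIGATE_AND_UPDATE_STATUS : Action.NO_ACTION;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true, true));
        System.out.println(decide(true, false, true));
        System.out.println(decide(false, false, true));
    }
}
```

Modelling the outcome as exactly one action per host makes the difference from the previous behavior explicit: a forwarding attache leads to removal only, without a follow-up attempt to load the host.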

How Has This Been Tested?

Tested by restarting management server nodes in a cluster and a few VMware hypervisor hosts.

How did you try to break this feature and the system with this change?

sureshanaparti · 2025-04-09T10:28:23Z

@blueorangutan package

blueorangutan · 2025-04-09T10:30:03Z

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov · 2025-04-09T10:33:34Z

Codecov Report

Attention: Patch coverage is 34.48276% with 19 lines in your changes missing coverage. Please review.

Project coverage is 16.14%. Comparing base (f6d0590) to head (cfa0120).
Report is 25 commits behind head on 4.20.

Files with missing lines	Patch %	Lines
...cloud/agent/manager/ClusteredAgentManagerImpl.java	34.48%	19 Missing ⚠️
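
The two patch figures in this report are consistent with 29 changed lines, of which 19 are uncovered. The count of 29 is inferred from the reported percentage and is not stated by Codecov. A small sketch of the arithmetic, with the hypothetical names `PatchCoverageSketch` and `patchCoveragePercent`:

```java
/** Hypothetical helper that reproduces the patch coverage arithmetic. */
final class PatchCoverageSketch {

    /** Percentage of changed lines that are covered, given how many are missing coverage. */
    static double patchCoveragePercent(int changedLines, int missingLines) {
        if (changedLines <= 0 || missingLines < 0 || missingLines > changedLines) {
            throw new IllegalArgumentException("expected 0 <= missingLines <= changedLines and changedLines > 0");
        }
        return 100.0 * (changedLines - missingLines) / changedLines;
    }

    public static void main(String[] args) {
        // 29 changed lines is an inferred value; 19 missing lines comes from the report.
        System.out.printf("%.5f%%%n", patchCoveragePercent(29, 19));
    }
}
```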

Additional details and impacted files

@@             Coverage Diff              @@
##               4.20   #10674      +/-   ##
============================================
+ Coverage     16.13%   16.14%   +0.01%     
- Complexity    13220    13230      +10     
============================================
  Files          5651     5651              
  Lines        496740   496747       +7     
  Branches      60183    60184       +1     
============================================
+ Hits          80148    80213      +65     
+ Misses       407674   407600      -74     
- Partials       8918     8934      +16

Flag	Coverage Δ
uitests	`4.00% <ø> (ø)`
unittests	`16.99% <34.48%> (+0.01%)`	⬆️

blueorangutan · 2025-04-09T11:33:12Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12989

sureshanaparti · 2025-04-09T12:14:06Z

@blueorangutan test matrix

blueorangutan · 2025-04-09T12:16:04Z

@sureshanaparti a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

weizhouapache

code lgtm

blueorangutan · 2025-04-09T13:27:45Z

[SF] Trillian Build Failed (tid-12922)

blueorangutan · 2025-04-10T22:41:34Z

[SF] Trillian test result (tid-12923)
Environment: kvm-ubuntu22 (x2), Advanced Networking with Mgmt server u22
Total time taken: 121473 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12923-kvm-ubuntu22.zip
Smoke tests completed. 109 look OK, 32 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_events_resource	`Error`	7.86	test_events_resource.py
test_01_events_resource	`Error`	7.86	test_events_resource.py
ContextSuite context=TestAccounts>:setup	`Error`	0.00	test_accounts.py
ContextSuite context=TestAddVmToSubDomain>:setup	`Error`	0.00	test_accounts.py
test_DeleteDomain	`Error`	15.16	test_accounts.py
test_forceDeleteDomain	`Failure`	16.33	test_accounts.py
ContextSuite context=TestRemoveUserFromAccount>:setup	`Error`	14.84	test_accounts.py
ContextSuite context=TestTemplateHierarchy>:setup	`Error`	1536.10	test_accounts.py
ContextSuite context=TestDeployVmWithUserData>:setup	`Error`	0.00	test_deploy_vm_with_userdata.py
ContextSuite context=TestDeployVmWithAffinityGroup>:setup	`Error`	0.00	test_affinity_groups_projects.py
test_replace_acl_of_network	`Error`	5.74	test_global_acls.py
ContextSuite context=TestDeployVmWithVariedPlanners>:setup	`Error`	0.00	test_deploy_vms_with_varied_deploymentplanners.py
ContextSuite context=TestAnnotations>:setup	`Error`	0.00	test_annotations.py
ContextSuite context=TestDeployVirtioSCSIVM>:setup	`Error`	0.00	test_deploy_virtio_scsi_vm.py
ContextSuite context=TestInternalLb>:setup	`Error`	0.00	test_internal_lb.py
ContextSuite context=TestDeployVMFromISO>:setup	`Error`	0.00	test_deploy_vm_iso.py
ContextSuite context=TestDeployVMFromISOWithUefi>:setup	`Error`	0.00	test_deploy_vm_iso_uefi.py
ContextSuite context=TestIpv4Routing>:setup	`Error`	0.00	test_ipv4_routing.py
test_00_deploy_vm_root_resize	`Error`	1.54	test_deploy_vm_root_resize.py
ContextSuite context=TestDeployVMsInParallel>:setup	`Error`	0.00	test_deploy_vms_in_parallel.py
ContextSuite context=TestRemoteDiagnostics>:setup	`Error`	0.00	test_diagnostics.py
test_01_deploy_vm_from_direct_download_template_nfs_storage	`Error`	1.48	test_direct_download.py
ContextSuite context=TestDirectDownloadTemplates>:teardown	`Error`	1.09	test_direct_download.py
test_01_1_create_iso_with_checksum_sha1_negative	`Error`	66.66	test_iso.py
test_01_create_iso_with_checksum_sha1	`Error`	66.58	test_iso.py
test_01_create_iso_with_checksum_sha1	`Error`	66.58	test_iso.py
test_02_1_create_iso_with_checksum_sha256_negative	`Error`	66.55	test_iso.py
test_02_create_iso_with_checksum_sha256	`Error`	66.57	test_iso.py
test_02_create_iso_with_checksum_sha256	`Error`	66.57	test_iso.py
test_03_1_create_iso_with_checksum_md5_negative	`Error`	66.55	test_iso.py
test_03_create_iso_with_checksum_md5	`Error`	66.53	test_iso.py
test_03_create_iso_with_checksum_md5	`Error`	66.53	test_iso.py
test_04_create_iso_with_no_checksum	`Error`	66.56	test_iso.py
test_04_create_iso_with_no_checksum	`Error`	66.56	test_iso.py
test_01_create_iso	`Failure`	1519.95	test_iso.py
ContextSuite context=TestISO>:setup	`Error`	3038.61	test_iso.py
ContextSuite context=TestDomainsServiceOfferings>:setup	`Error`	1520.29	test_domain_service_offerings.py
test_03_create_vpc_domain_vpc_offering	`Error`	17.36	test_domain_vpc_offerings.py
test_updating_nics_on_two_shared_networks	`Error`	1.74	test_gateway_on_shared_networks.py
ContextSuite context=TestGatewayOnSharedNetwork>:teardown	`Error`	3.95	test_gateway_on_shared_networks.py
ContextSuite context=TestHostControlState>:setup	`Error`	53.81	test_host_control_state.py
test_01_browser_migrate_template	`Error`	65.86	test_image_store_object_migration.py
ContextSuite context=TestImportAndUnmanageVolumes>:setup	`Error`	0.00	test_import_unmanage_volumes.py
test_01_invalid_upgrade_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle	`Error`	0.00	test_kubernetes_clusters.py
test_01_add_delete_kubernetes_supported_version	`Error`	1802.10	test_kubernetes_supported_versions.py
ContextSuite context=TestListIdsParams>:setup	`Error`	0.00	test_list_ids_parameter.py
test_oobm_multiple_mgmt_server_ownership	`Failure`	31.84	test_outofbandmanagement.py
ContextSuite context=TestSnapshotRootDisk>:setup	`Error`	0.00	test_snapshots.py
ContextSuite context=TestISOUsage>:setup	`Error`	0.00	test_usage.py
test_10_attachAndDetach_iso	`Failure`	1513.77	test_vm_life_cycle.py
test_11_destroy_vm_and_volumes	`Error`	1.42	test_vm_life_cycle.py
test_12_start_vm_multiple_volumes_allocated	`Error`	71.12	test_vm_life_cycle.py
test_13_destroy_and_expunge_vm	`Error`	5.08	test_vm_life_cycle.py
ContextSuite context=TestVMSchedule>:setup	`Error`	0.00	test_vm_schedule.py
ContextSuite context=TestVmSnapshot>:setup	`Error`	7.06	test_vm_snapshots.py

blueorangutan · 2025-04-11T00:46:55Z

[SF] Trillian test result (tid-12924)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 128635 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12924-vmware-70u3.zip
Smoke tests completed. 100 look OK, 41 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
ContextSuite context=TestRemoteDiagnostics>:setup	`Error`	0.00	test_diagnostics.py
ContextSuite context=TestAccounts>:setup	`Error`	0.00	test_accounts.py
ContextSuite context=TestAddVmToSubDomain>:setup	`Error`	0.00	test_accounts.py
test_DeleteDomain	`Error`	13.21	test_accounts.py
test_forceDeleteDomain	`Failure`	14.28	test_accounts.py
ContextSuite context=TestRemoveUserFromAccount>:setup	`Error`	14.39	test_accounts.py
ContextSuite context=TestTemplateHierarchy>:setup	`Error`	1535.53	test_accounts.py
ContextSuite context=TestDeployVmWithUserData>:setup	`Error`	0.00	test_deploy_vm_with_userdata.py
test_01_events_resource	`Error`	7.22	test_events_resource.py
test_01_events_resource	`Error`	7.22	test_events_resource.py
ContextSuite context=TestDeployVmWithAffinityGroup>:setup	`Error`	0.00	test_affinity_groups_projects.py
test_DeployVmAffinityGroup	`Error`	13.84	test_affinity_groups.py
test_DeployVmAntiAffinityGroup	`Error`	11.69	test_affinity_groups.py
test_01_1_create_iso_with_checksum_sha1_negative	`Error`	66.52	test_iso.py
test_01_create_iso_with_checksum_sha1	`Error`	66.50	test_iso.py
test_01_create_iso_with_checksum_sha1	`Error`	66.50	test_iso.py
test_02_1_create_iso_with_checksum_sha256_negative	`Error`	66.51	test_iso.py
test_02_create_iso_with_checksum_sha256	`Error`	66.48	test_iso.py
test_02_create_iso_with_checksum_sha256	`Error`	66.48	test_iso.py
test_03_1_create_iso_with_checksum_md5_negative	`Error`	66.48	test_iso.py
test_03_create_iso_with_checksum_md5	`Error`	66.49	test_iso.py
test_03_create_iso_with_checksum_md5	`Error`	66.49	test_iso.py
test_04_create_iso_with_no_checksum	`Error`	66.49	test_iso.py
test_04_create_iso_with_no_checksum	`Error`	66.50	test_iso.py
test_01_create_iso	`Failure`	1518.19	test_iso.py
ContextSuite context=TestISO>:setup	`Error`	3036.38	test_iso.py
ContextSuite context=TestAnnotations>:setup	`Error`	0.00	test_annotations.py
test_replace_acl_of_network	`Error`	3.60	test_global_acls.py
ContextSuite context=TestMultipleVolumeAttach>:setup	`Error`	0.00	test_attach_multiple_volumes.py
ContextSuite context=TestDeployVmWithVariedPlanners>:setup	`Error`	0.00	test_deploy_vms_with_varied_deploymentplanners.py
ContextSuite context=TestConsoleEndpoint>:setup	`Error`	0.00	test_console_endpoint.py
test_3d_gpu_support	`Error`	1521.71	test_deploy_vgpu_enabled_vm.py
ContextSuite context=TestInternalLb>:setup	`Error`	0.00	test_internal_lb.py
test_05_deploy_vm_with_extraconfig_vmware	`Error`	15.71	test_deploy_vm_extra_config_data.py
ContextSuite context=TestDeployVMFromISO>:setup	`Error`	0.00	test_deploy_vm_iso.py
test_00_deploy_vm_root_resize	`Error`	21.98	test_deploy_vm_root_resize.py
ContextSuite context=TestDeployVMsInParallel>:setup	`Error`	0.00	test_deploy_vms_in_parallel.py
ContextSuite context=TestIpv4Routing>:setup	`Error`	0.00	test_ipv4_routing.py
test_01_vm_with_thin_disk_offering	`Error`	11.57	test_disk_provisioning_types.py
test_02_vm_with_fat_disk_offering	`Error`	11.76	test_disk_provisioning_types.py
test_03_vm_with_sparse_disk_offering	`Error`	11.92	test_disk_provisioning_types.py
ContextSuite context=TestDomainsServiceOfferings>:setup	`Error`	1519.54	test_domain_service_offerings.py
test_03_create_vpc_domain_vpc_offering	`Error`	14.24	test_domain_vpc_offerings.py
test_updating_nics_on_two_shared_networks	`Error`	1.56	test_gateway_on_shared_networks.py
ContextSuite context=TestGatewayOnSharedNetwork>:teardown	`Error`	3.73	test_gateway_on_shared_networks.py
ContextSuite context=TestHostControlState>:setup	`Error`	6.54	test_host_control_state.py
test_01_browser_migrate_template	`Error`	65.72	test_image_store_object_migration.py
test_01_invalid_upgrade_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle	`Error`	0.01	test_kubernetes_clusters.py
test_01_add_delete_kubernetes_supported_version	`Error`	1802.15	test_kubernetes_supported_versions.py
ContextSuite context=TestListIdsParams>:setup	`Error`	0.00	test_list_ids_parameter.py
ContextSuite context=TestPrivateGwACL>:setup	`Error`	0.00	test_privategw_acl.py
ContextSuite context=TestListVolumes>:setup	`Error`	0.00	test_list_volumes.py
ContextSuite context=TestLoadBalance>:setup	`Error`	0.00	test_loadbalance.py
test_04_deploy_vm_for_other_user_and_test_vm_operations	`Error`	146.01	test_network_permissions.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup	`Error`	1524.32	test_network.py
test_CRUD_operations_userdata	`Error`	1521.91	test_register_userdata.py
test_deploy_vm_with_registered_userdata	`Error`	8.33	test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_allow	`Error`	7.81	test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_append	`Error`	8.05	test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_deny	`Error`	8.84	test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_params	`Error`	7.90	test_register_userdata.py
test_link_and_unlink_userdata_to_template	`Error`	12.82	test_register_userdata.py
test_user_userdata_crud	`Error`	10.05	test_register_userdata.py
test_01_restore_vm	`Error`	19.38	test_restore_vm.py
test_02_restore_vm_with_disk_offering	`Error`	15.81	test_restore_vm.py
test_03_restore_vm_with_disk_offering_custom_size	`Error`	12.79	test_restore_vm.py
test_04_restore_vm_allocated_root	`Error`	20.20	test_restore_vm.py
ContextSuite context=TestRestoreVM>:teardown	`Error`	30.69	test_restore_vm.py
ContextSuite context=TestRouterDHCPOpts>:setup	`Error`	166.93	test_router_dhcphosts.py
ContextSuite context=TestRouterDns>:setup	`Error`	0.00	test_router_dns.py
ContextSuite context=TestIsolatedNetworks>:setup	`Error`	0.00	test_routers_network_ops.py
ContextSuite context=TestRedundantIsolateNetworks>:setup	`Error`	0.00	test_routers_network_ops.py
test_01_scale_vm	`Error`	1.45	test_scale_vm.py
test_02_scale_vm_negative_offering_disable_scaling	`Error`	1.31	test_scale_vm.py
test_03_scale_vm_negative_vm_disable_scaling	`Error`	1.30	test_scale_vm.py
test_04_scale_vm_with_user_account	`Error`	10.72	test_scale_vm.py
test_05_scale_vm_dont_allow_disk_offering_change	`Error`	1.45	test_scale_vm.py
ContextSuite context=TestServiceOfferings>:setup	`Error`	1516.70	test_service_offerings.py
test_02_restore_vm_strict_tags_failure	`Error`	58.38	test_vm_strict_host_tags.py

blueorangutan · 2025-04-11T04:15:31Z

[SF] Trillian test result (tid-12925)
Environment: xcpng82 (x2), Advanced Networking with Mgmt server ol9
Total time taken: 141630 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12925-xcpng82.zip
Smoke tests completed. 108 look OK, 33 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_events_resource	`Error`	1515.94	test_events_resource.py
ContextSuite context=TestAccounts>:setup	`Error`	0.00	test_accounts.py
ContextSuite context=TestAddVmToSubDomain>:setup	`Error`	0.00	test_accounts.py
test_DeleteDomain	`Error`	15.38	test_accounts.py
test_forceDeleteDomain	`Failure`	12.93	test_accounts.py
ContextSuite context=TestRemoveUserFromAccount>:setup	`Error`	14.00	test_accounts.py
ContextSuite context=TestTemplateHierarchy>:setup	`Error`	1534.03	test_accounts.py
ContextSuite context=TestDeployVmWithAffinityGroup>:setup	`Error`	0.00	test_affinity_groups_projects.py
test_replace_acl_of_network	`Error`	4.52	test_global_acls.py
ContextSuite context=TestAnnotations>:setup	`Error`	0.00	test_annotations.py
ContextSuite context=TestDeployVmWithUserData>:setup	`Error`	0.00	test_deploy_vm_with_userdata.py
ContextSuite context=TestMultipleVolumeAttach>:setup	`Error`	0.00	test_attach_multiple_volumes.py
test_01_condensed_drs_algorithm	`Failure`	174.62	test_cluster_drs.py
test_02_balanced_drs_algorithm	`Failure`	178.34	test_cluster_drs.py
ContextSuite context=TestInternalLb>:setup	`Error`	0.00	test_internal_lb.py
ContextSuite context=TestDeployVMFromISO>:setup	`Error`	0.00	test_deploy_vm_iso.py
ContextSuite context=TestIpv4Routing>:setup	`Error`	0.00	test_ipv4_routing.py
test_00_deploy_vm_root_resize	`Error`	1.45	test_deploy_vm_root_resize.py
ContextSuite context=TestDeployVmWithVariedPlanners>:setup	`Error`	0.00	test_deploy_vms_with_varied_deploymentplanners.py
ContextSuite context=TestDeployVMsInParallel>:setup	`Error`	0.00	test_deploy_vms_in_parallel.py
ContextSuite context=TestRemoteDiagnostics>:setup	`Error`	0.00	test_diagnostics.py
test_01_1_create_iso_with_checksum_sha1_negative	`Error`	66.53	test_iso.py
test_01_create_iso_with_checksum_sha1	`Error`	66.52	test_iso.py
test_01_create_iso_with_checksum_sha1	`Error`	66.52	test_iso.py
test_02_1_create_iso_with_checksum_sha256_negative	`Error`	66.52	test_iso.py
test_02_create_iso_with_checksum_sha256	`Error`	66.51	test_iso.py
test_02_create_iso_with_checksum_sha256	`Error`	66.51	test_iso.py
test_03_1_create_iso_with_checksum_md5_negative	`Error`	66.50	test_iso.py
test_03_create_iso_with_checksum_md5	`Error`	66.52	test_iso.py
test_03_create_iso_with_checksum_md5	`Error`	66.52	test_iso.py
test_04_create_iso_with_no_checksum	`Error`	66.51	test_iso.py
test_04_create_iso_with_no_checksum	`Error`	66.51	test_iso.py
test_01_create_iso	`Failure`	1518.13	test_iso.py
ContextSuite context=TestISO>:setup	`Error`	3035.34	test_iso.py
ContextSuite context=TestDomainsServiceOfferings>:setup	`Error`	1519.56	test_domain_service_offerings.py
test_03_create_vpc_domain_vpc_offering	`Error`	15.87	test_domain_vpc_offerings.py
test_updating_nics_on_two_shared_networks	`Error`	0.05	test_gateway_on_shared_networks.py
ContextSuite context=TestHostControlState>:setup	`Error`	5.91	test_host_control_state.py
test_01_browser_migrate_template	`Error`	65.73	test_image_store_object_migration.py
test_01_invalid_upgrade_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster	`Failure`	0.01	test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster	`Failure`	0.00	test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle	`Error`	0.00	test_kubernetes_clusters.py
test_01_add_delete_kubernetes_supported_version	`Error`	1801.91	test_kubernetes_supported_versions.py
ContextSuite context=TestListIdsParams>:setup	`Error`	0.00	test_list_ids_parameter.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup	`Error`	10.53	test_network.py
test_01_non_strict_host_anti_affinity	`Error`	239.70	test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity	`Error`	116.39	test_nonstrict_affinity_group.py
test_CRUD_operations_userdata	`Error`	1521.61	test_register_userdata.py
test_deploy_vm_with_registered_userdata	`Error`	7.67	test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_allow	`Error`	8.77	test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_append	`Error`	7.48	test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_deny	`Error`	8.75	test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_params	`Error`	7.50	test_register_userdata.py
test_link_and_unlink_userdata_to_template	`Error`	8.42	test_register_userdata.py
test_user_userdata_crud	`Error`	7.11	test_register_userdata.py
ContextSuite context=TestRAMCPUResourceAccounting>:setup	`Error`	0.00	test_resource_accounting.py
test_02_create_volume	`Error`	5.37	test_resource_names.py
ContextSuite context=TestRouterDns>:setup	`Error`	0.00	test_router_dns.py
ContextSuite context=TestRouterDnsService>:setup	`Error`	0.00	test_router_dnsservice.py
test_05_scale_vm_dont_allow_disk_offering_change	`Failure`	82.84	test_scale_vm.py
test_01_volume_usage	`Error`	100.93	test_usage.py

sureshanaparti · 2025-04-11T13:09:40Z

@blueorangutan test matrix

blueorangutan · 2025-04-11T13:12:04Z

@sureshanaparti a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

blueorangutan · 2025-04-11T13:45:12Z

[SF] Trillian Build Failed (tid-12962)

blueorangutan · 2025-04-11T13:54:48Z

[SF] Trillian Build Failed (tid-12960)

blueorangutan · 2025-04-12T05:28:17Z

[SF] Trillian test result (tid-12961)
Environment: kvm-ubuntu22 (x2), Advanced Networking with Mgmt server u22
Total time taken: 56356 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12961-kvm-ubuntu22.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_oobm_multiple_mgmt_server_ownership	`Failure`	30.77	test_outofbandmanagement.py

blueorangutan · 2025-04-12T09:41:28Z

[SF] Trillian test result (tid-12963)
Environment: xcpng82 (x2), Advanced Networking with Mgmt server ol9
Total time taken: 71668 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12963-xcpng82.zip
Smoke tests completed. 135 look OK, 6 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
ContextSuite context=TestClusterDRS>:setup	`Error`	0.00	test_cluster_drs.py
test_list_system_vms_metrics_history	`Failure`	0.21	test_metrics_api.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup	`Error`	10.56	test_network.py
test_01_non_strict_host_anti_affinity	`Error`	222.25	test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity	`Error`	119.41	test_nonstrict_affinity_group.py
test_02_create_volume	`Error`	2.24	test_resource_names.py
test_05_scale_vm_dont_allow_disk_offering_change	`Failure`	66.42	test_scale_vm.py

sureshanaparti · 2025-04-13T09:38:09Z

@blueorangutan test ol8 vmware-70u3

blueorangutan · 2025-04-13T09:40:06Z

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-70u3) has been kicked to run smoke tests

blueorangutan · 2025-04-14T04:51:51Z

[SF] Trillian test result (tid-12973)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 66890 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12973-vmware-70u3.zip
Smoke tests completed. 135 look OK, 6 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_events_resource	`Error`	381.00	test_events_resource.py
test_01_events_resource	`Error`	381.01	test_events_resource.py
test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80	`Error`	622.34	test_internal_lb.py
test_02_internallb_roundrobin_1RVPC_3VM_HTTP_port80	`Error`	168.90	test_internal_lb.py
test_02_internallb_roundrobin_1RVPC_3VM_HTTP_port80	`Error`	168.91	test_internal_lb.py
test_04_rvpc_internallb_haproxy_stats_on_all_interfaces	`Error`	307.64	test_internal_lb.py
test_04_deploy_vm_for_other_user_and_test_vm_operations	`Error`	118.30	test_network_permissions.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup	`Error`	1523.23	test_network.py
test_02_restore_vm_with_disk_offering	`Error`	52.02	test_restore_vm.py
test_03_restore_vm_with_disk_offering_custom_size	`Error`	55.00	test_restore_vm.py
test_02_restore_vm_strict_tags_failure	`Error`	56.30	test_vm_strict_host_tags.py

sureshanaparti · 2025-04-23T10:42:13Z

@blueorangutan package

rohityadavcloud · 2025-04-24T04:25:39Z

@sureshanaparti can you check failures
@blueorangutan package

blueorangutan · 2025-04-24T04:26:03Z

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2025-04-24T05:59:58Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13157

blueorangutan · 2025-04-24T08:28:49Z

[SF] Trillian test result (tid-13109)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 62268 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t13109-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_02_restore_vm_strict_tags_failure	`Failure`	62.70	test_vm_strict_host_tags.py
test_02_scale_vm_strict_tags_failure	`Failure`	67.81	test_vm_strict_host_tags.py
test_06_deploy_vm_on_any_host_with_strict_tags_failure	`Failure`	5.77	test_vm_strict_host_tags.py

shwstppr

code lgtm

sureshanaparti · 2025-04-29T07:29:13Z

@blueorangutan package

blueorangutan · 2025-04-29T07:30:05Z

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2025-04-29T09:00:51Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13211

rohityadavcloud · 2025-04-30T09:25:51Z

@blueorangutan test

blueorangutan · 2025-04-30T09:28:03Z

@rohityadavcloud a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan · 2025-05-01T01:25:52Z

[SF] Trillian test result (tid-13185)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 54051 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t13185-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_02_restore_vm_strict_tags_failure	`Failure`	54.34	test_vm_strict_host_tags.py
test_02_scale_vm_strict_tags_failure	`Failure`	56.53	test_vm_strict_host_tags.py
test_06_deploy_vm_on_any_host_with_strict_tags_failure	`Failure`	5.71	test_vm_strict_host_tags.py

sureshanaparti · 2025-05-05T07:42:14Z

@blueorangutan package

blueorangutan · 2025-05-05T07:44:05Z

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2025-05-05T09:08:10Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13257

DaanHoogland · 2025-05-06T06:09:20Z

@sureshanaparti can you add some test description for those that want to try to break the change?

DaanHoogland · 2025-05-06T06:09:28Z

@blueorangutan test

blueorangutan · 2025-05-06T06:10:04Z

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan · 2025-05-06T22:47:55Z

[SF] Trillian test result (tid-13213)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 56935 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t13213-kvm-ol8.zip
Smoke tests completed. 140 look OK, 0 have errors, 1 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
--- | --- | --- | ---
all_test_human_readable_logs | `Skipped` | --- | test_human_readable_logs.py

borisstoyanov

LGTM

boring-cyborg bot added component:agent component:orchestration labels Apr 9, 2025

sureshanaparti requested a review from weizhouapache April 9, 2025 10:26

sureshanaparti added this to the 4.20.1 milestone Apr 9, 2025

sureshanaparti added this to ACS 4.20.1 Apr 9, 2025

rohityadavcloud requested review from Pearl1594 and shwstppr April 9, 2025 10:37

weizhouapache approved these changes Apr 9, 2025

sureshanaparti moved this to In Progress in ACS 4.20.1 Apr 9, 2025

sureshanaparti force-pushed the clusteredAgentManagerImpl_rebalance_improvements branch from c9bb527 to f398273 Compare April 23, 2025 10:42

shwstppr approved these changes Apr 24, 2025

sureshanaparti force-pushed the clusteredAgentManagerImpl_rebalance_improvements branch from f398273 to bb44e18 Compare April 29, 2025 07:28

rohityadavcloud assigned borisstoyanov Apr 30, 2025

rohityadavcloud closed this Apr 30, 2025

rohityadavcloud reopened this Apr 30, 2025

sureshanaparti force-pushed the clusteredAgentManagerImpl_rebalance_improvements branch from bb44e18 to cfa0120 Compare May 5, 2025 07:42

sureshanaparti requested a review from borisstoyanov May 6, 2025 07:10

sureshanaparti marked this pull request as ready for review May 9, 2025 08:47

borisstoyanov approved these changes May 13, 2025

rohityadavcloud merged commit 95489b8 into apache:4.20 May 13, 2025
21 of 26 checks passed

DaanHoogland deleted the clusteredAgentManagerImpl_rebalance_improvements branch May 13, 2025 12:18
