@sureshanaparti
Contributor

@sureshanaparti sureshanaparti commented Apr 9, 2025

Description

This PR improves the rebalancing of direct agents. Hypervisor hosts (direct agents) sometimes get stuck in the Disconnect state while agents are being rebalanced across multiple management server nodes. The issue was noticed during frequent restarts of the management server nodes in the cluster.

When a cluster has multiple management server nodes and one or more of them are shut down, started, or restarted, CloudStack rebalances the hosts among the remaining nodes or moves hosts to the newly joined management server nodes. During the rebalancing period, several operations can run, including:

  • DirectAgentScan, which runs at the interval configured by direct.agent.scan.interval
  • AgentRebalanceScan, which identifies the agents to rebalance and schedules them
  • TransferAgentScan, which transfers a host from its original owner to its future owner

Current Rebalance behavior

  1. For hosts that have an AgentAttache that is not forForward but are in the Disconnect state, CloudStack simply ignores them: it neither pings them again nor updates their status.
  2. For hosts that have an AgentAttache that is forForward, CloudStack removes the agent but still tries to loadDirectlyConnectedHost.

Improved Rebalance behavior

During DirectAgentScan, scanDirectAgentToLoad() identifies self-managed hosts that are in the Disconnect state (disconnected after the ping timeout).

  1. For hosts that have an AgentAttache that is forForward, CloudStack should remove the agent.
  2. For hosts that have an AgentAttache that is not forForward but are in the Disconnect state, CloudStack should investigate the host and update its status to Up if the host is pingable.
  3. For hosts that don't have an AgentAttache, CloudStack should try to loadDirectlyConnectedHost.
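
The three cases above can be sketched as a small decision function. This is only an illustration of the branching described in this PR, not the actual code in ClusteredAgentManagerImpl: the class `DirectAgentScanSketch`, the `Attache` and `HostStatus` types, the `Action` enum, and the `decide` method are all hypothetical names, and the `NONE` result for a connected host with a non-forwarding attache is an assumption, since the description does not cover that case.

```java
import java.util.Optional;

public class DirectAgentScanSketch {

    /** Hypothetical stand-in for the host states mentioned in the description. */
    public enum HostStatus { UP, DISCONNECT }

    /** Hypothetical outcome of one scan pass for a single host. */
    public enum Action {
        REMOVE_AGENT,
        INVESTIGATE_AND_UPDATE_STATUS,
        LOAD_DIRECTLY_CONNECTED_HOST,
        NONE
    }

    /** Hypothetical view of an agent attache; only the forForward flag matters here. */
    public static final class Attache {
        private final boolean forForward;

        public Attache(boolean forForward) {
            this.forForward = forForward;
        }

        public boolean isForForward() {
            return forForward;
        }
    }

    /** Picks the action for one self-managed host, following the three cases in order. */
    public static Action decide(Optional<Attache> attache, HostStatus status) {
        if (!attache.isPresent()) {
            // Case 3: no attache, so try to load the host directly.
            return Action.LOAD_DIRECTLY_CONNECTED_HOST;
        }
        if (attache.get().isForForward()) {
            // Case 1: a forwarding attache is removed and nothing else is done.
            return Action.REMOVE_AGENT;
        }
        if (status == HostStatus.DISCONNECT) {
            // Case 2: own attache but disconnected, so investigate and update the status.
            return Action.INVESTIGATE_AND_UPDATE_STATUS;
        }
        // Assumption: a connected host with its own attache needs no action.
        return Action.NONE;
    }

    public static void main(String[] args) {
        System.out.println(decide(Optional.empty(), HostStatus.DISCONNECT));
        System.out.println(decide(Optional.of(new Attache(true)), HostStatus.DISCONNECT));
        System.out.println(decide(Optional.of(new Attache(false)), HostStatus.DISCONNECT));
    }
}
```

Unlike the earlier behavior, the forForward branch returns immediately instead of falling through to a load attempt, and the disconnected branch yields an explicit investigation step rather than being skipped.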

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Restarted the management server nodes in the cluster and a few VMware hypervisor hosts.

How did you try to break this feature and the system with this change?

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov

codecov bot commented Apr 9, 2025

Codecov Report

Attention: Patch coverage is 34.48276% with 19 lines in your changes missing coverage. Please review.

Project coverage is 16.14%. Comparing base (f6d0590) to head (cfa0120).
Report is 25 commits behind head on 4.20.

Files with missing lines Patch % Lines
...cloud/agent/manager/ClusteredAgentManagerImpl.java 34.48% 19 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.20   #10674      +/-   ##
============================================
+ Coverage     16.13%   16.14%   +0.01%     
- Complexity    13220    13230      +10     
============================================
  Files          5651     5651              
  Lines        496740   496747       +7     
  Branches      60183    60184       +1     
============================================
+ Hits          80148    80213      +65     
+ Misses       407674   407600      -74     
- Partials       8918     8934      +16     
Flag Coverage Δ
uitests 4.00% <ø> (ø)
unittests 16.99% <34.48%> (+0.01%) ⬆️

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12989

@sureshanaparti
Contributor Author

@blueorangutan test matrix

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

Member

@weizhouapache weizhouapache left a comment

code lgtm

@blueorangutan

[SF] Trillian Build Failed (tid-12922)

@sureshanaparti sureshanaparti moved this to In Progress in ACS 4.20.1 Apr 9, 2025
@blueorangutan

[SF] Trillian test result (tid-12923)
Environment: kvm-ubuntu22 (x2), Advanced Networking with Mgmt server u22
Total time taken: 121473 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12923-kvm-ubuntu22.zip
Smoke tests completed. 109 look OK, 32 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 7.86 test_events_resource.py
test_01_events_resource Error 7.86 test_events_resource.py
ContextSuite context=TestAccounts>:setup Error 0.00 test_accounts.py
ContextSuite context=TestAddVmToSubDomain>:setup Error 0.00 test_accounts.py
test_DeleteDomain Error 15.16 test_accounts.py
test_forceDeleteDomain Failure 16.33 test_accounts.py
ContextSuite context=TestRemoveUserFromAccount>:setup Error 14.84 test_accounts.py
ContextSuite context=TestTemplateHierarchy>:setup Error 1536.10 test_accounts.py
ContextSuite context=TestDeployVmWithUserData>:setup Error 0.00 test_deploy_vm_with_userdata.py
ContextSuite context=TestDeployVmWithAffinityGroup>:setup Error 0.00 test_affinity_groups_projects.py
test_replace_acl_of_network Error 5.74 test_global_acls.py
ContextSuite context=TestDeployVmWithVariedPlanners>:setup Error 0.00 test_deploy_vms_with_varied_deploymentplanners.py
ContextSuite context=TestAnnotations>:setup Error 0.00 test_annotations.py
ContextSuite context=TestDeployVirtioSCSIVM>:setup Error 0.00 test_deploy_virtio_scsi_vm.py
ContextSuite context=TestInternalLb>:setup Error 0.00 test_internal_lb.py
ContextSuite context=TestDeployVMFromISO>:setup Error 0.00 test_deploy_vm_iso.py
ContextSuite context=TestDeployVMFromISOWithUefi>:setup Error 0.00 test_deploy_vm_iso_uefi.py
ContextSuite context=TestIpv4Routing>:setup Error 0.00 test_ipv4_routing.py
test_00_deploy_vm_root_resize Error 1.54 test_deploy_vm_root_resize.py
ContextSuite context=TestDeployVMsInParallel>:setup Error 0.00 test_deploy_vms_in_parallel.py
ContextSuite context=TestRemoteDiagnostics>:setup Error 0.00 test_diagnostics.py
test_01_deploy_vm_from_direct_download_template_nfs_storage Error 1.48 test_direct_download.py
ContextSuite context=TestDirectDownloadTemplates>:teardown Error 1.09 test_direct_download.py
test_01_1_create_iso_with_checksum_sha1_negative Error 66.66 test_iso.py
test_01_create_iso_with_checksum_sha1 Error 66.58 test_iso.py
test_01_create_iso_with_checksum_sha1 Error 66.58 test_iso.py
test_02_1_create_iso_with_checksum_sha256_negative Error 66.55 test_iso.py
test_02_create_iso_with_checksum_sha256 Error 66.57 test_iso.py
test_02_create_iso_with_checksum_sha256 Error 66.57 test_iso.py
test_03_1_create_iso_with_checksum_md5_negative Error 66.55 test_iso.py
test_03_create_iso_with_checksum_md5 Error 66.53 test_iso.py
test_03_create_iso_with_checksum_md5 Error 66.53 test_iso.py
test_04_create_iso_with_no_checksum Error 66.56 test_iso.py
test_04_create_iso_with_no_checksum Error 66.56 test_iso.py
test_01_create_iso Failure 1519.95 test_iso.py
ContextSuite context=TestISO>:setup Error 3038.61 test_iso.py
ContextSuite context=TestDomainsServiceOfferings>:setup Error 1520.29 test_domain_service_offerings.py
test_03_create_vpc_domain_vpc_offering Error 17.36 test_domain_vpc_offerings.py
test_updating_nics_on_two_shared_networks Error 1.74 test_gateway_on_shared_networks.py
ContextSuite context=TestGatewayOnSharedNetwork>:teardown Error 3.95 test_gateway_on_shared_networks.py
ContextSuite context=TestHostControlState>:setup Error 53.81 test_host_control_state.py
test_01_browser_migrate_template Error 65.86 test_image_store_object_migration.py
ContextSuite context=TestImportAndUnmanageVolumes>:setup Error 0.00 test_import_unmanage_volumes.py
test_01_invalid_upgrade_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster Failure 0.00 test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle Error 0.00 test_kubernetes_clusters.py
test_01_add_delete_kubernetes_supported_version Error 1802.10 test_kubernetes_supported_versions.py
ContextSuite context=TestListIdsParams>:setup Error 0.00 test_list_ids_parameter.py
test_oobm_multiple_mgmt_server_ownership Failure 31.84 test_outofbandmanagement.py
ContextSuite context=TestSnapshotRootDisk>:setup Error 0.00 test_snapshots.py
ContextSuite context=TestISOUsage>:setup Error 0.00 test_usage.py
test_10_attachAndDetach_iso Failure 1513.77 test_vm_life_cycle.py
test_11_destroy_vm_and_volumes Error 1.42 test_vm_life_cycle.py
test_12_start_vm_multiple_volumes_allocated Error 71.12 test_vm_life_cycle.py
test_13_destroy_and_expunge_vm Error 5.08 test_vm_life_cycle.py
ContextSuite context=TestVMSchedule>:setup Error 0.00 test_vm_schedule.py
ContextSuite context=TestVmSnapshot>:setup Error 7.06 test_vm_snapshots.py

@blueorangutan

[SF] Trillian test result (tid-12924)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 128635 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12924-vmware-70u3.zip
Smoke tests completed. 100 look OK, 41 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestRemoteDiagnostics>:setup Error 0.00 test_diagnostics.py
ContextSuite context=TestAccounts>:setup Error 0.00 test_accounts.py
ContextSuite context=TestAddVmToSubDomain>:setup Error 0.00 test_accounts.py
test_DeleteDomain Error 13.21 test_accounts.py
test_forceDeleteDomain Failure 14.28 test_accounts.py
ContextSuite context=TestRemoveUserFromAccount>:setup Error 14.39 test_accounts.py
ContextSuite context=TestTemplateHierarchy>:setup Error 1535.53 test_accounts.py
ContextSuite context=TestDeployVmWithUserData>:setup Error 0.00 test_deploy_vm_with_userdata.py
test_01_events_resource Error 7.22 test_events_resource.py
test_01_events_resource Error 7.22 test_events_resource.py
ContextSuite context=TestDeployVmWithAffinityGroup>:setup Error 0.00 test_affinity_groups_projects.py
test_DeployVmAffinityGroup Error 13.84 test_affinity_groups.py
test_DeployVmAntiAffinityGroup Error 11.69 test_affinity_groups.py
test_01_1_create_iso_with_checksum_sha1_negative Error 66.52 test_iso.py
test_01_create_iso_with_checksum_sha1 Error 66.50 test_iso.py
test_01_create_iso_with_checksum_sha1 Error 66.50 test_iso.py
test_02_1_create_iso_with_checksum_sha256_negative Error 66.51 test_iso.py
test_02_create_iso_with_checksum_sha256 Error 66.48 test_iso.py
test_02_create_iso_with_checksum_sha256 Error 66.48 test_iso.py
test_03_1_create_iso_with_checksum_md5_negative Error 66.48 test_iso.py
test_03_create_iso_with_checksum_md5 Error 66.49 test_iso.py
test_03_create_iso_with_checksum_md5 Error 66.49 test_iso.py
test_04_create_iso_with_no_checksum Error 66.49 test_iso.py
test_04_create_iso_with_no_checksum Error 66.50 test_iso.py
test_01_create_iso Failure 1518.19 test_iso.py
ContextSuite context=TestISO>:setup Error 3036.38 test_iso.py
ContextSuite context=TestAnnotations>:setup Error 0.00 test_annotations.py
test_replace_acl_of_network Error 3.60 test_global_acls.py
ContextSuite context=TestMultipleVolumeAttach>:setup Error 0.00 test_attach_multiple_volumes.py
ContextSuite context=TestDeployVmWithVariedPlanners>:setup Error 0.00 test_deploy_vms_with_varied_deploymentplanners.py
ContextSuite context=TestConsoleEndpoint>:setup Error 0.00 test_console_endpoint.py
test_3d_gpu_support Error 1521.71 test_deploy_vgpu_enabled_vm.py
ContextSuite context=TestInternalLb>:setup Error 0.00 test_internal_lb.py
test_05_deploy_vm_with_extraconfig_vmware Error 15.71 test_deploy_vm_extra_config_data.py
ContextSuite context=TestDeployVMFromISO>:setup Error 0.00 test_deploy_vm_iso.py
test_00_deploy_vm_root_resize Error 21.98 test_deploy_vm_root_resize.py
ContextSuite context=TestDeployVMsInParallel>:setup Error 0.00 test_deploy_vms_in_parallel.py
ContextSuite context=TestIpv4Routing>:setup Error 0.00 test_ipv4_routing.py
test_01_vm_with_thin_disk_offering Error 11.57 test_disk_provisioning_types.py
test_02_vm_with_fat_disk_offering Error 11.76 test_disk_provisioning_types.py
test_03_vm_with_sparse_disk_offering Error 11.92 test_disk_provisioning_types.py
ContextSuite context=TestDomainsServiceOfferings>:setup Error 1519.54 test_domain_service_offerings.py
test_03_create_vpc_domain_vpc_offering Error 14.24 test_domain_vpc_offerings.py
test_updating_nics_on_two_shared_networks Error 1.56 test_gateway_on_shared_networks.py
ContextSuite context=TestGatewayOnSharedNetwork>:teardown Error 3.73 test_gateway_on_shared_networks.py
ContextSuite context=TestHostControlState>:setup Error 6.54 test_host_control_state.py
test_01_browser_migrate_template Error 65.72 test_image_store_object_migration.py
test_01_invalid_upgrade_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster Failure 0.00 test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle Error 0.01 test_kubernetes_clusters.py
test_01_add_delete_kubernetes_supported_version Error 1802.15 test_kubernetes_supported_versions.py
ContextSuite context=TestListIdsParams>:setup Error 0.00 test_list_ids_parameter.py
ContextSuite context=TestPrivateGwACL>:setup Error 0.00 test_privategw_acl.py
ContextSuite context=TestListVolumes>:setup Error 0.00 test_list_volumes.py
ContextSuite context=TestLoadBalance>:setup Error 0.00 test_loadbalance.py
test_04_deploy_vm_for_other_user_and_test_vm_operations Error 146.01 test_network_permissions.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1524.32 test_network.py
test_CRUD_operations_userdata Error 1521.91 test_register_userdata.py
test_deploy_vm_with_registered_userdata Error 8.33 test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_allow Error 7.81 test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_append Error 8.05 test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_deny Error 8.84 test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_params Error 7.90 test_register_userdata.py
test_link_and_unlink_userdata_to_template Error 12.82 test_register_userdata.py
test_user_userdata_crud Error 10.05 test_register_userdata.py
test_01_restore_vm Error 19.38 test_restore_vm.py
test_02_restore_vm_with_disk_offering Error 15.81 test_restore_vm.py
test_03_restore_vm_with_disk_offering_custom_size Error 12.79 test_restore_vm.py
test_04_restore_vm_allocated_root Error 20.20 test_restore_vm.py
ContextSuite context=TestRestoreVM>:teardown Error 30.69 test_restore_vm.py
ContextSuite context=TestRouterDHCPOpts>:setup Error 166.93 test_router_dhcphosts.py
ContextSuite context=TestRouterDns>:setup Error 0.00 test_router_dns.py
ContextSuite context=TestIsolatedNetworks>:setup Error 0.00 test_routers_network_ops.py
ContextSuite context=TestRedundantIsolateNetworks>:setup Error 0.00 test_routers_network_ops.py
test_01_scale_vm Error 1.45 test_scale_vm.py
test_02_scale_vm_negative_offering_disable_scaling Error 1.31 test_scale_vm.py
test_03_scale_vm_negative_vm_disable_scaling Error 1.30 test_scale_vm.py
test_04_scale_vm_with_user_account Error 10.72 test_scale_vm.py
test_05_scale_vm_dont_allow_disk_offering_change Error 1.45 test_scale_vm.py
ContextSuite context=TestServiceOfferings>:setup Error 1516.70 test_service_offerings.py
test_02_restore_vm_strict_tags_failure Error 58.38 test_vm_strict_host_tags.py

@blueorangutan

[SF] Trillian test result (tid-12925)
Environment: xcpng82 (x2), Advanced Networking with Mgmt server ol9
Total time taken: 141630 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12925-xcpng82.zip
Smoke tests completed. 108 look OK, 33 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 1515.94 test_events_resource.py
ContextSuite context=TestAccounts>:setup Error 0.00 test_accounts.py
ContextSuite context=TestAddVmToSubDomain>:setup Error 0.00 test_accounts.py
test_DeleteDomain Error 15.38 test_accounts.py
test_forceDeleteDomain Failure 12.93 test_accounts.py
ContextSuite context=TestRemoveUserFromAccount>:setup Error 14.00 test_accounts.py
ContextSuite context=TestTemplateHierarchy>:setup Error 1534.03 test_accounts.py
ContextSuite context=TestDeployVmWithAffinityGroup>:setup Error 0.00 test_affinity_groups_projects.py
test_replace_acl_of_network Error 4.52 test_global_acls.py
ContextSuite context=TestAnnotations>:setup Error 0.00 test_annotations.py
ContextSuite context=TestDeployVmWithUserData>:setup Error 0.00 test_deploy_vm_with_userdata.py
ContextSuite context=TestMultipleVolumeAttach>:setup Error 0.00 test_attach_multiple_volumes.py
test_01_condensed_drs_algorithm Failure 174.62 test_cluster_drs.py
test_02_balanced_drs_algorithm Failure 178.34 test_cluster_drs.py
ContextSuite context=TestInternalLb>:setup Error 0.00 test_internal_lb.py
ContextSuite context=TestDeployVMFromISO>:setup Error 0.00 test_deploy_vm_iso.py
ContextSuite context=TestIpv4Routing>:setup Error 0.00 test_ipv4_routing.py
test_00_deploy_vm_root_resize Error 1.45 test_deploy_vm_root_resize.py
ContextSuite context=TestDeployVmWithVariedPlanners>:setup Error 0.00 test_deploy_vms_with_varied_deploymentplanners.py
ContextSuite context=TestDeployVMsInParallel>:setup Error 0.00 test_deploy_vms_in_parallel.py
ContextSuite context=TestRemoteDiagnostics>:setup Error 0.00 test_diagnostics.py
test_01_1_create_iso_with_checksum_sha1_negative Error 66.53 test_iso.py
test_01_create_iso_with_checksum_sha1 Error 66.52 test_iso.py
test_01_create_iso_with_checksum_sha1 Error 66.52 test_iso.py
test_02_1_create_iso_with_checksum_sha256_negative Error 66.52 test_iso.py
test_02_create_iso_with_checksum_sha256 Error 66.51 test_iso.py
test_02_create_iso_with_checksum_sha256 Error 66.51 test_iso.py
test_03_1_create_iso_with_checksum_md5_negative Error 66.50 test_iso.py
test_03_create_iso_with_checksum_md5 Error 66.52 test_iso.py
test_03_create_iso_with_checksum_md5 Error 66.52 test_iso.py
test_04_create_iso_with_no_checksum Error 66.51 test_iso.py
test_04_create_iso_with_no_checksum Error 66.51 test_iso.py
test_01_create_iso Failure 1518.13 test_iso.py
ContextSuite context=TestISO>:setup Error 3035.34 test_iso.py
ContextSuite context=TestDomainsServiceOfferings>:setup Error 1519.56 test_domain_service_offerings.py
test_03_create_vpc_domain_vpc_offering Error 15.87 test_domain_vpc_offerings.py
test_updating_nics_on_two_shared_networks Error 0.05 test_gateway_on_shared_networks.py
ContextSuite context=TestHostControlState>:setup Error 5.91 test_host_control_state.py
test_01_browser_migrate_template Error 65.73 test_image_store_object_migration.py
test_01_invalid_upgrade_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster Failure 0.01 test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster Failure 0.01 test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster Failure 0.00 test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle Error 0.00 test_kubernetes_clusters.py
test_01_add_delete_kubernetes_supported_version Error 1801.91 test_kubernetes_supported_versions.py
ContextSuite context=TestListIdsParams>:setup Error 0.00 test_list_ids_parameter.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 10.53 test_network.py
test_01_non_strict_host_anti_affinity Error 239.70 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 116.39 test_nonstrict_affinity_group.py
test_CRUD_operations_userdata Error 1521.61 test_register_userdata.py
test_deploy_vm_with_registered_userdata Error 7.67 test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_allow Error 8.77 test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_append Error 7.48 test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_override_policy_deny Error 8.75 test_register_userdata.py
test_deploy_vm_with_registered_userdata_with_params Error 7.50 test_register_userdata.py
test_link_and_unlink_userdata_to_template Error 8.42 test_register_userdata.py
test_user_userdata_crud Error 7.11 test_register_userdata.py
ContextSuite context=TestRAMCPUResourceAccounting>:setup Error 0.00 test_resource_accounting.py
test_02_create_volume Error 5.37 test_resource_names.py
ContextSuite context=TestRouterDns>:setup Error 0.00 test_router_dns.py
ContextSuite context=TestRouterDnsService>:setup Error 0.00 test_router_dnsservice.py
test_05_scale_vm_dont_allow_disk_offering_change Failure 82.84 test_scale_vm.py
test_01_volume_usage Error 100.93 test_usage.py

@sureshanaparti
Contributor Author

@blueorangutan test matrix

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian Build Failed (tid-12962)

@blueorangutan

[SF] Trillian Build Failed (tid-12960)

@blueorangutan

[SF] Trillian test result (tid-12961)
Environment: kvm-ubuntu22 (x2), Advanced Networking with Mgmt server u22
Total time taken: 56356 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12961-kvm-ubuntu22.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_oobm_multiple_mgmt_server_ownership Failure 30.77 test_outofbandmanagement.py

@blueorangutan

[SF] Trillian test result (tid-12963)
Environment: xcpng82 (x2), Advanced Networking with Mgmt server ol9
Total time taken: 71668 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12963-xcpng82.zip
Smoke tests completed. 135 look OK, 6 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestClusterDRS>:setup Error 0.00 test_cluster_drs.py
test_list_system_vms_metrics_history Failure 0.21 test_metrics_api.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 10.56 test_network.py
test_01_non_strict_host_anti_affinity Error 222.25 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 119.41 test_nonstrict_affinity_group.py
test_02_create_volume Error 2.24 test_resource_names.py
test_05_scale_vm_dont_allow_disk_offering_change Failure 66.42 test_scale_vm.py

@sureshanaparti
Contributor Author

@blueorangutan test ol8 vmware-70u3

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-70u3) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-12973)
Environment: vmware-70u3 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 66890 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t12973-vmware-70u3.zip
Smoke tests completed. 135 look OK, 6 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 381.00 test_events_resource.py
test_01_events_resource Error 381.01 test_events_resource.py
test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80 Error 622.34 test_internal_lb.py
test_02_internallb_roundrobin_1RVPC_3VM_HTTP_port80 Error 168.90 test_internal_lb.py
test_02_internallb_roundrobin_1RVPC_3VM_HTTP_port80 Error 168.91 test_internal_lb.py
test_04_rvpc_internallb_haproxy_stats_on_all_interfaces Error 307.64 test_internal_lb.py
test_04_deploy_vm_for_other_user_and_test_vm_operations Error 118.30 test_network_permissions.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1523.23 test_network.py
test_02_restore_vm_with_disk_offering Error 52.02 test_restore_vm.py
test_03_restore_vm_with_disk_offering_custom_size Error 55.00 test_restore_vm.py
test_02_restore_vm_strict_tags_failure Error 56.30 test_vm_strict_host_tags.py

@sureshanaparti sureshanaparti force-pushed the clusteredAgentManagerImpl_rebalance_improvements branch from c9bb527 to f398273 Compare April 23, 2025 10:42
@sureshanaparti
Contributor Author

@blueorangutan package

@rohityadavcloud
Member

@sureshanaparti can you check the failures?
@blueorangutan package

@blueorangutan

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13157

@blueorangutan

[SF] Trillian test result (tid-13109)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 62268 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t13109-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_02_restore_vm_strict_tags_failure Failure 62.70 test_vm_strict_host_tags.py
test_02_scale_vm_strict_tags_failure Failure 67.81 test_vm_strict_host_tags.py
test_06_deploy_vm_on_any_host_with_strict_tags_failure Failure 5.77 test_vm_strict_host_tags.py

Contributor

@shwstppr shwstppr left a comment

code lgtm

@sureshanaparti sureshanaparti force-pushed the clusteredAgentManagerImpl_rebalance_improvements branch from f398273 to bb44e18 Compare April 29, 2025 07:28
@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13211

@rohityadavcloud
Member

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13185)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 54051 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t13185-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_02_restore_vm_strict_tags_failure Failure 54.34 test_vm_strict_host_tags.py
test_02_scale_vm_strict_tags_failure Failure 56.53 test_vm_strict_host_tags.py
test_06_deploy_vm_on_any_host_with_strict_tags_failure Failure 5.71 test_vm_strict_host_tags.py

@sureshanaparti sureshanaparti force-pushed the clusteredAgentManagerImpl_rebalance_improvements branch from bb44e18 to cfa0120 Compare May 5, 2025 07:42
@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13257

@DaanHoogland
Contributor

@sureshanaparti can you add some test description for those that want to try to break the change?

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13213)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 56935 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10674-t13213-kvm-ol8.zip
Smoke tests completed. 140 look OK, 0 have errors, 1 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
all_test_human_readable_logs Skipped --- test_human_readable_logs.py

@sureshanaparti sureshanaparti marked this pull request as ready for review May 9, 2025 08:47
Contributor

@borisstoyanov borisstoyanov left a comment

LGTM

@rohityadavcloud rohityadavcloud merged commit 95489b8 into apache:4.20 May 13, 2025
21 of 26 checks passed
@DaanHoogland DaanHoogland deleted the clusteredAgentManagerImpl_rebalance_improvements branch May 13, 2025 12:18
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jun 19, 2025
…nodes (apache#10674)


Projects

Status: Done

7 participants