@weizhouapache
Member

Description

Currently, when investigating a KVM host, the investigator does not consider the status of cluster-wide storage pools, so the result can be incorrect.

Zone-wide storage pools are not affected, because the method findZoneWideStoragePoolsByHypervisor already contains this condition:

sc.and(sc.entity().getStatus(), Op.EQ, Status.Up);
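
To illustrate the intent of the status check, here is a minimal, self-contained sketch. It is not the CloudStack implementation: `Pool`, `PoolType`, `PoolStatus`, and `usableNfsPools` are hypothetical names invented for this example, and the restriction to NFS is an assumption based on the "no NFS storage" log message later in this conversation. The idea is only that cluster-wide pools which are not Up should be excluded before the investigator decides whether it can investigate the host.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for CloudStack's storage pool types; illustrative only.
enum PoolStatus { UP, DISABLED, MAINTENANCE }

enum PoolType { NFS, OTHER }

final class Pool {
    final String name;
    final PoolType type;
    final PoolStatus status;

    Pool(String name, PoolType type, PoolStatus status) {
        this.name = name;
        this.type = type;
        this.status = status;
    }
}

final class PoolFilterSketch {
    // Keep only NFS pools whose status is UP. A pool that is disabled or in
    // maintenance is ignored, mirroring the Status.Up condition quoted above.
    static List<Pool> usableNfsPools(List<Pool> clusterPools) {
        return clusterPools.stream()
                .filter(p -> p.type == PoolType.NFS)
                .filter(p -> p.status == PoolStatus.UP)
                .collect(Collectors.toList());
    }
}
```

With this sketch, a cluster whose only NFS pool is disabled yields an empty list, so a caller could skip the investigation instead of sending commands through a pool that is not in use.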

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@weizhouapache weizhouapache added this to the 4.19.3 milestone Mar 6, 2025
@codecov

codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 15.16%. Comparing base (b41acf2) to head (11402b6).
Report is 3 commits behind head on 4.19.

Files with missing lines Patch % Lines
...vm/src/main/java/com/cloud/ha/KVMInvestigator.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10516      +/-   ##
============================================
- Coverage     15.17%   15.16%   -0.01%     
+ Complexity    11332    11329       -3     
============================================
  Files          5414     5414              
  Lines        474802   474802              
  Branches      57909    57909              
============================================
- Hits          72028    72010      -18     
- Misses       394718   394740      +22     
+ Partials       8056     8052       -4     
Flag Coverage Δ
uitests 4.28% <ø> (ø)
unittests 15.89% <0.00%> (-0.01%) ⬇️

@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12679

@weizhouapache
Member Author

@blueorangutan test

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian Build Failed (tid-12596)

@weizhouapache weizhouapache marked this pull request as ready for review March 6, 2025 14:36
@weizhouapache
Member Author

@blueorangutan test

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-12601)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 47468 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10516-t12601-kvm-ol8.zip
Smoke tests completed. 133 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm

Contributor

@Pearl1594 Pearl1594 left a comment

clgtm

@DaanHoogland
Contributor

DaanHoogland commented Mar 10, 2025

Tested in a lab environment: when all storage pools are disabled, an error is returned immediately saying that no pool is available. When any one pool is enabled, deployments work as expected.

Just found out that this doesn't test the change as expected. Investigating more.

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 12721

@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@Pearl1594 Pearl1594 merged commit 8ce34ad into apache:4.19 Mar 10, 2025
24 of 25 checks passed
@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12723

@weizhouapache
Member Author

@blueorangutan test

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian Build Failed (tid-12637)

@DaanHoogland
Contributor

@Pearl1594 tested:

without the fix:

2025-03-10 12:35:50,893 DEBUG [c.c.h.KVMInvestigator] (AgentTaskPool-2:ctx-c22e955a) (logid:950000fd) HA: HOST is ineligible legacy state Disconnected for host 1
2025-03-10 12:35:50,898 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-2:ctx-c22e955a) (logid:950000fd) KVMInvestigator was able to determine host 1 is in Disconnected

but then:

2025-03-10 12:35:50,898 DEBUG [c.c.s.StorageManagerImpl] (StatsCollector-5:ctx-5e1f25ef) (logid:dfdf31d0) Unable to send storage pool command to Pool[1|NetworkFilesystem] via 1
com.cloud.exception.OperationTimedoutException: Commands 823877256832090309 to Host 1 timed out after 3600
        at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:447)
        at com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:465)
        at com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:371)
        at com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.java:1678)

with this fix:

2025-03-10 14:41:31,888 WARN  [c.c.h.KVMInvestigator] (AgentTaskPool-1:ctx-45ac40c1) (logid:a75618f4) Agent investigation was requested on host Host {"id":1,"name":"ol8.localdomain","type":"Routing","uuid":"5f84e980-42e4-4f48-99ea-41e71c152f36"}, but host does not support investigation because it has no NFS storage. Skipping investigation.
2025-03-10 14:41:31,888 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-45ac40c1) (logid:a75618f4) KVMInvestigator unable to determine the state of the host.  Moving on.

and then other investigators are called...

(cc @weizhouapache )
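
The "unable to determine ... Moving on" behaviour in the logs above can be pictured as a chain of investigators that is walked until one of them gives a definite answer. The following is a minimal sketch of that pattern under stated assumptions, not the actual HighAvailabilityManagerImpl code: `HostInvestigator`, `HostState`, `determineState`, and the `fixed` helper are hypothetical names, and an empty `Optional` is used here to stand for "could not determine the state".

```java
import java.util.List;
import java.util.Optional;

// Hypothetical host states for this sketch only.
enum HostState { UP, DOWN, DISCONNECTED }

// Hypothetical investigator contract: an empty result means "undetermined".
interface HostInvestigator {
    String name();
    Optional<HostState> investigate(long hostId);
}

final class InvestigatorChainSketch {
    // Ask each investigator in order and return the first definite answer.
    // An investigator that cannot decide is skipped, so the next one runs.
    static Optional<HostState> determineState(long hostId, List<HostInvestigator> investigators) {
        for (HostInvestigator investigator : investigators) {
            Optional<HostState> result = investigator.investigate(hostId);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    // Test helper: an investigator that always gives the same answer
    // (pass null to model "unable to determine").
    static HostInvestigator fixed(String name, HostState stateOrNull) {
        return new HostInvestigator() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public Optional<HostState> investigate(long hostId) {
                return Optional.ofNullable(stateOrNull);
            }
        };
    }
}
```

In this sketch, a first investigator that returns an empty result (as the KVM investigator does with the fix when no usable NFS pool exists) lets a later investigator decide, whereas a first investigator that returns DISCONNECTED (as in the log without the fix) ends the chain immediately.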

@DaanHoogland DaanHoogland deleted the 4.19-kvm-investigator-pool-up-state branch March 10, 2025 14:52
@weizhouapache
Member Author

Thanks @DaanHoogland for the testing!

Are you OK with the new behaviour?

@DaanHoogland
Contributor

Yes @weizhouapache, the objective was to make sure that the other investigators continue, and that has been proven, so yes. Whether other investigations should follow is a separate issue.

@weizhouapache
Member Author

Great, thanks @DaanHoogland.
So it is fine to merge it.

@Pearl1594 Pearl1594 moved this to Done in ACS 4.20.1 Mar 17, 2025
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jun 19, 2025