Conversation

@sureshanaparti
Contributor

Description

This PR allows the config drive of a migrated VM to be deleted during host maintenance, from its last source location (host cache, primary storage, or secondary storage).

Fixes the following issues, identified during host maintenance tests with VMs using a config drive on the KVM host cache.

  1. The old config drive location is overwritten in the VM detail when the new config drive is created during migration. (The location changes here if any of the location settings were updated after the VM was created: vm.configdrive.force.host.cache.use, vm.configdrive.primarypool.enabled, vm.configdrive.use.host.cache.on.unsupported.pool.) As a result, the rollback or post-migration step tries to delete the config drive on the source host from the new, overwritten location, where the file does not exist.

  2. The delete config drive command, HandleConfigDriveIsoCommand, is not allowed when the host is in maintenance.

2024-12-04 07:36:10,315 DEBUG [c.c.n.e.ConfigDriveNetworkElement] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Deleting config drive ISO for vm: i-2-6-VM on host: 1
2024-12-04 07:36:10,318 WARN  [c.c.a.m.AgentManagerImpl] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Resource [Host:1] is unreachable: Host 1: Unable to send class com.cloud.agent.api.HandleConfigDriveIsoCommand because agent ol8.localdomain is in maintenance mode
2024-12-04 07:36:10,319 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: Unable to get an answer to handle config drive deletion for vm: i-2-6-VM on host: 1
2024-12-04 07:36:10,319 INFO  [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: Unable to get an answer to handle config drive deletion for vm: i-2-6-VM on host: 1
2024-12-04 07:36:10,319 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60) (logid:332edac0) Done with run of VM work job: com.cloud.vm.VmWorkMigrateAway for VM 6, job origin: 58
2024-12-04 07:36:10,319 ERROR [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60) (logid:332edac0) Unable to complete AsyncJobVO: {id:60, userId: 1, accountId: 1, instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkMigrateAway, cmdInfo: rO0ABXNyAB5jb20uY2xvdWQudm0uVm1Xb3JrTWlncmF0ZUF3YXmt4MX4jtcEmwIAAUoACXNyY0hvc3RJZHhyABNjb20uY2xvdWQudm0uVm1Xb3Jrn5m2VvAlZ2sCAARKAAlhY2NvdW50SWRKAAZ1c2VySWRKAAR2bUlkTAALaGFuZGxlck5hbWV0ABJMamF2YS9sYW5nL1N0cmluZzt4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAABnQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAAQ, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 32986187695067, completeMsid: null, lastUpdated: null, lastPolled: null, created: Wed Dec 04 07:36:04 UTC 2024, removed: null}, job origin:58
com.cloud.utils.exception.CloudRuntimeException: Unable to get an answer to handle config drive deletion for vm: i-2-6-VM on host: 1
	at com.cloud.network.element.ConfigDriveNetworkElement.deleteConfigDriveIsoOnHostCache(ConfigDriveNetworkElement.java:585)
	at com.cloud.network.element.ConfigDriveNetworkElement.commitMigration(ConfigDriveNetworkElement.java:379)
	at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.commitNicForMigration(NetworkOrchestrator.java:2264)
	at com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:2904)
	at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:3488)
	at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:5535)
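
The WARN line in the log above shows the agent manager refusing to send HandleConfigDriveIsoCommand because the host is in maintenance mode. One way to picture issue 2 is a gate that checks each command against an allow-list before sending it to a host in maintenance. The sketch below is a minimal illustration under that assumption. The class MaintenanceGateSketch, the set ALLOWED_IN_MAINTENANCE, the method canSend and the name SomeOtherCommand are hypothetical and are not CloudStack APIs; only the command name HandleConfigDriveIsoCommand comes from the log.

```java
import java.util.Set;

// Hypothetical sketch, not CloudStack code: a gate deciding whether a command
// may be sent to a host that is in maintenance mode.
class MaintenanceGateSketch {

    // Assumed allow-list; only this command name is taken from the log above.
    static final Set<String> ALLOWED_IN_MAINTENANCE = Set.of("HandleConfigDriveIsoCommand");

    // Returns true if the named command may be sent to the host.
    static boolean canSend(String commandName, boolean hostInMaintenance) {
        if (!hostInMaintenance) {
            return true; // no restriction outside maintenance
        }
        return ALLOWED_IN_MAINTENANCE.contains(commandName);
    }

    public static void main(String[] args) {
        System.out.println(canSend("HandleConfigDriveIsoCommand", true)); // true
        System.out.println(canSend("SomeOtherCommand", true));            // false
        System.out.println(canSend("SomeOtherCommand", false));           // true
    }
}
```

With the delete command on such a list, cleanup on the source host could still go through while other commands stay blocked during maintenance.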

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Manually tested maintenance on a host with some running VMs whose config drives were on host cache and secondary storage. The running VMs were migrated to other available hosts and the old config drives were removed.

How did you try to break this feature and the system with this change?

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov

codecov bot commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 21.27660% with 37 lines in your changes missing coverage. Please review.

Project coverage is 15.12%. Comparing base (a278849) to head (965180d).
Report is 3 commits behind head on 4.19.

Files with missing lines                                 Patch %   Lines
...ervisor/kvm/resource/LibvirtComputingResource.java    0.00%     19 Missing ⚠️
...oud/network/element/ConfigDriveNetworkElement.java    27.27%    13 Missing and 3 partials ⚠️
.../hypervisor/kvm/storage/KVMStoragePoolManager.java    0.00%     2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10045      +/-   ##
============================================
- Coverage     15.12%   15.12%   -0.01%     
- Complexity    11262    11263       +1     
============================================
  Files          5408     5408              
  Lines        473843   473888      +45     
  Branches      57771    57786      +15     
============================================
  Hits          71689    71689              
- Misses       394155   394199      +44     
- Partials       7999     8000       +1     
Flag        Coverage Δ
uitests     4.30% <ø> (ø)
unittests   15.84% <21.27%> (-0.01%) ⬇️


Member

@weizhouapache weizhouapache left a comment

code lgtm

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11721

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-11853)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 48716 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10045-t11853-kvm-ol8.zip
Smoke tests completed. 132 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test                                         Result   Time (s)   Test File
test_03_secured_to_nonsecured_vm_migration   Error    401.92     test_vm_life_cycle.py

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11776

@kiranchavala
Contributor

@blueorangutan test

@blueorangutan

@kiranchavala a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-11881)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 44026 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10045-t11881-kvm-ol8.zip
Smoke tests completed. 133 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11784

@sureshanaparti sureshanaparti force-pushed the config-drive-delete-issue-of-migrated-vm-on-host-maintenance branch from 4d42886 to 2e79237 December 13, 2024 08:20
@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11802

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11828

@kiranchavala
Contributor

@blueorangutan test

@blueorangutan

@kiranchavala a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

Contributor

@kiranchavala kiranchavala left a comment

LGTM. Tested the issue manually by following these steps.

Steps to reproduce the issue

1. Set the global setting 'vm.configdrive.force.host.cache.use' to true (enable it).

2. Create a network offering with user data on config drive.

3. Create an isolated network using the above network offering.

4. Create a few VMs (using a password-enabled template) on that isolated network on the same host, and ensure the config drive is created on the host cache and attached.
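
The steps above place the config drive on host cache through a global setting, and the PR description notes that the recorded location is overwritten if such a setting changes before migration. The sketch below is a rough illustration of that idea, not the PR's code. The Location enum, the DETAIL_KEY name, the precedence inside resolve and the recordNewLocation helper are all assumptions; the third setting, vm.configdrive.use.host.cache.on.unsupported.pool, is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not CloudStack code: keep the previous config drive
// location when a new one is recorded, so the old ISO can still be cleaned up.
class ConfigDriveLocationSketch {

    enum Location { HOST_CACHE, PRIMARY, SECONDARY }

    // Assumed key under which the location is stored in the VM details.
    static final String DETAIL_KEY = "configDriveLocation";

    // Assumed precedence: forced host cache first, then primary pool, else secondary.
    static Location resolve(boolean forceHostCache, boolean primaryPoolEnabled) {
        if (forceHostCache) {
            return Location.HOST_CACHE;
        }
        return primaryPoolEnabled ? Location.PRIMARY : Location.SECONDARY;
    }

    // Stores the new location and returns the one that was recorded before it.
    // If nothing was recorded yet, the new location is returned.
    static Location recordNewLocation(Map<String, String> vmDetails, Location newLocation) {
        String previous = vmDetails.put(DETAIL_KEY, newLocation.name());
        return previous == null ? newLocation : Location.valueOf(previous);
    }

    public static void main(String[] args) {
        Map<String, String> details = new HashMap<>();
        // VM created while vm.configdrive.force.host.cache.use is true.
        recordNewLocation(details, resolve(true, false));
        // Setting switched off later; migration records a different location.
        Location old = recordNewLocation(details, resolve(false, false));
        System.out.println("delete old ISO from: " + old); // HOST_CACHE
    }
}
```

Returning the previous value before it is replaced lets the caller delete the ISO from where it was actually created, rather than from the newly computed location.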

Logs Before the fix

2024-12-04 07:36:10,315 DEBUG [c.c.n.e.ConfigDriveNetworkElement] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Deleting config drive ISO for vm: i-2-6-VM on host: 1
2024-12-04 07:36:10,318 WARN  [c.c.a.m.AgentManagerImpl] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Resource [Host:1] is unreachable: Host 1: Unable to send class com.cloud.agent.api.HandleConfigDriveIsoCommand because agent ol8.localdomain is in maintenance mode
2024-12-04 07:36:10,319 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: Unable to get an answer to handle config drive deletion for vm: i-2-6-VM on host: 1
2024-12-04 07:36:10,319 INFO  [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: Unable to get an answer to handle config drive deletion for vm: i-2-6-VM on host: 1
2024-12-04 07:36:10,319 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60) (logid:332edac0) Done with run of VM work job: com.cloud.vm.VmWorkMigrateAway for VM 6, job origin: 58
2024-12-04 07:36:10,319 ERROR [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60) (logid:332edac0) Unable to complete AsyncJobVO: {id:60, userId: 1, accountId: 1, instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkMigrateAway, cmdInfo: rO0ABXNyAB5jb20uY2xvdWQudm0uVm1Xb3JrTWlncmF0ZUF3YXmt4MX4jtcEmwIAAUoACXNyY0hvc3RJZHhyABNjb20uY2xvdWQudm0uVm1Xb3Jrn5m2VvAlZ2sCAARKAAlhY2NvdW50SWRKAAZ1c2VySWRKAAR2bUlkTAALaGFuZGxlck5hbWV0ABJMamF2YS9sYW5nL1N0cmluZzt4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAABnQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAAQ, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 32986187695067, completeMsid: null, lastUpdated: null, lastPolled: null, created: Wed Dec 04 07:36:04 UTC 2024, removed: null}, job origin:58
com.cloud.utils.exception.CloudRuntimeException: Unable to get an answer to handle config drive deletion for vm: i-2-6-VM on host: 1
	at com.cloud.network.element.ConfigDriveNetworkElement.deleteConfigDriveIsoOnHostCache(ConfigDriveNetworkElement.java:585)
	at com.cloud.network.element.ConfigDriveNetworkElement.commitMigration(ConfigDriveNetworkElement.java:379)
	at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.commitNicForMigration(NetworkOrchestrator.java:2264)
	at com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:2904)
	at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:3488)
	at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:5535)

Logs After the fix

No exception was observed in the logs; the VM and router migrated successfully.

Logs on the KVM host

2024-12-17 07:42:48,955 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:a0e9913d) Processing command: com.cloud.agent.api.HandleConfigDriveIsoCommand
2024-12-17 07:42:48,955 DEBUG [resource.wrapper.LibvirtHandleConfigDriveCommandWrapper] (agentRequest-Handler-5:null) (logid:a0e9913d) Deleting config drive: configdrive/i-2-6-VM.iso
2024-12-17 07:42:48,956 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:a0e9913d) Seq 2-4036351166030807368:  { Ans: , MgmtId: 32988486173712, via: 2, Ver: v1, Flags: 10, [{"com.cloud.agent.api.HandleConfigDriveIsoAnswer":{"result":"true","wait":"0","bypassHostMaintenance":"false"}}] }
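
The host log above shows the agent deleting configdrive/i-2-6-VM.iso and answering with result true. A delete that tolerates an already missing file could look roughly like the sketch below. This is an assumption for illustration only and is not the LibvirtHandleConfigDriveCommandWrapper implementation; the class ConfigDriveDeleteSketch, the method deleteConfigDrive and the base directory argument are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not CloudStack code: delete a config drive ISO
// under a base directory, treating a missing file as success.
class ConfigDriveDeleteSketch {

    // Returns true when the ISO is absent afterwards, whether or not it existed.
    static boolean deleteConfigDrive(Path baseDir, String relativeIsoPath) throws IOException {
        Path base = baseDir.normalize();
        Path iso = base.resolve(relativeIsoPath).normalize();
        // Refuse paths that would leave the base directory (e.g. "../x.iso").
        if (!iso.startsWith(base)) {
            throw new IllegalArgumentException("ISO path escapes base directory: " + relativeIsoPath);
        }
        Files.deleteIfExists(iso);
        return !Files.exists(iso);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("hostcache");
        Path iso = base.resolve("configdrive/i-2-6-VM.iso");
        Files.createDirectories(iso.getParent());
        Files.createFile(iso);
        System.out.println(deleteConfigDrive(base, "configdrive/i-2-6-VM.iso")); // true
        System.out.println(deleteConfigDrive(base, "configdrive/i-2-6-VM.iso")); // true, already gone
    }
}
```

Treating a missing file as success keeps a repeated or late cleanup from failing the migration job.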


Logs on the management server


2024-12-17 07:42:48,953 DEBUG [c.c.a.t.Request] (Work-Job-Executor-21:ctx-404c7714 job-61/job-64 ctx-68d5f4d1) (logid:a0e9913d) Seq 2-4036351166030807368: Sending  { Cmd , MgmtId: 32988486173712, via: 2(ol8.localdomain), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.HandleConfigDriveIsoCommand":{"isoFile":"configdrive/i-2-6-VM.iso","create":"false","useHostCacheOnUnsupportedPool":"false","preferHostCache":"true","wait":"0","bypassHostMaintenance":"false"}}] }
2024-12-17 07:42:48,957 DEBUG [c.c.a.t.Request] (Work-Job-Executor-21:ctx-404c7714 job-61/job-64 ctx-68d5f4d1) (logid:a0e9913d) Seq 2-4036351166030807368: Received:  { Ans: , MgmtId: 32988486173712, via: 2(ol8.localdomain), Ver: v1, Flags: 10, { HandleConfigDriveIsoAnswer } }

@sureshanaparti sureshanaparti marked this pull request as ready for review December 17, 2024 12:36
@blueorangutan

[SF] Trillian test result (tid-11921)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 47745 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10045-t11921-kvm-ol8.zip
Smoke tests completed. 133 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@DaanHoogland DaanHoogland merged commit b4ad04b into apache:4.19 Dec 18, 2024
25 of 26 checks passed
@DaanHoogland DaanHoogland deleted the config-drive-delete-issue-of-migrated-vm-on-host-maintenance branch December 18, 2024 08:12
DaanHoogland added a commit that referenced this pull request Dec 20, 2024
* 4.20:
  VR: apply iptables rules when add/remove static routes (#10064)
  Certificate and VM hostname validation improvements (#10051)
  set ulimit for server according to redhat spec (#10040)
  kvm-storage: provide isVMMigrate information to storage plugins (#10093)
  Allow config drive deletion of migrated VM, on host maintenance (#10045)
  linstor: improve heartbeat check with also asking linstor (#10105)
  server: simplify role change validation (#9173)
  UI: create VPC network offering with conserve mode (#10082)
  server: fix typo removeaccessvpn in VirtualRouterElement (#10086)
  UI: remove duplicated Instance Name in Public IP details page (#10087)
  UI: Fixes in the Usage UI (#10000)
  SAML2: add cookie with HttpOnly too #10013 (#10047)
  ui: Allow font-awesome icon usage and optimise icon size inconsistency (#9744)
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Dec 26, 2024