Allow config drive deletion of migrated VM, on host maintenance #10045
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #10045      +/-   ##
============================================
- Coverage     15.12%   15.12%   -0.01%
- Complexity    11262    11263       +1
============================================
  Files          5408     5408
  Lines        473843   473888      +45
  Branches      57771    57786      +15
============================================
  Hits          71689    71689
- Misses       394155   394199      +44
- Partials       7999     8000       +1
Flags with carried forward coverage won't be shown.
weizhouapache
left a comment
code lgtm
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11721 |
|
@blueorangutan test |
|
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-11853)
|
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11776 |
|
@blueorangutan test |
|
@kiranchavala a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-11881)
|
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11784 |
… its last source location - host cache/primary/secondary storage)
4d42886 to 2e79237
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11802 |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11828 |
|
@blueorangutan test |
|
@kiranchavala a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
kiranchavala
left a comment
LGTM. Tested the fix manually by following these steps.
Steps to reproduce the issue
1. Set the global setting ‘vm.configdrive.force.host.cache.use’ to true (enable it).
2. Create a network offering with user data on config drive.
3. Create an isolated network using the above network offering.
4. Create a few VMs (using a password-enabled template) on that isolated network on the same host, and ensure the config drive is created on the host cache and attached.
Logs Before the fix
2024-12-04 07:36:10,315 DEBUG [c.c.n.e.ConfigDriveNetworkElement] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Deleting config drive ISO for vm: i-2-6-VM on host: 1
2024-12-04 07:36:10,318 WARN [c.c.a.m.AgentManagerImpl] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Resource [Host:1] is unreachable: Host 1: Unable to send class com.cloud.agent.api.HandleConfigDriveIsoCommand because agent ol8.localdomain is in maintenance mode
2024-12-04 07:36:10,319 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: Unable to get an answer to handle config drive deletion for vm: i-2-6-VM on host: 1
2024-12-04 07:36:10,319 INFO [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60 ctx-43e67e9c) (logid:332edac0) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: Unable to get an answer to handle config drive deletion for vm: i-2-6-VM on host: 1
2024-12-04 07:36:10,319 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60) (logid:332edac0) Done with run of VM work job: com.cloud.vm.VmWorkMigrateAway for VM 6, job origin: 58
2024-12-04 07:36:10,319 ERROR [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-10:ctx-fabb0d27 job-58/job-60) (logid:332edac0) Unable to complete AsyncJobVO: {id:60, userId: 1, accountId: 1, instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkMigrateAway, cmdInfo: rO0ABXNyAB5jb20uY2xvdWQudm0uVm1Xb3JrTWlncmF0ZUF3YXmt4MX4jtcEmwIAAUoACXNyY0hvc3RJZHhyABNjb20uY2xvdWQudm0uVm1Xb3Jrn5m2VvAlZ2sCAARKAAlhY2NvdW50SWRKAAZ1c2VySWRKAAR2bUlkTAALaGFuZGxlck5hbWV0ABJMamF2YS9sYW5nL1N0cmluZzt4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAABnQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAAQ, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 32986187695067, completeMsid: null, lastUpdated: null, lastPolled: null, created: Wed Dec 04 07:36:04 UTC 2024, removed: null}, job origin:58
com.cloud.utils.exception.CloudRuntimeException: Unable to get an answer to handle config drive deletion for vm: i-2-6-VM on host: 1
at com.cloud.network.element.ConfigDriveNetworkElement.deleteConfigDriveIsoOnHostCache(ConfigDriveNetworkElement.java:585)
at com.cloud.network.element.ConfigDriveNetworkElement.commitMigration(ConfigDriveNetworkElement.java:379)
at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.commitNicForMigration(NetworkOrchestrator.java:2264)
at com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:2904)
at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:3488)
at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:5535)
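The failure in the logs above can be modelled with a small sketch: a send path that rejects any command not permitted while the host is in maintenance. This is only an illustration under assumptions. The class `MaintenanceGateSketch`, the `send` method, and the allow-list parameter are hypothetical and are not CloudStack's actual `AgentManagerImpl` code; only the command name and the error wording are taken from the logs.

```java
import java.util.Set;

/** Hypothetical model of a maintenance-mode gate; not CloudStack's real implementation. */
final class MaintenanceGateSketch {

    enum HostState { UP, MAINTENANCE }

    /**
     * Pretends to send a command to a host agent.
     * Assumption: commands are identified by simple name, and an allow-list
     * decides which of them may still be sent while the host is in maintenance.
     */
    static String send(HostState state, String commandName, Set<String> allowedInMaintenance) {
        if (state == HostState.MAINTENANCE && !allowedInMaintenance.contains(commandName)) {
            throw new IllegalStateException(
                    "Unable to send " + commandName + " because agent is in maintenance mode");
        }
        return "sent " + commandName;
    }

    public static void main(String[] args) {
        String delete = "HandleConfigDriveIsoCommand";

        // Delete command is not on the allow-list: rejected, as in the WARN line above.
        try {
            send(HostState.MAINTENANCE, delete, Set.of("MigrateCommand"));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // Once the delete command is permitted during maintenance, it goes through.
        System.out.println(send(HostState.MAINTENANCE, delete, Set.of("MigrateCommand", delete)));
    }
}
```

In this model the migrate-away job fails only because the cleanup command is refused, not because the migration itself failed, which matches the order of the log lines above.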
Logs After the fix
No exception was observed in the logs; the VM and router migrated successfully.
Logs on the kvm host
2024-12-17 07:42:48,955 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:a0e9913d) Processing command: com.cloud.agent.api.HandleConfigDriveIsoCommand
2024-12-17 07:42:48,955 DEBUG [resource.wrapper.LibvirtHandleConfigDriveCommandWrapper] (agentRequest-Handler-5:null) (logid:a0e9913d) Deleting config drive: configdrive/i-2-6-VM.iso
2024-12-17 07:42:48,956 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:a0e9913d) Seq 2-4036351166030807368: { Ans: , MgmtId: 32988486173712, via: 2, Ver: v1, Flags: 10, [{"com.cloud.agent.api.HandleConfigDriveIsoAnswer":{"result":"true","wait":"0","bypassHostMaintenance":"false"}}] }
Logs on the management server
2024-12-17 07:42:48,953 DEBUG [c.c.a.t.Request] (Work-Job-Executor-21:ctx-404c7714 job-61/job-64 ctx-68d5f4d1) (logid:a0e9913d) Seq 2-4036351166030807368: Sending { Cmd , MgmtId: 32988486173712, via: 2(ol8.localdomain), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.HandleConfigDriveIsoCommand":{"isoFile":"configdrive/i-2-6-VM.iso","create":"false","useHostCacheOnUnsupportedPool":"false","preferHostCache":"true","wait":"0","bypassHostMaintenance":"false"}}] }
2024-12-17 07:42:48,957 DEBUG [c.c.a.t.Request] (Work-Job-Executor-21:ctx-404c7714 job-61/job-64 ctx-68d5f4d1) (logid:a0e9913d) Seq 2-4036351166030807368: Received: { Ans: , MgmtId: 32988486173712, via: 2(ol8.localdomain), Ver: v1, Flags: 10, { HandleConfigDriveIsoAnswer } }
|
[SF] Trillian test result (tid-11921)
|
* 4.20:
  VR: apply iptables rules when add/remove static routes (#10064)
  Certificate and VM hostname validation improvements (#10051)
  set ulimit for server according to redhat spec (#10040)
  kvm-storage: provide isVMMigrate information to storage plugins (#10093)
  Allow config drive deletion of migrated VM, on host maintenance (#10045)
  linstor: improve heartbeat check with also asking linstor (#10105)
  server: simplify role change validation (#9173)
  UI: create VPC network offering with conserve mode (#10082)
  server: fix typo removeaccessvpn in VirtualRouterElement (#10086)
  UI: remove duplicated Instance Name in Public IP details page (#10087)
  UI: Fixes in the Usage UI (#10000)
  SAML2: add cookie with HttpOnly too #10013 (#10047)
  ui: Allow font-awesome icon usage and optimise icon size inconsistency (#9744)
Description
This PR allows config drive deletion of a migrated VM on host maintenance (from its last source location: host cache, primary storage, or secondary storage).
It fixes the issues below, identified during host maintenance tests with VMs using a config drive on the KVM host cache.
The old config drive location is overridden in the VM detail when the new config drive is created during migration. (The location changes if any of the location settings were updated after the VM was created: vm.configdrive.force.host.cache.use, vm.configdrive.primarypool.enabled, vm.configdrive.use.host.cache.on.unsupported.pool.) As a result, the rollback or post-migration cleanup tries to delete the config drive from the new, overridden location on the source host, where the file doesn't exist.
The delete config drive command, HandleConfigDriveIsoCommand, is not allowed when the host is in maintenance.
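The first issue, deleting from the overridden location instead of the original one, can be illustrated with a minimal sketch. It assumes the VM details are a simple string map and that the previous location is preserved under a separate key before being overwritten. The class `ConfigDriveLocationSketch`, both detail keys, and the `Location` enum are hypothetical names chosen for illustration; they are not taken from the CloudStack code base and do not claim to be how this PR implements the fix.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of tracking the last source location of a config drive. */
final class ConfigDriveLocationSketch {

    enum Location { HOST_CACHE, PRIMARY, SECONDARY }

    // Assumed detail keys, invented for this example.
    static final String CURRENT_KEY = "configDriveLocation";
    static final String LAST_SOURCE_KEY = "lastConfigDriveLocation";

    /** Stores the new location for the destination, keeping the old one for cleanup. */
    static void prepareMigration(Map<String, String> vmDetails, Location newLocation) {
        String previous = vmDetails.get(CURRENT_KEY);
        if (previous != null) {
            vmDetails.put(LAST_SOURCE_KEY, previous);
        }
        vmDetails.put(CURRENT_KEY, newLocation.name());
    }

    /** Returns where the old ISO lives on the source host, i.e. where to delete it. */
    static Location cleanupLocation(Map<String, String> vmDetails) {
        String location = vmDetails.getOrDefault(LAST_SOURCE_KEY, vmDetails.get(CURRENT_KEY));
        if (location == null) {
            throw new IllegalStateException("No config drive location recorded for this VM");
        }
        return Location.valueOf(location);
    }

    public static void main(String[] args) {
        Map<String, String> details = new HashMap<>();
        details.put(CURRENT_KEY, Location.HOST_CACHE.name());

        // Settings changed after VM creation, so the new drive lands on primary storage.
        prepareMigration(details, Location.PRIMARY);

        System.out.println("delete old ISO from: " + cleanupLocation(details));
        System.out.println("new ISO lives on: " + details.get(CURRENT_KEY));
    }
}
```

Without the separate key, the cleanup step would read only the current location and look for the old ISO where it was never written.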
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Manually tested maintenance on a host with running VMs whose config drives were on host cache and on secondary storage. The running VMs were migrated to other available hosts and the old config drives were removed.
How did you try to break this feature and the system with this change?