Skip to content

Conversation

@sureshanaparti
Copy link
Contributor

@sureshanaparti sureshanaparti commented Jan 5, 2024

Description

Some latest active snapshots / vm snapshots are stuck in allocated / creating state when MS is stopped, these are listed / shown in UI as well (not allowed to delete). Remove allocated ones on MS start itself, and move creating ones to Error state (these will be removed by the storage garbage collector).

This PR removes snapshots / vm snapshots in allocated state on MS start, and moves creating one to Error state.

Fixes #8424

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Manually tested: Take VM Snapshot -> Stop MS -> Start MS

VM Snapshot record after MS stopped =>

                 id: 1
               uuid: 07542708-27b7-47c4-959a-f7f92828f43a
               name: i-2-3-VM_VS_20240108075812
       display_name: testvm01-snap
        description: NULL
              vm_id: 3
         account_id: 2
          domain_id: 1
service_offering_id: 1
   vm_snapshot_type: DiskAndMemory
              state: Allocated
             parent: NULL
            current: NULL
       update_count: 0
            updated: NULL
            created: 2024-01-08 07:58:12
            removed: NULL

VM Snapshot record after MS started (Not listed / shown in the UI) =>

                 id: 1
               uuid: 07542708-27b7-47c4-959a-f7f92828f43a
               name: i-2-3-VM_VS_20240108075812
       display_name: testvm01-snap
        description: NULL
              vm_id: 3
         account_id: 2
          domain_id: 1
service_offering_id: 1
   vm_snapshot_type: DiskAndMemory
              state: Allocated
             parent: NULL
            current: NULL
       update_count: 0
            updated: NULL
            created: 2024-01-08 07:58:12
            removed: 2024-01-08 07:59:26

How did you try to break this feature and the system with this change?

@sureshanaparti
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov
Copy link

codecov bot commented Jan 5, 2024

Codecov Report

❌ Patch coverage is 1.44928% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 16.17%. Comparing base (253ac03) to head (56f5195).
⚠️ Report is 13 commits behind head on 4.20.

Files with missing lines Patch % Lines
...stack/framework/jobs/impl/AsyncJobManagerImpl.java 2.70% 36 Missing ⚠️
...tack/storage/snapshot/SnapshotDataFactoryImpl.java 0.00% 11 Missing and 1 partial ⚠️
.../storage/vmsnapshot/DefaultVMSnapshotStrategy.java 0.00% 8 Missing ⚠️
.../storage/vmsnapshot/ScaleIOVMSnapshotStrategy.java 0.00% 8 Missing ⚠️
...a/com/cloud/vm/snapshot/VMSnapshotManagerImpl.java 0.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.20    #8452      +/-   ##
============================================
- Coverage     16.17%   16.17%   -0.01%     
- Complexity    13297    13298       +1     
============================================
  Files          5656     5656              
  Lines        498136   498218      +82     
  Branches      60432    60451      +19     
============================================
+ Hits          80584    80585       +1     
- Misses       408584   408659      +75     
- Partials       8968     8974       +6     
Flag Coverage Δ
uitests 4.00% <ø> (+<0.01%) ⬆️
unittests 17.02% <1.44%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@weizhouapache
Copy link
Member

@sureshanaparti
would it be better to clean the snapshots by storage garbage collector which run periodically ?

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8207

@sureshanaparti
Copy link
Contributor Author

@sureshanaparti would it be better to clean the snapshots by storage garbage collector which run periodically ?

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

@weizhouapache
Copy link
Member

@sureshanaparti would it be better to clean the snapshots by storage garbage collector which run periodically ?

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

@sureshanaparti
the storage garbage collector cleans up the primary and secondary storage, including the snapshot/volumes/templates

                    // remove snapshots in Error state
                    List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Error);

We could add the cleanup of Allocated snapshot and Allocated/Error vm snapshots in the process

IMHO, it is not the right place to cleanup during mgmt server start, as mgmt server might be running for several days.

@sureshanaparti sureshanaparti marked this pull request as ready for review January 8, 2024 08:08
@sureshanaparti
Copy link
Contributor Author

sureshanaparti commented Jan 8, 2024

@sureshanaparti would it be better to clean the snapshots by storage garbage collector which run periodically ?

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

@sureshanaparti the storage garbage collector cleans up the primary and secondary storage, including the snapshot/volumes/templates

                    // remove snapshots in Error state
                    List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Error);

We could add the cleanup of Allocated snapshot and Allocated/Error vm snapshots in the process

IMHO, it is not the right place to cleanup during mgmt server start, as mgmt server might be running for several days.

@weizhouapache Agreed for the resources in error state (as there might be things to reset / cleanup) and MS is running. snapshot / vm snapshots are in Allocated, mostly for the very recent active snapshot / vm snapshots before MS is stopped, so no need for any cleanup, better to set them as removed (Any event through state transition would still keep it in Allocated). Otherwise, at least have to not list them on UI, or allow explicit deletion from the UI.

Snapshots in destroying state are deleted here. So, I thought, it's better to remove there on MS start.

//destroy snapshots in destroying state
List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Destroying);
for (SnapshotVO snapshotVO : snapshots) {
try {
if (!deleteSnapshot(snapshotVO.getId(), null)) {

@weizhouapache
Copy link
Member

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

@sureshanaparti the storage garbage collector cleans up the primary and secondary storage, including the snapshot/volumes/templates

                    // remove snapshots in Error state
                    List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Error);

We could add the cleanup of Allocated snapshot and Allocated/Error vm snapshots in the process
IMHO, it is not the right place to cleanup during mgmt server start, as mgmt server might be running for several days.

@weizhouapache Agreed for the resources in error state (as there might be things to reset / cleanup) and MS is running. snapshot / vm snapshots are in Allocated, mostly for the very recent active snapshot / vm snapshots before MS is stopped, so no need for any cleanup, better to set them as removed (Any event through state transition would still keep it in Allocated). Otherwise, at least have to not list them on UI, or allow explicit deletion from the UI.

Snapshots in destroying state are deleted here. So, I thought, it's better to remove there on MS start.

//destroy snapshots in destroying state
List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Destroying);
for (SnapshotVO snapshotVO : snapshots) {
try {
if (!deleteSnapshot(snapshotVO.getId(), null)) {

@sureshanaparti
the mgmt service is a key service for some users, especially on production.
the service is running for several days without stopping.
IMHO it is not good to cleanup resource during start ...

(OK with cleaning up resource in both start process and garbage collector)

@sureshanaparti
Copy link
Contributor Author

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

@sureshanaparti the storage garbage collector cleans up the primary and secondary storage, including the snapshot/volumes/templates

                    // remove snapshots in Error state
                    List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Error);

We could add the cleanup of Allocated snapshot and Allocated/Error vm snapshots in the process
IMHO, it is not the right place to cleanup during mgmt server start, as mgmt server might be running for several days.

@weizhouapache Agreed for the resources in error state (as there might be things to reset / cleanup) and MS is running. snapshot / vm snapshots are in Allocated, mostly for the very recent active snapshot / vm snapshots before MS is stopped, so no need for any cleanup, better to set them as removed (Any event through state transition would still keep it in Allocated). Otherwise, at least have to not list them on UI, or allow explicit deletion from the UI.
Snapshots in destroying state are deleted here. So, I thought, it's better to remove there on MS start.

//destroy snapshots in destroying state
List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Destroying);
for (SnapshotVO snapshotVO : snapshots) {
try {
if (!deleteSnapshot(snapshotVO.getId(), null)) {

@sureshanaparti the mgmt service is a key service for some users, especially on production. the service is running for several days without stopping. IMHO it is not good to cleanup resource during start ...

(OK with cleaning up resource in both start process and garbage collector)

ok, will check & update. thanks @weizhouapache

@sureshanaparti sureshanaparti marked this pull request as draft January 8, 2024 10:16
@DaanHoogland DaanHoogland added this to the 4.19.1.0 milestone Jan 22, 2024
@github-actions
Copy link

github-actions bot commented Feb 8, 2024

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@DaanHoogland
Copy link
Contributor

@sureshanaparti , does this need rebasing?

@DaanHoogland
Copy link
Contributor

on another note, is it maybe safe to first mark it as State == Error and then delete? (I think this is the same @weizhouapache suggests)

@sureshanaparti sureshanaparti changed the title Remove allocated snapshots / vm snapshots on start [WIP] Remove allocated snapshots / vm snapshots on start Jun 25, 2024
@sureshanaparti sureshanaparti modified the milestones: 4.19.1.0, 4.19.2.0 Jun 25, 2024
@DaanHoogland
Copy link
Contributor

@sureshanaparti , any progress on this?

@DaanHoogland DaanHoogland changed the base branch from main to 4.19 February 4, 2025 14:12
@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 12329

@DaanHoogland DaanHoogland removed this from the 4.19.2 milestone Feb 7, 2025
@weizhouapache weizhouapache changed the base branch from 4.19 to 4.20 September 11, 2025 14:44
@github-actions
Copy link

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

Copy link
Member

@vishesh92 vishesh92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@sureshanaparti sureshanaparti force-pushed the remove-allocated-snapshots-on-start branch from 6c7d975 to 549842b Compare September 12, 2025 14:10
@sureshanaparti
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 14985

@weizhouapache
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-14314)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 54326 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8452-t14314-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestClusterDRS>:setup Error 0.00 test_cluster_drs.py

@weizhouapache
Copy link
Member

@sureshanaparti

tested mgmt server restart during volume snapshot and vm snapshot

  • vm snapshot is transit from Creating to Error state
  • volume snapshot is transit from Creating to Error state too. However, the volume is still snapshotting

… job not finished and snapshot in Creating state
@sureshanaparti
Copy link
Contributor Author

@sureshanaparti

tested mgmt server restart during volume snapshot and vm snapshot

  • vm snapshot is transit from Creating to Error state
  • volume snapshot is transit from Creating to Error state too. However, the volume is still snapshotting

updated @weizhouapache, to change volume back to Ready (from Snapshotting).

@sureshanaparti
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15059

@weizhouapache
Copy link
Member

tested ok

@blueorangutan test

@blueorangutan
Copy link

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-14416)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 54019 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8452-t14416-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@weizhouapache weizhouapache merged commit 40dec99 into apache:4.20 Sep 23, 2025
25 of 26 checks passed
@DaanHoogland DaanHoogland deleted the remove-allocated-snapshots-on-start branch September 24, 2025 07:06
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Oct 17, 2025
…g ones to Error on MS start (apache#8452)

* Remove allocated snapshots / vm snapshots on start

* Check and Cleanup snapshots / vm snapshots on MS start

* rebase fixes

* Update volume state (from Snapshotting) on MS start when its snapshot job not finished and snapshot in Creating state
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Vm Snapshot left in allocated state are not cleaned up

7 participants