KVM incremental snapshot feature #9270
Conversation
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #9270 +/- ##
===========================================
Coverage 16.40% 16.41%
- Complexity 13590 13622 +32
===========================================
Files 5692 5699 +7
Lines 501976 503131 +1155
Branches 60795 60940 +145
===========================================
+ Hits 82369 82568 +199
- Misses 410449 411387 +938
- Partials 9158 9176 +18
Good job @JoaoJandre
second that, tnx
@blueorangutan test rocky8 kvm-rocky8
@JoaoJandre nice one. Just one remark: the following sentence sounds contradictory to me: "Additionally, this functionality is only available when using file based storage, such as shared mount-point (iSCSI and FC)". If it supports iSCSI and FC (through a shared mountpoint), it does support block storage. I think the phrasing could cause some confusion as to which types of storage are supported.
Hey @alexandremattioli, I understand your confusion. However, when using shared mount-point, as far as ACS is concerned, the storage is file-based: we will not be working with blocks directly, only files (as ACS already does for shared mount point). The mentions in parentheses are there to give examples of underlying storages that might be behind the shared mount-point. I have updated the description to add a little more context.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
hm, seen this a couple of times now; the bot removes and adds the has-conflicts label in the same second, and a PR without conflicts ends up being marked as having them :(
@sureshanaparti The PR already meets our standard prerequisites for merging. There is an exponential number of tests that can be done on any PR, mixing and matching distros, versions and so on; eventually we have to draw the line on when to stop testing. We have already tested with multiple distros and multiple storage types, and we have tested upgrades from older versions. This PR only focuses on NFS/SharedMountPoint/LocalStorage storages; thus, other types of storages are not affected. If you do not have a specific concern regarding the PR, I will merge it. In the future, when a bug is reported we will fix it, as has been done for any other feature.
Hi @JoaoJandre & all concerned, I feel we have to check the behavior with
Merging based on approvals and testing done by @slavkap, @sureshanaparti, @winterhazel and @bernardodemarco. @sureshanaparti, if you think that this PR needs further testing, it may be done while the feature is in the main branch, until 4.21 is released, which is when we test everything again. If any issues arise, we can fix them before the release. Let me know if you find something, and I will work to patch it ASAP.
@JoaoJandre, I think both @sureshanaparti and @slavkap still want to do tests, which is fine by me.
Thanks for the work @JoaoJandre and all involved. FWIW, my only concern remains upgrade testing: when a user upgrades to a version with this feature, how will it behave for existing instances and their snapshots (also upgrading/transitioning between qcow2 v2 and v3)? Were any upgrade-related tests done for the storages supported by this feature (i.e. NFS, shared mount point and local storage) and with the supported KVM distros (Ubuntu 22.04, 24.04; EL 8 & 9; openSUSE 15)?
I also want to thank the people involved with this PR, but I also want to note that the way it was merged was not correct. I expected:
Hello @rohityadavcloud, this feature is disabled by default, so upgrading should not pose any issues to users. This was also tested here: #9270 (comment). The feature was tested with multiple distros, and I personally tested it with local, NFS, and shared mount point storage systems.
Hi @slavkap
To be honest, it is very sad how the community handles some PRs. We have lots and lots of PRs that affected a lot more code without the same level of documentation, clarity, attention, and tests as this one; to see this kind of pushback now is disappointing. If we want to raise the bar on PRs, let's apply the same standard to all PRs, not only to a handpicked one.
Sorry, I disagree @JoaoJandre. We must be able to cherry-pick and give more scrutiny to PRs that we consider more risky. I understand your disappointment, as I see this PR has seen long periods of silence, and also that you have had other people do extensive testing on your change. I also don't think you were very wrong in merging it, but there are people that would have wanted to test it more before merging. On the one hand they did say that, and on the other hand they were not as explicit as they could have been. I do not think much harm is done either way. Let's all just continue and make sure v21 is going to be tested extensively, including this functionality.
@JoaoJandre, I’m a bit concerned by the tone you’re setting because a community member is expressing their point of view. I’ve been in the community for almost six years, and this is the first time I've seen such a tone in a discussion.
I don't think that lazy consensus applies in this case. If this were correct, then all PRs could be merged after 72h if there aren't any objections? You can check the last 2 paragraphs on how patches are applied - link
Requesting that everything gets reviewed within 72 hours does not reflect the reality that most people working on this project are doing it in their free time; forcing a response time on them is not going to be productive.
Two people were from your company, and most of their scenarios are handled by the smoke tests (which are for NFS storage only); the other two people were still testing this.
Most of the users are using those 3rd party plugins. And no, I do not expect you to test this with 3rd party plugins, but to ping the people who are involved with those 3rd party plugins just to be sure that you’ll not break the functionality for them. There was still a testing process on your PR, meaning there is activity on it and it won’t be stuck forever. You just needed to have some patience …
From a cursory glance, that doesn't seem to be the case.
This is the reason I'm raising this - the standard should be applied to all. Your help in this will be appreciated :) |
@slavkap, please revisit the Apache Way and Individuals compose the ASF; your statement and criteria go against our community principles.
@GutoVeronezi , I don’ t see how @slavkap is going against the Apache way. I think we should stop this discussion now, without any further remarks on this thread. |
@DaanHoogland, from the links:
Hello @DaanHoogland, I understand the point that a PR that changes a single line in the UI needs less scrutiny than one that changes 3k lines and 75 files. However, my point is that many PRs with the same impact as mine (or even more) are not scrutinized as much. Thus, I'm left wondering what this concern is really about: the code, or something else. To give you a few examples:
Again, it seems like the concern is not based on technical aspects.
I know that we do not use lazy consensus to merge PRs. However, I was not referring to using lazy consensus to merge (we already had the prerequisites for merging); I was referring to waiting for your reply regarding the problems you originally reported, which I reproduced and fixed. I did not expect the community to validate the PR within 72 hours, and it was not done within this timeframe. The PR was open for 10 months. The last test you performed was 2 months ago, and after I fixed the concerns, I did not hear a reply. I think it was reasonable to assume that you were OK with it and were not going to test anymore.
We are a community; inside the Apache Software Foundation, we should conduct ourselves as individuals, not as part of a company.
Looking at the PR's history, we can see that I have been waiting months at a time for replies. Again, if you find an issue, please report it so I can fix it as soon as possible.
Please see my reply to Daan; I gave some examples of which PRs I was talking about.
@JoaoJandre, I understand your frustration, and you are free to scrutinize others' PRs to the same level. Also, I do not think you were wrong in merging it! That does not mean that others cannot hold a different opinion, though. I am pretty sure (as far as I have heard) that the concerns were purely technical.
since you mentioned one of my PRs, I think I should explain
In my opinion, I think this PR is not comparable with my PR
IMHO, networking might be the most complex component of cloud platforms, but storage is the most important one. Users or customers can accept a few minutes of downtime, but data loss is totally unacceptable. I totally understand that the community is very careful with this PR, due to the risk of potential data loss. As I remember, around 2 years ago there was a PR related to volume or VM migration which led to corrupted images (sorry, I do not remember clearly). Many community members involved are experienced engineers; they have their own analysis of the risk. Actually, some community members are testing or going to test this PR.
Hi guys, sorry I've not been entirely involved in this PR, but I would like to give my opinion based on the conversations from the last couple of days.
Hi @weizhouapache,
I selected the first big changes that had been recently merged and did not think about who the author was. My objective is not to point fingers at a single person (the PRs I mentioned all have different authors); I simply meant to show that what is happening here is not the norm in our community.
I would argue that incremental snapshots are a new feature for KVM as well. Furthermore, the feature is disabled by default; therefore, if there is an issue or nobody wants to use it, it is a matter of not activating the setting. Also, it only affects (when activated) file-based storage types such as NFS, local storage, and shared mount point storage (e.g. iSCSI/FC with OCFS2/Gluster/or any other) with KVM.
Fair point. I also think that the upgrade tests done in this PR are more than enough.
I remember this, I was the one who eventually fixed the bug that the PR you are mentioning was trying to fix (see the last paragraph of the description on #8911).
I am well aware of the work the community's experienced engineers have put into the community. I hope that they can validate the PR and report any issues, so I can fix them, as has been done numerous times within the community.
Hi @nvazquez,
I would like to note that I gave notice saying I was going to merge: #9270 (comment). But the only reply I got was that "more tests would be good" (some of which were already done). If I got something along the lines of "Hey, I'm testing this, will post in the next days", I would have waited. But simply stating "I need to perform more tests", taking into account the context of the PR, was not a valid concern in my opinion. As I said before, we can imagine an infinite amount of tests that can be done in a platform as complex as ours, but we have to draw the line somewhere.
Thanks :)
@DaanHoogland, @nvazquez, @weizhouapache, @GutoVeronezi, @slavkap, @rohityadavcloud and @sureshanaparti, I think that what has happened is very clear to everyone, and further discussion will only waste everyone's time. I would like to state again: if you find any bugs in the feature, please report them so I can patch them as soon as possible. I will go with the suggestion @DaanHoogland gave yesterday: "I think we should stop this discussion now, without any further remarks on this thread."
* KVM incremental snapshot feature
* fix log
* fix merge issues
* fix creation of folder
* fix snapshot update
* Check for hypervisor type during parent search
* fix some small bugs
* fix tests
* Address reviews
* do not remove storPool snapshots
* add support for downloading diff snaps
* Add multiple zones support
* make copied snapshots have normal names
* address reviews
* Fix in progress
* continue fix
* Fix bulk delete
* change log to trace
* Start fix on multiple secondary storages for a single zone
* Fix multiple secondary storages for a single zone
* Fix tests
* fix log
* remove bitmaps when deleting snapshots
* minor fixes
* update sql to new file
* Fix merge issues
* Create new snap chain when changing configuration
* add verification
* Fix snapshot operation selector
* fix bitmap removal
* fix chain on different storages
* address reviews
* fix small issue
* fix test

---------
Co-authored-by: João Jandre <[email protected]>


Description
This PR solves issue #8907.
Currently, when taking a volume snapshot/backup with KVM as the hypervisor, it is always a full snapshot/backup. However, always taking full snapshots of volumes is costly for both the storage network and the storage systems. To address this, this PR extends the volume snapshot feature on KVM, allowing users to create incremental volume snapshots when using KVM as the hypervisor.
To give operators control over which type of snapshot is created, a new global setting `kvm.incremental.snapshot` has been added, which can be changed at the zone and cluster scopes; this setting is false by default. Also, the `snapshot.delta.max` configuration, previously used to control the maximum number of deltas when using XenServer, was extended to also limit the size of the backing chain of snapshots on primary/secondary storage.

This functionality is only available in environments with Libvirt 7.6.0+ and QEMU 6.1+. If the `kvm.incremental.snapshot` setting is true and the hosts do not have the required Libvirt and QEMU versions, an error will be thrown when trying to take a snapshot. Additionally, this functionality is only available when using file-based storage, such as shared mount-point (iSCSI and FC that require a shared mount-point storage file system for KVM, such as OCFS2 or GlusterFS), NFS, and local storage. Other storage types for KVM, such as CLVM and RBD, need different approaches to enable incremental backups; therefore, these are not currently supported. Issue #8907 has more details and flowcharts of all the mapped workflows.
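As a rough illustration of the gating described above (not the PR's actual Java implementation; all names and the function shape are hypothetical), the decision between a full and an incremental snapshot could be sketched as:

```python
# Hypothetical sketch of the decision flow described in the text above.
# Version thresholds come from the PR description; everything else is illustrative.

MIN_LIBVIRT = (7, 6, 0)   # incremental snapshots require Libvirt 7.6.0+
MIN_QEMU = (6, 1, 0)      # ...and QEMU 6.1+

def supports_incremental(libvirt_version, qemu_version):
    """True if the host meets the minimum Libvirt/QEMU versions."""
    return libvirt_version >= MIN_LIBVIRT and qemu_version >= MIN_QEMU

def next_snapshot_type(incremental_enabled, chain_length, delta_max,
                       libvirt_version, qemu_version):
    """Decide whether the next snapshot is 'full' or 'incremental'.

    chain_length: number of deltas already in the volume's backing chain.
    delta_max: the snapshot.delta.max setting, capping the chain size.
    """
    if not incremental_enabled:
        return "full"
    if not supports_incremental(libvirt_version, qemu_version):
        # Per the description, an error is thrown rather than silently
        # falling back to a full snapshot.
        raise RuntimeError("kvm.incremental.snapshot is enabled but the host "
                           "lacks Libvirt 7.6.0+/QEMU 6.1+")
    # Once the backing chain reaches the configured maximum, start a new
    # chain with a full snapshot; otherwise take an incremental one.
    return "full" if chain_length >= delta_max else "incremental"

print(next_snapshot_type(True, 1, 3, (8, 0, 0), (6, 2, 0)))  # incremental
print(next_snapshot_type(True, 3, 3, (8, 0, 0), (6, 2, 0)))  # full
```

With `snapshot.delta.max = 3`, for example, every fourth snapshot would start a fresh chain, keeping restore chains bounded.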
Docs PR: apache/cloudstack-documentation#423 / apache/cloudstack-documentation#488
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Description of tests
During testing, the `kvm.incremental.snapshot` setting was changed to `true` and the `snapshot.delta.max` setting was changed to 3.

Tests with snapshot.backup.to.secondary = false
For the tests in this section, a test VM was created and reused for all tests.
Snapshot creation tests
Snapshot restore tests
Snapshot removal tests
Template creation test
Tests with snapshot.backup.to.secondary = true
All tests performed in the previous sections were repeated with snapshot.backup.to.secondary = true; in addition, two further tests were performed. For the tests in this section, a test VM was created and reused for all tests.
Snapshot creation tests
I have also tested that the bitmaps are removed once the snapshots are deleted.
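For context on that last check: incremental snapshots rely on a dirty bitmap kept alongside each snapshot, so deleting a snapshot must also remove its bitmap or orphaned bitmaps would accumulate on the base image. A toy model of that cleanup invariant (hypothetical structure, not CloudStack's or QEMU's actual bookkeeping):

```python
# Toy model of the cleanup invariant: every snapshot has an associated
# dirty bitmap, and deleting the snapshot must also delete the bitmap.
# Names and structure are illustrative only.

class Volume:
    def __init__(self):
        self.snapshots = set()
        self.bitmaps = set()   # stands in for per-snapshot dirty bitmaps

    def take_snapshot(self, name):
        self.snapshots.add(name)
        self.bitmaps.add(name)  # a bitmap tracks changes since this snapshot

    def delete_snapshot(self, name):
        self.snapshots.discard(name)
        self.bitmaps.discard(name)  # cleanup: no orphaned bitmaps remain

vol = Volume()
for s in ("snap1", "snap2", "snap3"):
    vol.take_snapshot(s)
vol.delete_snapshot("snap2")
assert "snap2" not in vol.bitmaps  # bitmap removed with its snapshot
```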