
Conversation

@JoaoJandre
Contributor

@JoaoJandre JoaoJandre commented Jun 18, 2024

Description

This PR solves issue #8907.

Currently, when taking a volume snapshot/backup with KVM as the hypervisor, it is always a full snapshot/backup. However, always taking full snapshots of volumes is costly for both the storage network and storage systems. To solve the aforementioned issues, this PR extends the volume snapshot feature in KVM, allowing users to create incremental volume snapshots using KVM as a hypervisor.

To give operators control over which type of snapshot is being created, a new global setting kvm.incremental.snapshot has been added, which can be changed at zone and cluster scopes; this setting is false by default. Also, the snapshot.delta.max configuration, used to control the maximum deltas when using XenServer, was extended to also limit the size of the backing chain of snapshots on primary/secondary storage.
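The interplay between the backing chain and snapshot.delta.max can be sketched as follows. This is an illustrative Python sketch of the behavior described above, not CloudStack's actual implementation (which is in Java), and the function name is hypothetical:

```python
# Hypothetical sketch: once the backing chain of incremental snapshots
# reaches snapshot.delta.max, the next snapshot starts a new chain as a
# full snapshot. Not CloudStack's actual code.

def next_snapshot_type(chain_length: int, delta_max: int) -> str:
    """Return which kind of snapshot would be taken next, given the number
    of snapshots already in the current backing chain."""
    if chain_length == 0 or chain_length >= delta_max:
        return "full"  # no chain yet, or the chain is at its limit
    return "incremental"
```

For example, with snapshot.delta.max = 3 (the value used in the tests described later), the fourth snapshot of a chain would again be a full one.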

This functionality is only available in environments with Libvirt 7.6.0+ and qemu 6.1+. If the kvm.incremental.snapshot setting is true and the hosts do not have the required Libvirt and qemu versions, an error will be thrown when trying to take a snapshot. Additionally, this functionality is only available when using file-based storage, such as shared mount-point (iSCSI and FC, which require a shared mount-point storage file system for KVM such as OCFS2 or GlusterFS), NFS, and local storage. Other storage types for KVM, such as CLVM and RBD, need different approaches to enable incremental backups; therefore, they are not currently supported.
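The version requirement can be illustrated with a small sketch. This is hypothetical Python (CloudStack itself is Java) showing only the gating logic described in the text; all names are illustrative:

```python
# Hypothetical sketch of the version gate: incremental snapshots require
# Libvirt 7.6.0+ and qemu 6.1+; otherwise an error is raised when
# kvm.incremental.snapshot is enabled. Not CloudStack's actual code.

MIN_LIBVIRT = (7, 6, 0)
MIN_QEMU = (6, 1, 0)

def parse_version(version: str) -> tuple:
    """Turn a dotted version string like '7.6.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def supports_incremental_snapshots(libvirt_version: str, qemu_version: str) -> bool:
    return (parse_version(libvirt_version) >= MIN_LIBVIRT
            and parse_version(qemu_version) >= MIN_QEMU)

def check_host(libvirt_version: str, qemu_version: str) -> None:
    """Raise, as the PR describes, when the setting is enabled but the host
    cannot take an incremental snapshot."""
    if not supports_incremental_snapshots(libvirt_version, qemu_version):
        raise RuntimeError(
            "Incremental snapshots require Libvirt 7.6.0+ and qemu 6.1+; "
            f"host has Libvirt {libvirt_version} and qemu {qemu_version}")
```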

Issue #8907 has more details and flowcharts of all the mapped workflows.

Docs PR: apache/cloudstack-documentation#423 / apache/cloudstack-documentation#488

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Description of tests

During testing, the kvm.incremental.snapshot setting was changed to true and the snapshot.delta.max setting was changed to 3.

Tests with snapshot.backup.to.secondary = false

For the tests in this section, a test VM was created and reused for all tests.

Snapshot creation tests

| Test | Result |
|---|---|
| Access the VM, create a file in it, and create volume snapshot 1 while the VM is running | Full snapshot created |
| Access the VM, create a second file in it, and create volume snapshot 2 while the VM is running | Incremental snapshot created with correct size and backing chain (snapshot 1) |
| Stop the VM and create volume snapshot 3 | Incremental snapshot correctly created |
| Start the VM again and create volume snapshot 4 | Full snapshot created |
| Migrate the VM and create volume snapshot 5 | Incremental snapshot created from snapshot 4 |
| Migrate the VM + ROOT volume | Exception |

Snapshot restore tests

| Test | Result |
|---|---|
| Access the VM, delete all previously created files, stop the VM, restore snapshot 1, and start the VM again | Restoration performed correctly; the file created in snapshot creation test 1 was present on the volume |
| Access the VM, delete the file restored in the previous test, stop the VM, restore snapshot 2, and start the VM again | Restoration performed correctly; the files created in snapshot creation tests 1 and 2 were present on the volume |

Snapshot removal tests

| Test | Result |
|---|---|
| Delete snapshot 5 | Snapshot deleted and removed from storage |
| Delete snapshot 1 | Snapshot deleted but not removed from storage |
| Delete snapshots 2 and 3 | Snapshots deleted and removed from storage; furthermore, snapshot 1 was also removed from storage |
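The removal results above follow from the backing-chain dependency: a snapshot's file can only be reclaimed from storage once nothing references it anymore. A hedged Python sketch of that rule (illustrative only, not CloudStack's implementation; all names are hypothetical):

```python
# Illustrative sketch: a logically deleted snapshot is physically removed
# from storage only when every snapshot that (transitively) backs onto it
# has been deleted as well. Not CloudStack's actual code.

def reclaimable(snapshot: str, deleted: set, children: dict) -> bool:
    """True if the snapshot's file can be removed from storage."""
    if snapshot not in deleted:
        return False
    # All descendants in the chain must be deleted too.
    return all(reclaimable(child, deleted, children)
               for child in children.get(snapshot, []))
```

With the chain 1 ← 2 ← 3 from the tests, deleting only snapshot 1 keeps its file on storage, while deleting snapshots 2 and 3 afterwards allows snapshot 1's file to be removed as well, matching the table above.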

Template creation test

| # | Test | Result |
|---|---|---|
| 1 | Create a template from snapshot 4 and create a VM using the template | Template created correctly; the VM had the files created in the original VM |

Tests with snapshot.backup.to.secondary = true

All tests performed in the previous sections were repeated with snapshot.backup.to.secondary = true; in addition, two more tests were performed. For the tests in this section, a test VM was created and reused for all tests.

Snapshot creation tests

| N | Test | Result |
|---|---|---|
| 1 | Migrate the VM + ROOT volume and take snapshot 6 | Migration carried out and full snapshot created |
| 2 | Stop the VM, migrate the volume, and take snapshot 7 | Volume migration performed and incremental snapshot created from snapshot 6 |

I have also tested that the bitmaps are removed once the snapshots are deleted.

@codecov

codecov bot commented Jun 18, 2024

Codecov Report

Attention: Patch coverage is 17.54875% with 1184 lines in your changes missing coverage. Please review.

Project coverage is 16.41%. Comparing base (39c5641) to head (51d5f65).
Report is 84 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ud/hypervisor/kvm/storage/KVMStorageProcessor.java | 9.63% | 318 Missing and 1 partial ⚠️ |
| ...om/cloud/storage/snapshot/SnapshotManagerImpl.java | 3.27% | 111 Missing and 7 partials ⚠️ |
| ...storage/datastore/db/SnapshotDataStoreDaoImpl.java | 3.26% | 88 Missing and 1 partial ⚠️ |
| ...ervisor/kvm/resource/LibvirtComputingResource.java | 29.16% | 66 Missing and 2 partials ⚠️ |
| ...oudstack/storage/snapshot/SnapshotServiceImpl.java | 22.07% | 60 Missing ⚠️ |
| ...rce/wrapper/LibvirtRemoveBitmapCommandWrapper.java | 1.66% | 59 Missing ⚠️ |
| ...he/cloudstack/storage/snapshot/SnapshotObject.java | 2.56% | 38 Missing ⚠️ |
| .../wrapper/LibvirtConvertSnapshotCommandWrapper.java | 2.63% | 37 Missing ⚠️ |
| ...apache/cloudstack/storage/to/SnapshotObjectTO.java | 12.50% | 34 Missing and 1 partial ⚠️ |
| ...torage/datastore/ObjectInDataStoreManagerImpl.java | 0.00% | 35 Missing ⚠️ |

... and 31 more
Additional details and impacted files
@@             Coverage Diff             @@
##               main    #9270     +/-   ##
===========================================
  Coverage     16.40%   16.41%             
- Complexity    13590    13622     +32     
===========================================
  Files          5692     5699      +7     
  Lines        501976   503131   +1155     
  Branches      60795    60940    +145     
===========================================
+ Hits          82369    82568    +199     
- Misses       410449   411387    +938     
- Partials       9158     9176     +18     
| Flag | Coverage Δ |
|---|---|
| uitests | 4.00% <ø> (ø) |
| unittests | 17.27% <17.54%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@weizhouapache
Member

Good job @JoaoJandre

@DaanHoogland DaanHoogland added this to the 4.20.0.0 milestone Jun 19, 2024
@DaanHoogland
Contributor

Good job @JoaoJandre

second that, tnx

@weizhouapache
Member

@blueorangutan test rocky8 kvm-rocky8

@alexandremattioli
Contributor

@JoaoJandre nice one. Just one remark, the following sentence sounds contradictory to me "Additionally, this functionality is only available when using file based storage, such as shared mount-point (iSCSI and FC)", if it supports iSCSI and FC (through a shared mountpoint) it does support block storage, I think the phrasing could cause some confusion as to which types of storage are supported.

@JoaoJandre
Contributor Author

@JoaoJandre nice one. Just one remark, the following sentence sounds contradictory to me "Additionally, this functionality is only available when using file based storage, such as shared mount-point (iSCSI and FC)", if it supports iSCSI and FC (through a shared mountpoint) it does support block storage, I think the phrasing could cause some confusion as to which types of storage are supported.

Hey @alexandremattioli, I understand your confusion. However, when using shared mount-point, as far as ACS is concerned, the storage is file-based; we will not be working with blocks directly, only files (as ACS already does for shared mount-point). The mentions in parentheses are there to give examples of underlying storage that might be behind the shared mount-point.

I have updated the description to add a little more context.

@github-actions

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@slavkap slavkap self-requested a review July 1, 2024 08:48
@DaanHoogland
Contributor

hm, I've seen this a couple of times now; the bot removes and adds the has-conflicts label in the same second, and a PR without conflicts ends up being marked as having them :(

@github-actions

github-actions bot commented Jul 6, 2024

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@JoaoJandre
Contributor Author

@sureshanaparti are all your concerns met?

it's better to have some upgrade tests from 4.19.x as well, also tests with some more Linux distros with different supported storages, and at least one unsupported storage (to make sure there are no issues / any kind of data loss when incremental snapshots are enabled).

@winterhazel @JoaoJandre @bernardodemarco @slavkap I'm currently looking at the distros & storages in the below matrix (it's for my reference to ensure someone covers the upgrade test and missing distros / storages; no action needed from your side if you don't have bandwidth for it. I'll check with others and have them covered). You can update/add if you have covered any of the distros / storages (local storage/etc.) - marked NFS assuming the earlier tests were covered there.

@sureshanaparti The PR already meets our standard prerequisites for merging. There is an exponential number of tests that can be done in any PR mixing and matching distros, versions and so on; eventually we have to draw the line when to stop testing. We have already tested with multiple distros, multiple storage types and we have tested upgrades from older versions. This PR only focuses on NFS/SharedMountPoint/LocalStorage storages; thus other types of storages are not affected.

If you do not have a specific concern regarding the PR, I will merge it. In the future, when a bug is reported we will fix it, as it has been done for any other feature.

@sureshanaparti
Contributor

@sureshanaparti The PR already meets our standard prerequisites for merging. There is an exponential number of tests that can be done in any PR mixing and matching distros, versions and so on; eventually we have to draw the line when to stop testing. We have already tested with multiple distros, multiple storage types and we have tested upgrades from older versions. This PR only focuses on NFS/SharedMountPoint/LocalStorage storages; thus other types of storages are not affected.

If you do not have a specific concern regarding the PR, I will merge it. In the future, when a bug is reported we will fix it, as it has been done for any other feature.

Hi @JoaoJandre & all concerned, I feel we have to check the behavior with

  • libvirt < 7.6.0 and qemu < 6.1 versions,
  • unsupported storages, and
  • mix of different storages (support + unsupported) as well.
    I'll make sure that some testing is done with the default libvirt/qemu versions on other distros (Ubuntu 22/Debian 12/Oracle Linux 8/Alma Linux 9/Open SUSE 15) before the 4.21 release.

@JoaoJandre
Contributor Author

Merging based on approvals, testing done by @slavkap, @sureshanaparti, @winterhazel and @bernardodemarco.

@sureshanaparti if you think that this PR needs further testing, this may be done while the feature is in the main branch until 4.21 is released; which is the moment where we test everything again. If any issues arise we can fix them before the release. Let me know if you find something, and I will work to patch it ASAP.

@JoaoJandre JoaoJandre merged commit 6fdaf51 into apache:main May 12, 2025
20 of 26 checks passed
@DaanHoogland
Contributor

@JoaoJandre , I think both @sureshanaparti and @slavkap still want to do tests, which is fine by me.
More important; wasn’t there a doc PR to come with this as well?

@rohityadavcloud
Member

rohityadavcloud commented May 12, 2025

Thanks for the work @JoaoJandre and all involved. Fwiw, my only concern remains upgrade testing: when a user upgrades to a version with this feature, how will it behave for existing instances & their snapshots (also upgrading/transitioning b/w qcow v2 vs qcow v3)? Were any upgrade-related tests done for the supported storages for this feature (i.e. NFS, shared mount point & local storage) and with supported KVM distros (Ubuntu 22.04, 24.04; EL 8 & 9; openSUSE 15)?

@slavkap
Contributor

slavkap commented May 13, 2025

I also want to thank the people involved with this PR, but I also want to note that the way it was merged was not correct.

I expected:

  • The results of these test cases after their fix
  • To be tested by more people because this feature affects multiple parties
  • I haven’t tested it extensively with the StorPool plugin, and I guess there haven’t been any tests done with Linstor, PowerFlex, etc. (because most of the changes are in the main functionality for snapshots)
  • When someone approves a PR, there should be a good enough description of the reason

@JoaoJandre
Contributor Author

Thanks for the work @JoaoJandre and all involved. Fwiw, my only concern remains upgrade testing: when a user upgrades to a version with this feature, how will it behave for existing instances & their snapshots (also upgrading/transitioning b/w qcow v2 vs qcow v3)? Were any upgrade-related tests done for the supported storages for this feature (i.e. NFS, shared mount point & local storage) and with supported KVM distros (Ubuntu 22.04, 24.04; EL 8 & 9; openSUSE 15)?

Hello @rohityadavcloud, this feature is disabled by default, so upgrading should not pose any issues to users. This was also tested here #9270 (comment). This feature was tested with multiple distros, and I personally tested it with local, NFS, and shared mount-point storage systems.

@JoaoJandre
Contributor Author

I also want to thank the people involved with this PR, but I also want to note that the way it was merged was not correct.

I expected:

* The results of these [test cases](https://github.com/apache/cloudstack/pull/9270#issuecomment-2751571735) after their fix

* To be tested by more people because this feature affects multiple parties

* I haven’t tested it extensively with the StorPool plugin, and I guess there haven’t been any tests done with Linstor, PowerFlex, etc. (because most of the changes are in the main functionality for snapshots)

* When someone approves a PR, there should be a good enough description of the reason

Hi @slavkap

  1. I answered your tests back in March (see KVM incremental snapshot feature #9270 (comment)), stating that I reproduced and fixed the issues reported by you. You cannot expect me to wait more than 2 months for your follow-up on them. I would like to remind you that the Apache Foundation uses lazy consensus as its decision-making policy: "A decision-making policy which assumes general consent if no responses are posted within a defined period". For example, "I'm going to commit this by lazy consensus if no one objects within the next three days.". Taking into account that I have waited 2 months for your reply, I believe that this should apply.

  2. This was tested by at least 4 different people, excluding myself.

  3. You cannot expect me to test it with all possible 3rd party plugins. The PR was open for 9 months, and the developers of said plugins had plenty of time to test if they wanted. Again, you cannot expect the PR to be stuck forever because of these tests.

  4. Are these not good descriptions to approve a PR: KVM incremental snapshot feature #9270 (review); KVM incremental snapshot feature #9270 (review)? I'm sure that if I pick any random PR that was merged in the last months, the approval messages will not be comparable to these.

To be honest, it is very sad how the community handles some PRs. We have lots and lots of PRs which affected a lot more code without the same level of documentation, clarity, attention, and tests as this one; to see this kind of pushback now is disappointing. If we want to raise the bar on PRs, let's apply the same standard to all PRs, and not only to a handpicked one.

@DaanHoogland
Contributor

If we want to raise the bar on PRs, let's apply the same standard to all PRs, and not only to a handpicked one.

Sorry, I disagree @JoaoJandre. We must be able to cherry-pick and give more scrutiny to PRs that we consider more risky.

I understand your disappointment, as I see this PR has seen long periods of silence, and also that you have had other people do extensive testing on your change. I also don't think you were very wrong in merging it, but there are people who would have wanted to test it more before merging. On the one hand they did say that, and on the other hand they were not as explicit as they could have been. I do not think much harm is done either way.

Let’s all just continue and make sure v21 is going to be tested extensively, including this functionality.

@slavkap
Contributor

slavkap commented May 14, 2025

@JoaoJandre, I’m a bit concerned by the tone you’re setting because a community member is expressing their point of view. I’ve been in the community for almost six years, and this is the first time I've seen such a tone in a discussion.

  1. I answered your tests back in March (see KVM incremental snapshot feature #9270 (comment)), stating that I reproduced and fixed the issues reported by you. You cannot expect me to wait more than 2 months for your follow-up on them. I would like to remind you that the Apache Foundation uses lazy consensus as its decision-making policy: "A decision-making policy which assumes general consent if no responses are posted within a defined period". For example, "I'm going to commit this by lazy consensus if no one objects within the next three days.". Taking into account that I have waited 2 months for your reply, I believe that this should apply.

I don't think that the lazy consensus applies in this case. If this is correct, then all PRs should be merged after 72h if there aren't any objections?

You can check the last 2 paragraphs on how patches are applied - link

Requesting that everything gets reviewed within 72 hours does not reflect the reality that most people working on this project are doing it in their free time, thus forcing a response time on them is not going to be productive.

  1. This was tested by at least 4 different people, excluding myself.

Two people were from your company, and most of their scenarios are handled by the smoke tests (which are for NFS storage only); the other two people were still testing this.

  1. You cannot expect me to test it with all possible 3rd party plugins. The PR was open for 9 months, and the developers of said plugins had plenty of time to test if they wanted. Again, you cannot expect the PR to be stuck forever because of these tests.

Most of the users are using those 3rd party plugins. And no, I do not expect you to test this with 3rd party plugins, but to ping the people who are involved with those 3rd party plugins just to be sure that you’ll not break the functionality for them.

There was still a testing process on your PR, meaning there is activity on it and it won’t be stuck forever. You just needed to have some patience …

I'm sure that if I pick any random PR that was merged in the last months, the approval messages will not be comparable to these.

From a cursory glance, that doesn't seem to be the case.

To be honest, it is very sad how the community handles some PRs. We have lots and lots of PRs which affected a lot more code without the same level of documentation, clarity, attention, and tests as this one; to see this kind of pushback now is disappointing. If we want to raise the bar on PRs, let's apply the same standard to all PRs, and not only to a handpicked one.

This is the reason I'm raising this - the standard should be applied to all. Your help in this will be appreciated :)

@GutoVeronezi
Contributor

GutoVeronezi commented May 14, 2025

Two people were from your company, and most of their scenarios are handled by the smoke tests (which are for NFS storage only); the other two people were still testing this.

@slavkap, please, revisit the Apache Way and Individuals compose the ASF; your statement and criteria go against our community principles.

@DaanHoogland
Contributor

Two people were from your company, and most of their scenarios are handled by the smoke tests (which are for NFS storage only); the other two people were still testing this.

@slavkap, please, revisit the Apache Way and Individuals compose the ASF; your statement and criteria go against our community principles.

@GutoVeronezi , I don’ t see how @slavkap is going against the Apache way. I think we should stop this discussion now, without any further remarks on this thread.

@GutoVeronezi
Contributor

@GutoVeronezi , I don’ t see how @slavkap is going against the Apache way. I think we should stop this discussion now, without any further remarks on this thread.

@DaanHoogland, from the links:

(screenshots of the quoted Apache Way pages omitted)

@JoaoJandre
Contributor Author

Hello @DaanHoogland, I understand the point that a PR that changes a single line in the UI needs less scrutiny than one that changes 3k lines and 75 files. However, my point is that many PRs that have the same impact as mine (or even more) are not scrutinized as much. Thus, I'm left wondering what this concern is really about: the code, or something else.

To give you a few examples:

Again, it seems like the concern is not based on technical aspects.

@JoaoJandre
Contributor Author

JoaoJandre commented May 14, 2025

  1. I answered your tests back in March (see KVM incremental snapshot feature #9270 (comment)), stating that I reproduced and fixed the issues reported by you. You cannot expect me to wait more than 2 months for your follow-up on them. I would like to remind you that the Apache Foundation uses lazy consensus as its decision-making policy: "A decision-making policy which assumes general consent if no responses are posted within a defined period". For example, "I'm going to commit this by lazy consensus if no one objects within the next three days.". Taking into account that I have waited 2 months for your reply, I believe that this should apply.

I don't think that the lazy consensus applies in this case. If this is correct, then all PRs should be merged after 72h if there aren't any objections?

You can check the last 2 paragraphs on how patches are applied - link

Requesting that everything gets reviewed within 72 hours does not reflect the reality that most people working on this project are doing it in their free time, thus forcing a response time on them is not going to be productive.

I know that we do not use lazy consensus to merge PRs. However, I was not referring to using lazy consensus to merge (we already had the prerequisites for merging); I was referring to waiting for your reply regarding the problems you originally reported, which I reproduced and fixed.

I did not expect the community to validate the PR within 72 hours, and it was not done within this timeframe. The PR was open for 10 months. The last test you performed was 2 months ago, and after I fixed the concerns, I did not hear a reply regarding that. I think it was reasonable to assume that you were ok and were not going to test anymore.

  1. This was tested by at least 4 different people, excluding myself.

Two people were from your company, and most of their scenarios are handled by the smoke tests (which are for NFS storage only); the other two people were still testing this.

We are a community; inside the Apache Software Foundation, we should conduct ourselves as individuals, not as parts of a company.

  1. You cannot expect me to test it with all possible 3rd party plugins. The PR was open for 9 months, and the developers of said plugins had plenty of time to test if they wanted. Again, you cannot expect the PR to be stuck forever because of these tests.

Most of the users are using those 3rd party plugins. And no, I do not expect you to test this with 3rd party plugins, but to ping the people who are involved with those 3rd party plugins just to be sure that you’ll not break the functionality for them.

There was still a testing process on your PR, meaning there is activity on it and it won’t be stuck forever. You just needed to have some patience …

Looking at the PR's history, we can see that I have been waiting months at a time for replies.

Again, if you find an issue, please report it so I can fix it as soon as possible.

I'm sure that if I pick any random PR that was merged in the last months, the approval messages will not be comparable to these.

From a cursory glance, that doesn't seem to be the case.

To be honest, it is very sad how the community handles some PRs. We have lots and lots of PRs which affected a lot more code without the same level of documentation, clarity, attention, and tests as this one; to see this kind of pushback now is disappointing. If we want to raise the bar on PRs, let's apply the same standard to all PRs, and not only to a handpicked one.

This is the reason I'm raising this - the standard should be applied to all. Your help in this will be appreciated :)

Please see my reply to Daan, I gave some examples of what PRs I was talking about.

@DaanHoogland
Contributor

Again, it seems like the concern is not based on technical aspects.

@JoaoJandre , I understand your frustration, and you are free to scrutinize others' PRs to the same level. Also, I do not think you were wrong in merging it! That does not mean that others cannot hold a different opinion anyway. I am pretty sure (as far as I have heard) that the concerns were purely technical.

@weizhouapache
Member

@JoaoJandre

since you mentioned one of my PRs, I think I should explain

In my opinion, this PR is not comparable with my PR:

  • this PR changes the existing volume snapshot feature
  • there are multiple implementations of the volume snapshot feature on different storages
  • the OS/libvirt/qemu/format may matter

IMHO, networking might be the most complex component of cloud platforms, but storage is the most important one. Users or customers can accept a few minutes of downtime, but data loss is totally unacceptable.

I totally understand that the community is very careful with this PR, due to the risk of potential data loss. As I remember, around 2 years ago there was a PR related to volume or VM migration which caused corrupted images (sorry, I do not remember clearly).

Many community members involved are experienced engineers; they have their own analysis of the risk. Actually, some community members are testing or going to test this PR.

@nvazquez
Contributor

Hi guys, sorry I've not been entirely involved in this PR, but I would like to give my opinion on the conversations from the last couple of days.
@JoaoJandre I believe there was good collaboration on this PR until around the time of merging, after which the technical concerns were raised. Perhaps there has been some miscommunication or different expectations around it, and in my opinion this PR could be seen as an example to prevent similar situations in the future. I think the intention to merge the PR could have been communicated with a few days' notice on the PR, so that, for example, @sureshanaparti @slavkap @rohityadavcloud could have planned to perform more tests before that. Just my 2 cents on the topic; other than that it looks like a great PR!

@JoaoJandre
Contributor Author

Hi @weizhouapache,

since you mentioned one of my PRs, I think I should explain

I selected the first big changes that had been recently merged and did not think about who the author was. My objective is not to point fingers at a single person (the PRs I mentioned all have different authors); I simply meant to show that what is happening here is not the norm in our community.

* [New feature: Dynamic and Static Routing #9470](https://github.com/apache/cloudstack/pull/9470) has been tested by @kiranchavala , you can find the test results at [New feature: Dynamic and Static Routing #9470 (review)](https://github.com/apache/cloudstack/pull/9470#pullrequestreview-2280212394)

* It is a new feature; it was not supported by any network provider at that time. Now it is supported by NSX. Support for another network provider is being developed.

I would argue that incremental snapshots are a new feature for KVM as well. Furthermore, the feature is disabled by default; therefore, if there is an issue or nobody wants to use it, it is a matter of not enabling the setting. Also, it only affects (when activated) file-based storage types such as NFS, local storage, and shared mount-point storage (e.g. iSCSI/FC with OCFS2, GlusterFS, or any other) with KVM.

* I do not think upgrade test is required for the feature.

Fair point. I also think that the upgrade tests done in this PR are more than enough.

I totally understand that the community is very careful with this PR, due to the risk of potential data loss. As I remember, around 2 years ago there was a PR related to volume or VM migration which caused corrupted images (sorry, I do not remember clearly).

I remember this, I was the one who eventually fixed the bug that the PR you are mentioning was trying to fix (see the last paragraph of the description on #8911).

Many community members involved are experienced engineers; they have their own analysis of the risk. Actually, some community members are testing or going to test this PR.

I am well aware of the work the community's experienced engineers have put in the community. I hope that they can validate the PR and report any issues, so I can fix them, as it has been done numerous times within the community.

@JoaoJandre
Contributor Author

Hi @nvazquez,

I think the intention to merge the PR could have been communicated with some days notice as well on the PR, so for example @sureshanaparti @slavkap @rohityadavcloud could have planned to perform more tests before that.

I would like to note that I gave notice saying I was going to merge: #9270 (comment). But the only reply I got was that "more tests would be good" (some of which were already done). If I had gotten something along the lines of "Hey, I'm testing this, will post in the next few days", I would have waited.

But simply stating "I need to perform more tests", taking into account the context of the PR, was not a valid concern in my opinion. As I said before, we can imagine an infinite number of tests that could be done on a platform as complex as ours, but we have to draw the line somewhere.

Just my 2 cents on the topic, other than that it looks like a great PR!

Thanks :)

@JoaoJandre
Contributor Author

@DaanHoogland, @nvazquez, @weizhouapache, @GutoVeronezi, @slavkap, @rohityadavcloud and @sureshanaparti

I think that what has happened is very clear to everyone, and further discussion will only waste everyone's time. I would like to state again that: If you find any bugs in the feature, please report them so I can patch them as soon as possible. I will go with the suggestion @DaanHoogland gave yesterday: "I think we should stop this discussion now, without any further remarks on this thread.".

dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jun 19, 2025
* KVM incremental snapshot feature

* fix log

* fix merge issues

* fix creation of folder

* fix snapshot update

* Check for hypervisor type during parent search

* fix some small bugs

* fix tests

* Address reviews

* do not remove storPool snapshots

* add support for downloading diff snaps

* Add multiple zones support

* make copied snapshots have normal names

* address reviews

* Fix in progress

* continue fix

* Fix bulk delete

* change log to trace

* Start fix on multiple secondary storages for a single zone

* Fix multiple secondary storages for a single zone

* Fix tests

* fix log

* remove bitmaps when deleting snapshots

* minor fixes

* update sql to new file

* Fix merge issues

* Create new snap chain when changing configuration

* add verification

* Fix snapshot operation selector

* fix bitmap removal

* fix chain on different storages

* address reviews

* fix small issue

* fix test

---------

Co-authored-by: João Jandre <[email protected]>

Development

Successfully merging this pull request may close these issues.

KVM Incremental Snapshots/Backups