KVM incremental snapshot feature #9270
Conversation
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #9270 +/- ##
===========================================
Coverage 16.40% 16.41%
- Complexity 13590 13622 +32
===========================================
Files 5692 5699 +7
Lines 501976 503131 +1155
Branches 60795 60940 +145
===========================================
+ Hits 82369 82568 +199
- Misses 410449 411387 +938
- Partials 9158 9176 +18
Good job @JoaoJandre
second that, tnx
@blueorangutan test rocky8 kvm-rocky8
@JoaoJandre nice one. Just one remark: the following sentence sounds contradictory to me: "Additionally, this functionality is only available when using file based storage, such as shared mount-point (iSCSI and FC)". If it supports iSCSI and FC (through a shared mountpoint), it does support block storage. I think the phrasing could cause some confusion as to which types of storage are supported.
Hey @alexandremattioli, I understand your confusion. However, when using shared mount-point, as far as ACS is concerned, the storage is file-based: we will not be working with blocks directly, only files (as ACS already does for shared mount point). The mentions in parentheses are there to give examples of underlying storages that might be behind the shared mount-point. I have updated the description to add a little more context.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
hm, seen this a couple of times now; the bot removes and adds the has-conflicts label in the same second, and a PR without conflicts ends up being marked as having them :(
@sureshanaparti The PR already meets our standard prerequisites for merging. There is an exponential number of tests that can be done on any PR, mixing and matching distros, versions and so on; eventually we have to draw the line on when to stop testing. We have already tested with multiple distros and multiple storage types, and we have tested upgrades from older versions. This PR only focuses on NFS/SharedMountPoint/LocalStorage storages; thus, other types of storages are not affected. If you do not have a specific concern regarding the PR, I will merge it. In the future, when a bug is reported we will fix it, as has been done for any other feature.
Hi @JoaoJandre & all concerned, I feel we have to check the behavior with
Merging based on approvals and testing done by @slavkap, @sureshanaparti, @winterhazel and @bernardodemarco. @sureshanaparti, if you think that this PR needs further testing, it may be done while the feature is in the main branch, until 4.21 is released, which is when we test everything again. If any issues arise, we can fix them before the release. Let me know if you find something, and I will work to patch it ASAP.
@JoaoJandre, I think both @sureshanaparti and @slavkap still want to do tests, which is fine by me.
Thanks for the work @JoaoJandre and all involved. FWIW, my only concern remains upgrade testing: when a user upgrades to a version with this feature, how will it behave for existing instances and their snapshots (also upgrading/transitioning between qcow2 v2 and v3)? Were any upgrade-related tests done for the storages supported by this feature (i.e. NFS, shared mount point and local storage) and with the supported KVM distros (Ubuntu 22.04, 24.04; EL 8 & 9; openSUSE 15)?
I also want to thank the people involved with this PR, but I also want to note that the way it was merged was not correct. I expected:
Hello @rohityadavcloud, this feature is disabled by default, so upgrading should not pose any issues to users. This was also tested here: #9270 (comment). The feature was tested with multiple distros, and I personally tested it with local, NFS, and shared mount point storage systems.
Hi @slavkap
To be honest, it is very sad how the community handles some PRs. We have lots and lots of PRs that affected a lot more code without the same level of documentation, clarity, attention, and tests as this one; to see this kind of pushback now is disappointing. If we want to raise the bar on PRs, let's apply the same standard to all PRs, not only to a handpicked one.
Sorry, I disagree @JoaoJandre. We must be able to cherry-pick and give more scrutiny to PRs that we consider more risky. I understand your disappointment, as I see this PR has seen long periods of silence, and also that you have had other people do extensive testing on your change. I also don't think you were very wrong in merging it, but there are people that would have wanted to test it more before merging. On the one hand they did say that, and on the other hand they were not as explicit as they could have been. I do not think much harm is done either way. Let's all just continue and make sure v21 is going to be tested extensively, including this functionality.
@JoaoJandre, I’m a bit concerned by the tone you’re setting because a community member is expressing their point of view. I’ve been in the community for almost six years, and this is the first time I've seen such a tone in a discussion.
I don't think that lazy consensus applies in this case. If this were correct, then all PRs could be merged after 72h if there aren't any objections? You can check the last 2 paragraphs on how patches are applied - link
Requesting that everything gets reviewed within 72 hours does not reflect the reality that most people working on this project are doing it in their free time; forcing a response time on them is not going to be productive.
Two people were from your company, and most of their scenarios are handled by the smoke tests (which are for NFS storage only); the other two people were still testing this.
Most of the users are using those 3rd party plugins. And no, I do not expect you to test this with 3rd party plugins, but to ping the people who are involved with those 3rd party plugins just to be sure that you’ll not break the functionality for them. There was still a testing process on your PR, meaning there is activity on it and it won’t be stuck forever. You just needed to have some patience …
From a cursory glance, that doesn't seem to be the case.
This is the reason I'm raising this - the standard should be applied to all. Your help in this will be appreciated :) |
@slavkap, please revisit the Apache Way and Individuals compose the ASF; your statement and criteria go against our community principles.
@GutoVeronezi , I don’ t see how @slavkap is going against the Apache way. I think we should stop this discussion now, without any further remarks on this thread. |
@DaanHoogland, from the links:
Hello @DaanHoogland, I understand the point that a PR that changes a single line in the UI needs less scrutiny than one that changes 3k lines and 75 files. However, my point is that many PRs with the same impact as mine (or even more) are not scrutinized as much. Thus, I'm left wondering what this concern is really about: the code, or something else. To give you a few examples:
Again, it seems like the concern is not based on technical aspects.
I know that we do not use lazy consensus to merge PRs. However, I was not referring to using lazy consensus to merge (we already had the prerequisites for merging); I was referring to waiting for your reply regarding the problems you originally reported, which I reproduced and fixed. I did not expect the community to validate the PR within 72 hours, and it was not done within this timeframe. The PR was open for 10 months. The last test you performed was 2 months ago, and after I fixed the concerns, I did not hear a reply. I think it was reasonable to assume that you were OK with it and were not going to test anymore.
We are a community; inside the Apache Software Foundation, we should conduct ourselves as individuals, not as part of a company.
Looking at the PR's history, we can see that I have been waiting months at a time for replies. Again, if you find an issue, please report it so I can fix it as soon as possible.
Please see my reply to Daan; I gave some examples of which PRs I was talking about.
@JoaoJandre, I understand your frustration, and you are free to scrutinize others' PRs to the same level. Also, I do not think you were wrong in merging it! That does not mean that others cannot hold a different opinion, though. I am pretty sure (as far as I have heard) that the concerns were purely technical.
since you mentioned one of my PRs, I think I should explain
In my opinion, I think this PR is not comparable with my PR
IMHO, networking might be the most complex component of cloud platforms, but storage is the most important one. Users or customers can accept a few minutes of downtime, but data loss is totally unacceptable. I totally understand that the community is very careful with this PR, due to the risk of potential data loss. As I remember, around 2 years ago there was a PR related to volume or VM migration which led to corrupted images (sorry, I do not remember clearly). Many community members involved are experienced engineers; they have their own analysis of the risk. Actually, some community members are testing or going to test this PR.
Hi guys, sorry I've not been entirely involved in this PR, but I would like to give my opinion based on the conversations from the last couple of days.
Hi @weizhouapache,
I selected the first big changes that had been recently merged and did not think about who the author was. My objective is not to point fingers at a single person (the PRs I mentioned all have different authors); I simply meant to show that what is happening here is not the norm in our community.
I would argue that incremental snapshots are a new feature for KVM as well. Furthermore, the feature is disabled by default; therefore, if there is an issue or nobody wants to use it, it is a matter of not activating the setting. Also, it only affects (when activated) file-based storage types such as NFS, local storage, and shared mount point storage (e.g. iSCSI/FC with OCFS2/Gluster/or any other) with KVM.
Fair point. I also think that the upgrade tests done in this PR are more than enough.
I remember this, I was the one who eventually fixed the bug that the PR you are mentioning was trying to fix (see the last paragraph of the description on #8911).
I am well aware of the work the community's experienced engineers have put into the community. I hope that they can validate the PR and report any issues, so I can fix them, as has been done numerous times within the community.
Hi @nvazquez,
I would like to note that I gave notice saying I was going to merge: #9270 (comment). But the only reply I got was that "more tests would be good" (some of which were already done). If I got something along the lines of "Hey, I'm testing this, will post in the next days", I would have waited. But simply stating "I need to perform more tests", taking into account the context of the PR, was not a valid concern in my opinion. As I said before, we can imagine an infinite amount of tests that can be done in a platform as complex as ours, but we have to draw the line somewhere.
Thanks :)
@DaanHoogland, @nvazquez, @weizhouapache, @GutoVeronezi, @slavkap, @rohityadavcloud and @sureshanaparti, I think that what has happened is very clear to everyone, and further discussion will only waste everyone's time. I would like to state again: if you find any bugs in the feature, please report them so I can patch them as soon as possible. I will go with the suggestion @DaanHoogland gave yesterday: "I think we should stop this discussion now, without any further remarks on this thread."
* KVM incremental snapshot feature
* fix log
* fix merge issues
* fix creation of folder
* fix snapshot update
* Check for hypervisor type during parent search
* fix some small bugs
* fix tests
* Address reviews
* do not remove storPool snapshots
* add support for downloading diff snaps
* Add multiple zones support
* make copied snapshots have normal names
* address reviews
* Fix in progress
* continue fix
* Fix bulk delete
* change log to trace
* Start fix on multiple secondary storages for a single zone
* Fix multiple secondary storages for a single zone
* Fix tests
* fix log
* remove bitmaps when deleting snapshots
* minor fixes
* update sql to new file
* Fix merge issues
* Create new snap chain when changing configuration
* add verification
* Fix snapshot operation selector
* fix bitmap removal
* fix chain on different storages
* address reviews
* fix small issue
* fix test

---------
Co-authored-by: João Jandre <[email protected]>


Description
This PR solves issue #8907.
Currently, when taking a volume snapshot/backup with KVM as the hypervisor, it is always a full snapshot/backup. However, always taking full snapshots of volumes is costly for both the storage network and the storage systems. To address this, this PR extends the volume snapshot feature on KVM, allowing users to create incremental volume snapshots when using KVM as the hypervisor.
To give operators control over which type of snapshot is created, a new global setting `kvm.incremental.snapshot` has been added, which can be changed at the zone and cluster scopes; this setting is false by default. Also, the `snapshot.delta.max` configuration, previously used to control the maximum number of deltas when using XenServer, was extended to also limit the size of the backing chain of snapshots on primary/secondary storage.

This functionality is only available in environments with Libvirt 7.6.0+ and QEMU 6.1+. If the `kvm.incremental.snapshot` setting is true and the hosts do not have the required Libvirt and QEMU versions, an error will be thrown when trying to take a snapshot. Additionally, this functionality is only available when using file-based storage, such as shared mount-point (iSCSI and FC that require a shared mount-point storage file system for KVM, such as OCFS2 or GlusterFS), NFS, and local storage. Other storage types for KVM, such as CLVM and RBD, need different approaches to enable incremental backups; therefore, these are not currently supported. Issue #8907 has more details and flowcharts of all the mapped workflows.
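As a rough illustration of the gating described above (not the PR's actual Java implementation; all names and the function shape are hypothetical), the decision between a full and an incremental snapshot could be sketched as:

```python
# Hypothetical sketch of the decision flow described in the text above.
# Version thresholds come from the PR description; everything else is illustrative.

MIN_LIBVIRT = (7, 6, 0)   # incremental snapshots require Libvirt 7.6.0+
MIN_QEMU = (6, 1, 0)      # ...and QEMU 6.1+

def supports_incremental(libvirt_version, qemu_version):
    """True if the host meets the minimum Libvirt/QEMU versions."""
    return libvirt_version >= MIN_LIBVIRT and qemu_version >= MIN_QEMU

def next_snapshot_type(incremental_enabled, chain_length, delta_max,
                       libvirt_version, qemu_version):
    """Decide whether the next snapshot is 'full' or 'incremental'.

    chain_length: number of deltas already in the volume's backing chain.
    delta_max: the snapshot.delta.max setting, capping the chain size.
    """
    if not incremental_enabled:
        return "full"
    if not supports_incremental(libvirt_version, qemu_version):
        # Per the description, an error is thrown rather than silently
        # falling back to a full snapshot.
        raise RuntimeError("kvm.incremental.snapshot is enabled but the host "
                           "lacks Libvirt 7.6.0+/QEMU 6.1+")
    # Once the backing chain reaches the configured maximum, start a new
    # chain with a full snapshot; otherwise take an incremental one.
    return "full" if chain_length >= delta_max else "incremental"

print(next_snapshot_type(True, 1, 3, (8, 0, 0), (6, 2, 0)))  # incremental
print(next_snapshot_type(True, 3, 3, (8, 0, 0), (6, 2, 0)))  # full
```

With `snapshot.delta.max = 3`, for example, every fourth snapshot would start a fresh chain, keeping restore chains bounded.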
Docs PR: apache/cloudstack-documentation#423 / apache/cloudstack-documentation#488
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Description of tests
During testing, the `kvm.incremental.snapshot` setting was changed to `true` and the `snapshot.delta.max` setting was changed to 3.

Tests with snapshot.backup.to.secondary = false
For the tests in this section, a test VM was created and reused for all tests.
Snapshot creation tests
Snapshot restore tests
Snapshot removal tests
Template creation test
Tests with snapshot.backup.to.secondary = true
All tests performed in the previous sections were repeated with snapshot.backup.to.secondary = true; in addition, two further tests were performed. For the tests in this section, a test VM was created and reused for all tests.
Snapshot creation tests
I have also tested that the bitmaps are removed once the snapshots are deleted.
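For context on that last check: incremental snapshots rely on a dirty bitmap kept alongside each snapshot, so deleting a snapshot must also remove its bitmap or orphaned bitmaps would accumulate on the base image. A toy model of that cleanup invariant (hypothetical structure, not CloudStack's or QEMU's actual bookkeeping):

```python
# Toy model of the cleanup invariant: every snapshot has an associated
# dirty bitmap, and deleting the snapshot must also delete the bitmap.
# Names and structure are illustrative only.

class Volume:
    def __init__(self):
        self.snapshots = set()
        self.bitmaps = set()   # stands in for per-snapshot dirty bitmaps

    def take_snapshot(self, name):
        self.snapshots.add(name)
        self.bitmaps.add(name)  # a bitmap tracks changes since this snapshot

    def delete_snapshot(self, name):
        self.snapshots.discard(name)
        self.bitmaps.discard(name)  # cleanup: no orphaned bitmaps remain

vol = Volume()
for s in ("snap1", "snap2", "snap3"):
    vol.take_snapshot(s)
vol.delete_snapshot("snap2")
assert "snap2" not in vol.bitmaps  # bitmap removed with its snapshot
```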