
Conversation

@GutoVeronezi (Contributor) commented Aug 9, 2021

Description

This PR solves issue #5124.

The main focus is to change how volume snapshots are taken in KVM. However, changing the take-snapshot workflow in KVM impacts other snapshot workflows as well. The impacted workflows we mapped are:

  • Take volume snapshot with running vm
  • Take volume snapshot with stopped vm
  • Recurring volume snapshot
  • Create template from snapshot
  • Create volume from snapshot
  • Migrate virtual machine with volumes
  • Migrate volumes
  • Revert volume snapshot
  • Delete volume snapshot

Issue #5124 has more details and flowcharts of all the mapped workflows.

The focus of this PR is to refactor KVM snapshots; other hypervisors' workflows should not be affected.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

How Has This Been Tested?

I added unit tests to several methods and tested the workflows:

  1. Take volume snapshot with running vm
  2. Recurring volume snapshot
  3. Take volume snapshot with stopped vm
  4. Create template from snapshot
  5. Create volume from snapshot
  6. Migrate virtual machine with volumes
  7. Migrate volumes
  8. Revert volume snapshot
  9. Delete volume snapshot

These workflows were tested twice: once with the global setting snapshot.backup.to.secondary set to true and once set to false.

To test workflows 1, 2, and 3, I created a VM called Test1, then I opened its console and created a file, which I called testfile. At first, I wrote gold to this file and took a snapshot that I called gold (this tests workflow 1). Then I wrote silver to the file and configured a recurring snapshot. I waited for the recurring snapshot and, when it was taken, I changed its name in the database to silver, to facilitate the steps (this tests workflow 2). Finally, I wrote copper to the file, stopped the VM, and took a snapshot (this tests workflow 3). After that, I destroyed VM Test1.

To test workflow 4, I created a template from snapshot silver and called it template silver. From it, I created a VM called Test2, then I opened its console and retrieved the content of testfile with the command cat. It returned silver, which is the same value the file had when I took the snapshot in workflow 2. After that, I stopped VM Test2.

To test workflow 5, I created a volume from snapshot gold and called it volume gold, then I changed the root volume of VM Test2 to volume gold. I started VM Test2 and retrieved the content of testfile with the command cat. It returned gold, which is the same value the file had when I took the snapshot in workflow 1.

To test workflows 6 and 7, I wrote platinum to the file testfile in VM Test2, then took a snapshot and called it platinum. I migrated the VM and the volume to another host and primary storage with the API migrateVirtualMachineWithVolume. When the global setting snapshot.backup.to.secondary was true, it migrated successfully. When it was false, an exception was thrown, warning the user about snapshots on primary storage (this tests workflow 6). I then stopped the VM and migrated only the volume to another primary storage. When the global setting snapshot.backup.to.secondary was true, it migrated successfully. When it was false, an exception was thrown, warning the user about snapshots on primary storage (this tests workflow 7).

To test workflow 8, I started VM Test2 and wrote diamond to the file testfile. Then I stopped VM Test2 and reverted the volume to the snapshot platinum. I opened its console and retrieved the content of testfile with the command cat. It returned platinum, which is the same value the file had when I took the snapshot before workflow 6.

To test workflow 9, I deleted all the snapshots.

@RodrigoDLopez (Contributor) left a comment:


Nice enhancement @GutoVeronezi
I would like it if you could explain your tests more clearly. I think just organizing your text will help a lot.
The code LGTM.

@nvazquez (Contributor) commented:

Thanks @GutoVeronezi I'll post my review this week

* 2nd parameter: disk's label (target.dev tag from VM's XML);<br>
* 3rd parameter: the absolute path of the base file;
*/
private static final String COMMAND_MERGE_SNAPSHOT = "virsh blockcommit %s %s --base %s --active --wait --delete --pivot";
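For illustration, this is how the format string expands into a concrete command; the domain name, disk label, and base path below are hypothetical values, and the first parameter is presumably the VM's libvirt domain name:

    // Hypothetical values, only to show how the placeholders are filled.
    String domainName = "i-2-10-VM";                          // VM's libvirt domain name
    String diskLabel = "vda";                                  // target.dev tag from the VM's XML
    String baseFilePath = "/mnt/pool/volume-base.qcow2";       // absolute path of the base file
    String command = String.format(COMMAND_MERGE_SNAPSHOT, domainName, diskLabel, baseFilePath);
    // command == "virsh blockcommit i-2-10-VM vda --base /mnt/pool/volume-base.qcow2 --active --wait --delete --pivot"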
Contributor commented:

@GutoVeronezi, one suggestion: the latest Libvirt Java API bindings support qemu-monitor-command, and maybe the command block-commit could replace this.
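For context, a rough sketch of what issuing block-commit through the QEMU monitor could look like with the libvirt Java bindings. The method name and signature (Domain.qemuMonitorCommand(String, int), mirroring virDomainQemuMonitorCommand), the drive name, and the base path are assumptions and should be checked against the binding version in use:

    import org.libvirt.Connect;
    import org.libvirt.Domain;
    import org.libvirt.LibvirtException;

    // Sketch only: assumes the bindings expose Domain.qemuMonitorCommand(String, int).
    public static String blockCommitViaMonitor(Connect conn, String vmName) throws LibvirtException {
        Domain domain = conn.domainLookupByName(vmName);
        // QMP block-commit; "drive-virtio-disk0" and the base path are hypothetical.
        String qmp = "{\"execute\": \"block-commit\", "
                + "\"arguments\": {\"device\": \"drive-virtio-disk0\", "
                + "\"base\": \"/mnt/pool/volume-base.qcow2\"}}";
        return domain.qemuMonitorCommand(qmp, 0); // 0 = default (QMP) mode
    }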

Contributor commented:

I spoke offline with @GutoVeronezi, and the flag --delete does not work with the version of libvirt that I have

Using library: libvirt 5.0.0
Using API: QEMU 5.0.0
Running hypervisor: QEMU 2.12.0

@GutoVeronezi (Author) replied:

Thanks @slavkap, I will make some changes soon.

@GutoVeronezi (Author) commented Aug 13, 2021

@RodrigoDLopez thanks for the review. I just updated the test descriptions. If there is anything to improve, let me know.

GutoVeronezi added 2 commits August 16, 2021 10:18
@GutoVeronezi (Author) commented Aug 16, 2021

Regarding @slavkap's #5297 (comment): the blockcommit feature was introduced in Libvirt version 1.2.9, according to its docs. However, some flags were introduced only as a mark for future work. The --delete flag was only actually implemented in version 6.0.0, through the commit "qemu: block: enable the snapshot image deletion feature".
Therefore, for cases in which operators cannot upgrade Libvirt to a version that supports this flag (as on CentOS 7), we created a workaround that manually deletes the unused snapshot file. As soon as everybody is using a version equal to or higher than 6.0.0, we can remove this workaround.
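For readers following along, a minimal sketch of the workaround idea when Libvirt is older than 6.0.0. This is illustrative, not necessarily the exact code in this PR; the method and parameter names are hypothetical, while Script.runSimpleBashScript is CloudStack's existing shell helper:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import com.cloud.utils.script.Script;

    // Sketch: merge the snapshot without --delete (unsupported before Libvirt 6.0.0),
    // then remove the now-unused snapshot file manually.
    public static void mergeSnapshotWithoutDeleteFlag(String vmName, String diskLabel,
            String baseFilePath, String snapshotFilePath) throws IOException {
        String mergeCommand = String.format("virsh blockcommit %s %s --base %s --active --wait --pivot",
                vmName, diskLabel, baseFilePath);
        Script.runSimpleBashScript(mergeCommand);

        // Libvirt keeps the delta file after the pivot; delete it ourselves.
        Files.deleteIfExists(Paths.get(snapshotFilePath));
    }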

storagePoolMgr.getStoragePoolByURI(snapshotImageStore.getUrl());
}

if (primaryPool.getType() == StoragePoolType.CLVM) {
Contributor commented:

Does it need to be part of storagePoolTypesThatSupportRevertSnapshot?

@GutoVeronezi (Author) replied:

@nvazquez reverting snapshots that existed only in primary storage was not working correctly for KVM. The RBD validation was added in PR #3502 to allow Ceph to revert snapshots that existed only in primary storage; however, the rest still did not work. This PR fixes this error for some other storage pool types (which were added to the validation); however, CLVM was not our focus, therefore we tried not to touch it.
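For clarity, the general shape of such a validation might look like the sketch below. The exact set of pool types added in this PR may differ, and the method name is illustrative:

    import java.util.Arrays;
    import java.util.List;

    import com.cloud.storage.Storage.StoragePoolType;
    import com.cloud.utils.exception.CloudRuntimeException;

    // Illustrative only: the actual set of types in the PR may differ.
    private static final List<StoragePoolType> storagePoolTypesThatSupportRevertSnapshot = Arrays.asList(
            StoragePoolType.Filesystem, StoragePoolType.NetworkFilesystem,
            StoragePoolType.SharedMountPoint, StoragePoolType.RBD);

    // Before reverting a snapshot kept only on primary storage, check the pool type.
    private static void validateRevertSupport(StoragePoolType poolType) {
        if (!storagePoolTypesThatSupportRevertSnapshot.contains(poolType)) {
            throw new CloudRuntimeException("Reverting snapshots kept only on primary storage is not supported for pool type " + poolType);
        }
    }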

@rohityadavcloud (Member) commented:

@blueorangutan package

@nvazquez (Contributor) left a comment:

Code LGTM - have not tested it

@nvazquez (Contributor) commented Apr 8, 2022

@blueorangutan package

@blueorangutan

@nvazquez a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@DaanHoogland (Contributor) commented:

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖️ el7 ✖️ el8 ✖️ debian ✔️ suse15. SL-JID 3112

@GutoVeronezi (Author) commented:

@blueorangutan package

@blueorangutan

@GutoVeronezi a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 3118

@nvazquez (Contributor) commented:

@blueorangutan test

@blueorangutan

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3853)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32200 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5297-t3853-kvm-centos7.zip
Smoke tests completed. 93 look OK, 0 have errors
Only failed test results are shown below:

Test | Result | Time (s) | Test File

@nvazquez (Contributor) commented:

Merging based on approvals and test results

@nvazquez merged commit 39fad2d into apache:main on Apr 12, 2022
@GutoVeronezi mentioned this pull request on Aug 10, 2022
rohityadavcloud pushed a commit that referenced this pull request Oct 13, 2022
ACS + XenServer works with differential snapshots. ACS takes a full volume snapshot and the next ones are referenced as children of the previous snapshot until the chain reaches the limit defined in the global setting snapshot.delta.max; then, a new full snapshot is taken. PR #5297 introduced disk-only snapshots for KVM volumes. Among the changes, the delete process was also refactored.

Before the changes, when one removed a snapshot with children, ACS marked it as Destroyed and kept the Image entry on the table cloud.snapshot_store_ref as Ready. When ACS rotated the snapshots (the max delta was reached) and all the children were already marked as removed, ACS would then remove the whole hierarchy, completing the differential snapshot cycle. After the changes, snapshots with children stopped being marked as removed and the differential snapshot cycle was no longer completed.

This PR intends to honor the differential snapshot cycle for XenServer again, making snapshots be marked as removed when deleted while having children and following the differential snapshot cycle.

Also, when one takes a volume snapshot and ACS backs it up to the secondary storage, ACS inserts 2 entries in the table cloud.snapshot_store_ref (Primary and Image). When one deletes a volume snapshot, ACS first tries to remove the snapshot from the secondary storage and mark the Image entry as removed; then, it tries to remove the snapshot from the primary storage and mark the Primary entry as removed. If ACS cannot remove the snapshot from the primary storage, it will keep the snapshot as BackedUp, even though it no longer exists in the secondary storage and no SNAPSHOT.DELETE entry was created in cloud.usage_event. In the end, after the garbage collector flow, the snapshot will be marked as BackedUp, with a value in the field removed, and will still be rated. This PR also addresses the correction for this situation.

Co-authored-by: GutoVeronezi <[email protected]>
rohityadavcloud added a commit to shapeblue/cloudstack that referenced this pull request Oct 10, 2023
…allows

This removes the conditional logic where a comment notes to remove it
after PR apache#5297 is merged, which is applicable for ACS 4.18+. Only when the
global setting is enabled and memory isn't selected can a VM snapshot be
allowed for VMs on KVM that have qemu-guest-agent running.

Signed-off-by: Rohit Yadav <[email protected]>
rohityadavcloud added a commit that referenced this pull request Oct 11, 2023
…g allows (#8062)

This removes the conditional logic where a comment notes to remove it
after PR #5297 is merged, which is applicable for ACS 4.18+. Only when the
global setting is enabled and memory isn't selected can a VM snapshot be
allowed for VMs on KVM that have qemu-guest-agent running.

Signed-off-by: Rohit Yadav <[email protected]>
rp- pushed a commit to LINBIT/cloudstack that referenced this pull request Oct 13, 2023
…g allows (apache#8062)

This removes the conditional logic where a comment notes to remove it
after PR apache#5297 is merged, which is applicable for ACS 4.18+. Only when the
global setting is enabled and memory isn't selected can a VM snapshot be
allowed for VMs on KVM that have qemu-guest-agent running.

Signed-off-by: Rohit Yadav <[email protected]>
shwstppr pushed a commit to shapeblue/cloudstack that referenced this pull request Nov 3, 2023
…g allows (apache#8062)

This removes the conditional logic where a comment notes to remove it
after PR apache#5297 is merged, which is applicable for ACS 4.18+. Only when the
global setting is enabled and memory isn't selected can a VM snapshot be
allowed for VMs on KVM that have qemu-guest-agent running.

Signed-off-by: Rohit Yadav <[email protected]>
(cherry picked from commit 8350ce5)
Signed-off-by: Rohit Yadav <[email protected]>
rohityadavcloud added a commit to shapeblue/cloudstack that referenced this pull request Mar 31, 2024
…g allows (apache#8062)

This removes the conditional logic where a comment notes to remove it
after PR apache#5297 is merged, which is applicable for ACS 4.18+. Only when the
global setting is enabled and memory isn't selected can a VM snapshot be
allowed for VMs on KVM that have qemu-guest-agent running.

Signed-off-by: Rohit Yadav <[email protected]>
(cherry picked from commit 8350ce5)
Signed-off-by: Rohit Yadav <[email protected]>

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

KVM disk-only based snapshot of volumes instead of taking VM's full snapshot and extracting disks

10 participants