Skip to content

Conversation

@nvazquez
Copy link
Contributor

Description

This PR prevents a corner case for the storage cleanup task observed in Vmware in which:

  • There are storage pool migrations for the ROOT disk of a VM
  • After the migrations, the volumes table creates new volume records, marking the ones to be deleted with state = 'Destroy'
  • The storage cleanup task will attempt to expunge the volumes in 'Destroy' state, but if there are underlying storage issues then this task is not completed
  • Once the storage issues are solved and the current volume is on the same storage pool as one of the volumes to be removed then the VM can be completely removed

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

@apache apache deleted a comment from blueorangutan Oct 13, 2021
@nvazquez
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

destVol.setUuid(uuid);
destVol.setInstanceId(instanceId);
// Prevent storage cleanup corner case of ROOT volume deletion
srcVol.setVolumeType(Type.DATADISK);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could that cause any regression, does this save the VO in DB?

Copy link
Contributor Author

@nvazquez nvazquez Oct 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently testing for regressions. This saves the removed volume after storage migrations in DB as datadisk. @weizhouapache and I are considering the approach taken on restore VM as well: https://github.com/apache/cloudstack/blob/main/server/src/main/java/com/cloud/vm/UserVmManagerImpl.java#L7407-L7409

@nvazquez
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1563

@nvazquez
Copy link
Contributor Author

@blueorangutan test matrix

@blueorangutan
Copy link

@nvazquez a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-2408)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32285 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5575-t2408-xenserver-71.zip
Smoke tests completed. 91 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

Trillian test result (tid-2409)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32260 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5575-t2409-kvm-centos7.zip
Smoke tests completed. 91 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

Trillian test result (tid-2410)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34629 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5575-t2410-vmware-65u2.zip
Smoke tests completed. 90 look OK, 1 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_create_pvlan_network Error 0.03 test_pvlan.py

@rohityadavcloud rohityadavcloud added this to the 4.16.0.0 milestone Oct 14, 2021
@weizhouapache weizhouapache reopened this Oct 14, 2021
Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@nvazquez
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the vmware storage processor addresses this in a harmfull way;
"to destroy a volume; get the vm and destroy that"
I don't think we can solve it there now, but it should be addressed there as well
it is formally right to detach the volume when it should no longer be associated with the VM so this CLGTM

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1568

@nvazquez
Copy link
Contributor Author

@blueorangutan test matrix

@blueorangutan
Copy link

@nvazquez a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved again

@vladimirpetrov
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@vladimirpetrov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@nvazquez nvazquez marked this pull request as ready for review October 15, 2021 11:02
@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✖️ debian ✔️ suse15. SL-JID 1572

@nvazquez
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1575

@rohityadavcloud
Copy link
Member

@blueorangutan test centos7 vmware-67u3

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + vmware-67u3) has been kicked to run smoke tests

Copy link
Contributor

@vladimirpetrov vladimirpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on manual testing. Steps used:

(1) Create a VMWare 6.7u3 env with 2 pools (pool1, pool2).
(2) Create L2 network, and a VM. The created volume is on pool2.
(3) Put pool2 in maintenance mode, the VM is stopped.
(4) Start the VM. Its volume is copied to pool1.
(5) There are two records in the database with the same name ‘ROOT-xx’. Update the 'Expunged' volume like this:

MariaDB [cloud]> update volumes set removed=null,state='Destroy' where id=x;

(6) Change the following global settings, and restart the management server:

storage.cleanup.interval -> to 60 (seconds)
storage.cleanup.delay -> to 60 (seconds)

systemctl restart cloudstack-management

Result: Only the inactive volume is removed (expunged) and our VM with the active volume is still running fine.

2021-10-15 15:13:10,160 INFO [c.c.s.r.VmwareStorageProcessor] (DirectAgent-1:ctx-5a08de4b 10.0.35.14, cmd: DeleteCommand) (logid:5cbc532f) Executing resource DeleteCommand: {"data":{"org.apache.cloudstack.storage.to.VolumeObjectTO":{"volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"43c3140e-8caf-38b8-9540-238fa5409707","name":"ref-trl-2009-v-M7-vladimir-petrov-esxi-pri2","id":2,"poolType":"NetworkFilesystem","host":"10.0.32.4","path":"/acs/primary/ref-trl-2009-v-M7-vladimir-petrov/ref-trl-2009-v-M7-vladimir-petrov-esxi-pri2","port":2049,"url":"NetworkFilesystem://10.0.32.4/acs/primary/ref-trl-2009-v-M7-vladimir-petrov/ref-trl-2009-v-M7-vladimir-petrov-esxi-pri2/?ROLE=Primary&STOREUUID=43c3140e-8caf-38b8-9540-238fa5409707","isManaged":false}},"name":"ROOT-6","size":2147483648,"path":"ROOT-6","volumeId":9,"accountId":2,"chainInfo":"{"diskDeviceBusName":"ide0:1","diskChain":["[43c3140e8caf38b89540238fa5409707] i-2-6-VM/ROOT-6.vmdk","[43c3140e8caf38b89540238fa5409707] e765b12f796432fc80cd4091e8c59f32/e765b12f796432fc80cd4091e8c59f32.vmdk"]}","format":"OVA","provisioningType":"THIN","poolId":2,"id":9,"hypervisorType":"VMware","directDownload":false,"deployAsIs":false}},"wait":0,"bypassHostMaintenance":false}

@blueorangutan
Copy link

Trillian test result (tid-2422)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37214 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5575-t2422-vmware-67u3.zip
Smoke tests completed. 90 look OK, 1 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_04_autoscale_kubernetes_cluster Failure 35.96 test_kubernetes_clusters.py

@nvazquez nvazquez merged commit a5372a9 into apache:main Oct 16, 2021
@nvazquez nvazquez deleted the fixrootvoldeletion branch February 12, 2022 02:38
shwstppr pushed a commit to shapeblue/cloudstack that referenced this pull request Mar 17, 2023
* Fix storage cleanup corner case

* Improve deletion

* Refactor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants