
Conversation

@GabrielBrascher
Member

Description

This PR fixes issue #5407. In summary, volumes that should not be migrated are still being mapped in VirtualMachineManagerImpl.createStoragePoolMappingsForVolumes.

To give some context: the method createStoragePoolMappingsForVolumes is used only when migrating a VM with its volume(s), that is, via the API call migrateVirtualMachineWithVolume.

In such cases, there are two options:
A. The user does not provide volumes, just the VM ID and the host ID.
B. The user provides the VM ID, the host ID, and a volume map, e.g. migrateto[0].volume=71f43cd6-69b0-4d3b-9fbc-67f50963d60b&migrateto[0].pool=a382f181-3d2b-4413-b92d-b8931befa7e1&migrateto[1].volume=88de0173-55c0-4c1c-a269-83d0279eeedf&migrateto[1].pool=95d6e97c-6766-4d67-9a30-c449c15011d1.

However, regardless of case A or B, createStoragePoolMappingsForVolumes iterates over all of the VM's volumes and adds them to the migration map. As a result, migrating a VM with a data disk attached throws exceptions whenever the (shared) data disk should not be migrated.

From my point of view, it does not make sense to add volumes that the user has not mapped for migration, unless (I) the VM is being migrated to a different cluster or (II) the volume is on local (host-scope) storage. Both of these cases are already covered by the if statement preceding the line removed in this PR.

if (ScopeType.HOST.equals(currentPool.getScope()) || isStorageCrossClusterMigration(plan.getClusterId(), currentPool)) {
    createVolumeToStoragePoolMappingIfPossible(profile, plan, volumeToPoolObjectMap, volume, currentPool);
}
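
To make the intended behavior concrete, here is a minimal, self-contained Java sketch of the decision above. The Volume/StoragePool types and the method name are simplified stand-ins for illustration, not the actual CloudStack classes:

import java.util.HashMap;
import java.util.Map;

// Illustrative model of which volumes get a storage pool mapping when the user provided none.
public class VolumeMappingSketch {

    enum Scope { HOST, CLUSTER, ZONE }

    record Volume(String name) {}
    record StoragePool(String name, Scope scope) {}

    static Map<Volume, StoragePool> mapUnmappedVolumes(Map<Volume, StoragePool> vmVolumesToCurrentPool,
                                                       boolean crossClusterMigration) {
        Map<Volume, StoragePool> volumeToPoolObjectMap = new HashMap<>();
        for (Map.Entry<Volume, StoragePool> entry : vmVolumesToCurrentPool.entrySet()) {
            Volume volume = entry.getKey();
            StoragePool currentPool = entry.getValue();
            if (currentPool.scope() == Scope.HOST || crossClusterMigration) {
                // Local (host-scope) storage or cross-cluster migration: the volume must move.
                // The real code delegates to createVolumeToStoragePoolMappingIfPossible to pick
                // a suitable destination pool; this sketch just records the volume.
                volumeToPoolObjectMap.put(volume, currentPool);
            }
            // The line removed by this PR was an unconditional fallback that also put every
            // remaining (shared, same-cluster) volume into the map, forcing data disks to be
            // migrated even when the user did not map them.
        }
        return volumeToPoolObjectMap;
    }

    public static void main(String[] args) {
        Map<Volume, StoragePool> volumes = new HashMap<>();
        volumes.put(new Volume("ROOT-1"), new StoragePool("local-1", Scope.HOST));
        volumes.put(new Volume("DATA-1"), new StoragePool("nfs-1", Scope.CLUSTER));
        // Same-cluster migration: only the local ROOT volume is selected for migration.
        System.out.println(mapUnmappedVolumes(volumes, false));
    }
}

With the fallback gone, shared same-cluster volumes stay where they are unless the user explicitly maps them in the API call.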

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

Migrated a VM with its root disk on local storage.

Tests were done via the UI as well as the API, considering both cases:
A. The user does not provide volumes, just the VM ID and the host ID;
B. The user provides the VM ID, the host ID, and a volume map, e.g. migrateto[0].volume=71f43cd6-69b0-4d3b-9fbc-67f50963d60b&migrateto[0].pool=a382f181-3d2b-4413-b92d-b8931befa7e1&migrateto[1].volume=88de0173-55c0-4c1c-a269-83d0279eeedf&migrateto[1].pool=95d6e97c-6766-4d67-9a30-c449c15011d1.

In both cases, the volume mapping matched the expected behavior.

VMs that previously failed to migrate were migrated successfully, and only the intended volume was migrated; data disks remained attached to the VM, on their original storage pool, and in the "Ready" state.

@GabrielBrascher GabrielBrascher added this to the 4.16.0.0 milestone Sep 6, 2021
@GabrielBrascher GabrielBrascher self-assigned this Sep 6, 2021
@GabrielBrascher
Member Author

This has been tested with KVM.

Pinging some of the experts in managed storage (@mike-tutkowski @slavkap) to check whether there are any cases I am missing where such volume mapping is correct.

Also, I would appreciate feedback regarding different hypervisors, such as VMware and Xen (@rhtyd @shwstppr @nvazquez @DaanHoogland).

Contributor

@nvazquez nvazquez left a comment

Comment on lines 2935 to 2951
} else {
    volumeToPoolObjectMap.put(volume, currentPool);
Contributor

@shwstppr shwstppr Sep 7, 2021

@GabrielBrascher as you have seen, allVolumes is a confusing name here. It contains only the volumes that are not mapped via the API. I've not tested it, but VMware might expect mappings for each of the VM's volumes.
I feel that instead of preventing the mapping from being added here, we should look into the code that does the migration and skip volumes that are already on the destination pool. For KVM this could be:
https://github.com/apache/cloudstack/blob/main/engine/storage/datamotion/src/main/java/org/apache/cloudstack/storage/motion/StorageSystemDataMotionStrategy.java#L1787
https://github.com/apache/cloudstack/blob/main/engine/storage/datamotion/src/main/java/org/apache/cloudstack/storage/motion/StorageSystemDataMotionStrategy.java#L1978-L1980
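
For illustration, here is a minimal sketch of the skip-if-already-on-destination check suggested above; the types are simplified stand-ins, not the real StorageSystemDataMotionStrategy code:

import java.util.HashMap;
import java.util.Map;

// Illustrative model: keep the mappings as-is, but let the migration step ignore volumes
// whose current pool already is the requested destination pool.
public class SkipInPlaceVolumesSketch {

    record Volume(String name, long currentPoolId) {}
    record StoragePool(String name, long id) {}

    static Map<Volume, StoragePool> volumesThatActuallyMove(Map<Volume, StoragePool> requested) {
        Map<Volume, StoragePool> toMigrate = new HashMap<>();
        for (Map.Entry<Volume, StoragePool> entry : requested.entrySet()) {
            Volume volume = entry.getKey();
            StoragePool destination = entry.getValue();
            if (volume.currentPoolId() == destination.id()) {
                // Already on the destination pool: nothing to copy, leave it attached as-is.
                continue;
            }
            toMigrate.put(volume, destination);
        }
        return toMigrate;
    }

    public static void main(String[] args) {
        Map<Volume, StoragePool> requested = new HashMap<>();
        requested.put(new Volume("ROOT-1", 1L), new StoragePool("local-2", 2L)); // moves
        requested.put(new Volume("DATA-1", 3L), new StoragePool("nfs-1", 3L));   // skipped
        System.out.println(volumesThatActuallyMove(requested));
    }
}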

@shwstppr
Contributor

shwstppr commented Sep 7, 2021

@blueorangutan package

@blueorangutan

@shwstppr a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1145

@shwstppr
Contributor

shwstppr commented Sep 7, 2021

Tested offline migration on a VMware env, and the volume for which a mapping was not provided in the API call got detached from the VM in vCenter.

  • Migrated a stopped VM with a ROOT and a DATA disk within the cluster.
  • Both volumes were on the same primary store (ps1) before migration.
  • Migrated with the migrateVirtualMachineWithVolume API by passing a mapping for the ROOT volume to primary store ps1-1.

@GabrielBrascher
Member Author

Thanks for reviewing and testing @shwstppr @nvazquez!

I will check different approaches as this could impact VMware migrations.
When another approach is ready for review I will ping you.

@GabrielBrascher GabrielBrascher marked this pull request as draft September 7, 2021 11:58
@GabrielBrascher GabrielBrascher force-pushed the kvm-local-and-datadisk-migrate branch 4 times, most recently from 2925a11 to 9dfd1ec Compare September 8, 2021 20:38
@GabrielBrascher GabrielBrascher marked this pull request as ready for review September 8, 2021 20:45
@GabrielBrascher
Member Author

@shwstppr @nvazquez PR is ready for review.

@GabrielBrascher
Member Author

I've found some issues with Local (root) + NFS (data disk) attached; with Local (root) + RBD (data disk) everything works fine.

I will check the workflow and report back with feedback from the NFS tests.

@rohityadavcloud
Member

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖️ el7 ✖️ el8 ✖️ debian ✖️ suse15. SL-JID 1209

@nvazquez
Contributor

@blueorangutan package

@blueorangutan

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 1230

@GabrielBrascher
Member Author

I am running a couple more tests. However, considering that the 4.16 cut will happen soon, I might reduce the scope to fix only RBD (data disk) + Local (root) and then block migrations with NFS (data disk) + Local (root).
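
As a rough illustration of what "blocking" that combination could mean, a hypothetical guard sketch (stand-in types and names, not actual CloudStack code) that rejects the migration up front when a local root disk is combined with an unmapped NFS data disk:

import java.util.List;

public class MigrationGuardSketch {

    enum PoolType { LOCAL, NFS, RBD }

    record Volume(String name, boolean isRoot, PoolType poolType, boolean userMapped) {}

    static void validate(List<Volume> vmVolumes) {
        boolean hasLocalRoot = vmVolumes.stream()
                .anyMatch(v -> v.isRoot() && v.poolType() == PoolType.LOCAL);
        boolean hasUnmappedNfsData = vmVolumes.stream()
                .anyMatch(v -> !v.isRoot() && v.poolType() == PoolType.NFS && !v.userMapped());
        if (hasLocalRoot && hasUnmappedNfsData) {
            // Fail fast instead of attempting a migration known to break.
            throw new UnsupportedOperationException(
                    "Migrating a VM with a local root disk and an unmapped NFS data disk is not supported");
        }
    }

    public static void main(String[] args) {
        // Local root + RBD data disk: allowed, no exception thrown.
        validate(List.of(
                new Volume("ROOT-1", true, PoolType.LOCAL, true),
                new Volume("DATA-1", false, PoolType.RBD, false)));
    }
}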

@sureshanaparti
Contributor

@blueorangutan test

@blueorangutan

@sureshanaparti a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@apache apache deleted a comment from blueorangutan Oct 6, 2021
@apache apache deleted a comment from blueorangutan Oct 6, 2021
@blueorangutan

@nvazquez a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@sureshanaparti
Contributor

@blueorangutan test matrix

@blueorangutan

@sureshanaparti a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2340)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33114 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5410-t2340-kvm-centos7.zip
Smoke tests completed. 91 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan

Trillian test result (tid-2339)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33255 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5410-t2339-xenserver-71.zip
Smoke tests completed. 91 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan

Trillian test result (tid-2341)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37299 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5410-t2341-vmware-65u2.zip
Smoke tests completed. 90 look OK, 1 have errors
Only failed tests results shown below:

Test                        Result   Time (s)   Test File
test_create_pvlan_network  Error    0.03       test_pvlan.py

@nvazquez
Contributor

nvazquez commented Oct 7, 2021

@blueorangutan test centos7 vmware-67u3

@blueorangutan

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + vmware-67u3) has been kicked to run smoke tests

@blueorangutan

Trillian Build Failed (tid-2348)

@nvazquez
Contributor

nvazquez commented Oct 7, 2021

@blueorangutan test centos7 vmware-67u3

@blueorangutan

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + vmware-67u3) has been kicked to run smoke tests

@apache apache deleted a comment from blueorangutan Oct 7, 2021
@blueorangutan

Trillian test result (tid-2349)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 36103 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5410-t2349-vmware-67u3.zip
Smoke tests completed. 91 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

Contributor

@nvazquez nvazquez left a comment

LGTM

Contributor

@GutoVeronezi GutoVeronezi left a comment

LGTM

@rohityadavcloud
Member

@sureshanaparti @vladimirpetrov @borisstoyanov @nvazquez do we need any further manual testing before merging this?

@sureshanaparti
Contributor

@sureshanaparti @vladimirpetrov @borisstoyanov @nvazquez do we need any further manual testing before merging this?

@rhtyd yes, it requires some manual tests for VMware migration cases.

@sureshanaparti
Contributor

sureshanaparti commented Oct 8, 2021

@GabrielBrascher @nvazquez @rhtyd Tested the following VM migration with volume(s) scenarios, with VMware 6.7.0. All passed.

  1. VM With local Root disk
    (i) Without volumes mapping (in same cluster)
    (ii) With ROOT volume mapping (in same cluster)
    (iii) Without volumes mapping (across different clusters, in the same pod)

  2. VM With local Root disk + Shared (NFS) DATA disk
    (i) Without volumes mapping (in same cluster)
    (ii) With ROOT volume mapping (in same cluster)
    (iii) With DATA volume mapping (in same cluster)
    (iv) With ROOT and DATA volumes mapping (in same cluster)
    (v) Without volumes mapping (across different clusters, in the same pod)
    (vi) With ROOT volume mapping (across different clusters, in the same pod)
    (vii) With ROOT and DATA volumes mapping (across different clusters, in the same pod)

Noticed issue (#5558) with storage type/offering for the volume, not related to this PR.
