Conversation

@BryanMLima
Contributor

Description

This PR addresses an issue when live scaling a VM on the KVM hypervisor. When the VM's CPU cores or CPU speed are increased, the cpu_shares (priority) of the VM is not updated accordingly. To work around that, it was necessary to manually restart the VM or to run the command: virsh schedinfo --domain <vm_internal_name> --live cpu_shares=<cpu_shares_value>. This PR changes that behavior so the cpu_shares are updated automatically when the CPU cores or speed are increased. It is worth mentioning that the previous behavior caused steal time when VMs competed for the CPU after a dynamic scale process.

Steps to reproduce:

  • Host

    • 6 vCPUs x 2.39 GHz
    • 16 GB RAM
  • Custom service offering

    • 1 to 6 vCPUs
    • 2.35 GHz
  • VM (deploy config)

    • 1 vCPU
    • 1024 MB RAM

The cpu_shares of a VM is calculated as follows:
cpu_shares = number of vCPUs × vCPU frequency in MHz

Using the command virsh schedinfo --domain <internal_vm_name>, the cpu_shares displayed was 2350, which is expected. However, after live scaling the VM to 4 vCPUs, which should result in 9400 shares, the value remained at the initial 2350 shares. Only after restarting the VM were the 9400 shares applied, whereas the correct value should be applied without requiring a restart.
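
For illustration, a minimal sketch of the calculation above (the class and method names are hypothetical, not the exact CloudStack code):

// Hypothetical helper illustrating the cpu_shares formula described above.
class CpuSharesExample {
    static int calculateCpuShares(int vCpus, int speedMHz) {
        return vCpus * speedMHz;
    }

    public static void main(String[] args) {
        // Scenario from the description: 1 vCPU x 2350 MHz = 2350 shares;
        // after live scaling to 4 vCPUs: 4 x 2350 MHz = 9400 shares.
        System.out.println(calculateCpuShares(1, 2350)); // 2350
        System.out.println(calculateCpuShares(4, 2350)); // 9400
    }
}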

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

I used the same configuration and steps described above to reproduce the bug. With this change, live scaling a VM returns the correct cpu_shares without the need to restart it. Therefore, when the VM with 2350 shares is live scaled to 4 vCPUs, the command mentioned above returns the correct value of 9400.

Moreover, I created unit tests for the affected methods.
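
To give an idea of the shape of those tests, a minimal JUnit sketch exercising the hypothetical helper from the description above (the real tests in this PR target the actual methods changed, so take this only as an illustration):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class CpuSharesCalculationTest {

    // Matches the 2350 -> 9400 scenario described earlier: shares grow
    // proportionally with the vCPU count.
    @Test
    public void sharesGrowWithVcpuCount() {
        assertEquals(2350, CpuSharesExample.calculateCpuShares(1, 2350));
        assertEquals(9400, CpuSharesExample.calculateCpuShares(4, 2350));
    }
}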

@DaanHoogland DaanHoogland changed the title Update VM priority (cpu_chares) when live scaling it Update VM priority (cpu_shares) when live scaling it Feb 22, 2022
Contributor

@GutoVeronezi GutoVeronezi left a comment

CLGTM, tested manually.

I've pointed out just some writing fixes.

Contributor

@DaanHoogland DaanHoogland left a comment

CLGTM, but some remarks and questions.

protected void updateCpuShares(Domain dm, int newCpuShares) throws LibvirtException {
    int oldCpuShares = LibvirtComputingResource.getCpuShares(dm);

    if (oldCpuShares < newCpuShares) {
Contributor

Only up, never down? Is this like nice for users, or can the root user force something?

Contributor Author

In ACS, the live scale process does not allow reducing the resources of a running VM. Therefore, scaling can only occur upwards, and the cpu_shares is calculated as follows:
cpu_shares = number of vCPUs × vCPU frequency in MHz.
This if (the conditional) checks whether the vCPUs or CPU speed is higher than in the current compute offering, as it makes no sense to decrease the priority of a VM that is having its resources increased. In other words, this if checks whether it is a memory-only scale, in which case the VM's priority on the host should not be changed.
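
For reference, a minimal sketch of how such an update could be applied to a running domain through the libvirt Java bindings, including the guard discussed above (the class name, method shape, and libvirt-java usage here are illustrative assumptions, not necessarily the exact code in this PR):

import org.libvirt.Domain;
import org.libvirt.LibvirtException;
import org.libvirt.SchedUlongParameter;

class CpuSharesUpdateSketch {
    // Applies the new cpu_shares to the live domain, but only when the new value
    // is higher, i.e. when the vCPU count or CPU speed actually increased.
    static void updateCpuShares(Domain dm, int oldCpuShares, int newCpuShares) throws LibvirtException {
        if (oldCpuShares >= newCpuShares) {
            return; // memory-only scale: keep the current priority
        }
        SchedUlongParameter shares = new SchedUlongParameter();
        shares.field = "cpu_shares";
        shares.value = newCpuShares;
        // Roughly what "virsh schedinfo --domain <vm> cpu_shares=<value>" does.
        dm.setSchedulerParameters(new SchedUlongParameter[] { shares });
    }
}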

Member

It looks like the new speed >= current speed and the new CPU cores >= current CPU cores,
so the new cpu_shares is always equal to or bigger than the old cpu_shares.
The check may be unnecessary, but looks ok.

@weizhouapache weizhouapache added this to the 4.17.0.0 milestone Feb 23, 2022
BryanMLima and others added 2 commits February 23, 2022 11:50
Co-authored-by: dahn <[email protected]>
Co-authored-by: Daniel Augusto Veronezi Salvador <[email protected]>
Contributor

@sureshanaparti sureshanaparti left a comment

code LGTM

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✖️ el8 ✔️ debian ✔️ suse15. SL-JID 2714

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Member

@weizhouapache weizhouapache left a comment

code lgtm

@blueorangutan

Trillian test result (tid-3440)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34779 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6031-t3440-kvm-centos7.zip
Smoke tests completed. 91 look OK, 1 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_disable_oobm_ha_state_ineligible Error 1512.05 test_hostha_kvm.py

@BryanMLima
Contributor Author

@blueorangutan package

@blueorangutan

@BryanMLima a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2730

@BryanMLima
Contributor Author

The error occurred while trying to download a template, so it has no relation to the code itself.

@weizhouapache
Member

@blueorangutan package

@blueorangutan

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2743

@weizhouapache
Member

@blueorangutan test

@weizhouapache
Member

@blueorangutan test ubuntu20 kvm-ubuntu20

@blueorangutan

@weizhouapache a Trillian-Jenkins test job (ubuntu20 mgmt + kvm-ubuntu20) has been kicked to run smoke tests

@weizhouapache
Member

@blueorangutan test

@blueorangutan

@weizhouapache a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3456)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32137 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6031-t3456-kvm-centos7.zip
Smoke tests completed. 92 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan

Trillian test result (tid-3452)
Environment: kvm-ubuntu20 (x2), Advanced Networking with Mgmt server u20
Total time taken: 36667 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6031-t3452-kvm-ubuntu20.zip
Smoke tests completed. 88 look OK, 4 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_add_primary_storage_disabled_host Error 0.71 test_primary_storage.py
test_01_primary_storage_nfs Error 0.18 test_primary_storage.py
ContextSuite context=TestStorageTags>:setup Error 0.33 test_primary_storage.py
test_01_migrate_VM_and_root_volume Error 75.34 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 50.01 test_vm_life_cycle.py
test_01_secure_vm_migration Error 161.85 test_vm_life_cycle.py
test_02_unsecure_vm_migration Error 283.74 test_vm_life_cycle.py
test_03_secured_to_nonsecured_vm_migration Error 149.28 test_vm_life_cycle.py
test_08_migrate_vm Error 44.17 test_vm_life_cycle.py
test_02_list_snapshots_with_removed_data_store Error 9.61 test_snapshots.py
test_hostha_kvm_host_degraded Error 696.87 test_hostha_kvm.py
test_hostha_kvm_host_fencing Error 689.52 test_hostha_kvm.py

@weizhouapache
Member

@blueorangutan test ubuntu20 kvm-ubuntu20

@blueorangutan

@weizhouapache a Trillian-Jenkins test job (ubuntu20 mgmt + kvm-ubuntu20) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3466)
Environment: kvm-ubuntu20 (x2), Advanced Networking with Mgmt server u20
Total time taken: 41641 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6031-t3466-kvm-ubuntu20.zip
Smoke tests completed. 89 look OK, 3 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_03_vpc_internallb_haproxy_stats_on_all_interfaces Error 0.10 test_internal_lb.py
test_01_migrate_VM_and_root_volume Error 79.33 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 47.82 test_vm_life_cycle.py
test_10_attachAndDetach_iso Error 12.54 test_vm_life_cycle.py
test_hostha_enable_ha_when_host_disabled Error 1.85 test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance Error 303.01 test_hostha_kvm.py

@BryanMLima
Contributor Author

@weizhouapache Could you run the tests again, since I am not in the whitelist? I verified the errors and, apparently, they have no relation to the code I wrote.

@weizhouapache
Member

@blueorangutan test ubuntu20 kvm-ubuntu20

@blueorangutan

@weizhouapache a Trillian-Jenkins test job (ubuntu20 mgmt + kvm-ubuntu20) has been kicked to run smoke tests

@blueorangutan

Trillian Build Failed (tid-3519)

@weizhouapache
Member

@blueorangutan test ubuntu20 kvm-ubuntu20

@blueorangutan

@weizhouapache a Trillian-Jenkins test job (ubuntu20 mgmt + kvm-ubuntu20) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3520)
Environment: kvm-ubuntu20 (x2), Advanced Networking with Mgmt server u20
Total time taken: 36070 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6031-t3520-kvm-ubuntu20.zip
Smoke tests completed. 91 look OK, 1 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_migrate_VM_and_root_volume Error 70.02 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 49.82 test_vm_life_cycle.py

@apache apache deleted a comment from blueorangutan Mar 8, 2022
@apache apache deleted a comment from blueorangutan Mar 8, 2022
@apache apache deleted a comment from blueorangutan Mar 9, 2022
@apache apache deleted a comment from blueorangutan Mar 9, 2022
@weizhouapache
Member

@blueorangutan package

@blueorangutan

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2828

@weizhouapache
Member

@blueorangutan test ubuntu20 kvm-ubuntu20

@blueorangutan

@weizhouapache a Trillian-Jenkins test job (ubuntu20 mgmt + kvm-ubuntu20) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3559)
Environment: kvm-ubuntu20 (x2), Advanced Networking with Mgmt server u20
Total time taken: 34697 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6031-t3559-kvm-ubuntu20.zip
Smoke tests completed. 91 look OK, 1 have errors
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_migrate_VM_and_root_volume Error 88.75 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 50.75 test_vm_life_cycle.py
test_10_attachAndDetach_iso Error 12.54 test_vm_life_cycle.py

@BryanMLima
Contributor Author

The recurrent errors test_01_migrate_VM_and_root_volume and test_02_migrate_VM_with_two_data_disks seem to be related to the Ubuntu OS. I tested both situations in my lab, and the commands ran smoothly. Moreover, my PR is not related to these commands. @DaanHoogland, could you look into this? The tests on CentOS didn't show any errors either.

@Pearl1594
Contributor

Ran the above failed tests on a 4.16.1 env - and the same behavior was observed. The test failures mentioned aren't related to this PR - the specific issue is:

 Failed to migrated vm VM instance {id: "14", name: "i-9-14-VM", uuid: "a1ba2a16-2e59-43c9-bd43-d55309f51a0b", type="User"} along with its volumes. Can't find strategy to move data. Source Host: ref-trl-2732-k-Mu20-pearl-dsilva-kvm1, Destination Host: ref-trl-2732-k-Mu20-pearl-dsilva-kvm2, Volume UUIDs: 0b96a192-0ecc-4de3-84f6-6090230cd4a0

@weizhouapache
Member

Thanks @Pearl1594 for the investigation.

Let's merge this PR @nvazquez

@weizhouapache
Member

I have created an issue for tracking: #6114
cc @nvazquez @Pearl1594

@nvazquez
Contributor

Merging based on approvals and test results - the failing tests are only for Ubuntu 20 and are being tracked in #6114

@nvazquez nvazquez merged commit afdc73f into apache:main Mar 14, 2022